I have a set of genes (about 50000) from the whole genome, which I have
read into a GenomicRanges (GRanges) object.
Each range (gene) can be seen as an observation with three columns:
chromosome, start and end, like this:
      seqnames start end width strand
gene1     chr1     1   5     5      +
gene2     chr1    10  15     6      +
gene3     chr1    12  17     6      +
gene4     chr1    20  25     6      +
gene5     chr1    30  40    11      +
I am wondering whether there is an efficient way to find the *overlapping,
upstream and downstream genes for each gene in the GRanges*.
For example, assuming all_genes_gr is a GRanges object of ~50000 genes, the
result I want looks like this:
gene_name  upstream_gene  downstream_gene  overlapped_gene
gene1      NA             gene2            NA
gene2      gene1          gene4            gene3
gene3      gene1          gene4            gene2
gene4      gene3          gene5            NA
Currently, the strategy I use is like this:
library(GenomicRanges)

find_overlapped_gene <- function(idx, all_genes_gr) {
  # cat(idx, "\n")
  curr_gene <- all_genes_gr[idx]
  other_genes <- all_genes_gr[-idx]
  # number of other genes that overlap the current gene
  n <- countOverlaps(curr_gene, other_genes)
  # the other genes that overlap the current gene
  gene <- subsetByOverlaps(other_genes, curr_gene)
  list(n, gene)
}

system.time(lapply(1:100, function(idx) find_overlapped_gene(idx, all_genes_gr)))
However, for 100 genes this takes nearly 8 s according to system.time().
That means that for 50000 genes it would take roughly an hour just to find
the overlapping genes.
Is there any algorithm or strategy to do this efficiently, perhaps
50000 genes in ~10 min or even less?
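
One direction that might avoid the per-gene loop is to call the GenomicRanges
functions once on the whole object instead of once per gene. The following is
only a sketch on the toy data above, not a tested solution for the real
50000-gene set. It assumes findOverlaps(), follow() and precede() behave as
documented in GenomicRanges. The toy all_genes_gr and the variable names
(hits, overlapped, up_idx, down_idx, result) are made up for illustration.

```r
library(GenomicRanges)

# Toy data copied from the example table (stand-in for the real all_genes_gr)
all_genes_gr <- GRanges(
  seqnames = "chr1",
  ranges   = IRanges(start = c(1, 10, 12, 20, 30),
                     end   = c(5, 15, 17, 25, 40)),
  strand   = "+"
)
names(all_genes_gr) <- paste0("gene", 1:5)
gene_names <- names(all_genes_gr)

# One call finds all pairwise overlaps; drop each gene's hit against itself
hits <- findOverlaps(all_genes_gr, all_genes_gr)
hits <- hits[queryHits(hits) != subjectHits(hits)]

# Collapse the names of the overlapping genes per query gene (NA if none)
overlapped <- rep(NA_character_, length(all_genes_gr))
ov <- tapply(gene_names[subjectHits(hits)], queryHits(hits),
             paste, collapse = ",")
overlapped[as.integer(names(ov))] <- ov

# follow():  nearest non-overlapping gene that each gene comes after (upstream)
# precede(): nearest non-overlapping gene that each gene comes before (downstream)
up_idx   <- follow(all_genes_gr, all_genes_gr)
down_idx <- precede(all_genes_gr, all_genes_gr)

result <- data.frame(
  gene_name       = gene_names,
  upstream_gene   = gene_names[up_idx],    # an NA index gives an NA name
  downstream_gene = gene_names[down_idx],
  overlapped_gene = overlapped,
  stringsAsFactors = FALSE
)
print(result)
```

All three functions take strand into account by default, so for genes on the
"-" strand upstream and downstream are reversed relative to the coordinates,
and genes on opposite strands are not reported as overlapping unless
ignore.strand = TRUE is passed.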
Yao He