Re: [Bioc-devel] plyranges group_by

Stuart Lee Wed, 16 Oct 2019 18:02:59 -0700

Currently, the way grouping indices are generated is pretty slow if you're 
grouping rowwise. Michael's suggestion of using selfmatch should speed 
things up a bit. What are you planning to do after grouping? I've found there's 
usually a way to do things without rowwise grouping, but it really depends on 
what you're after. Re your other issue, would you mind filing it as a GitHub 
issue?
—
Stuart Lee
Visiting PhD Student - Ritchie Lab




On 16 Oct 2019, at 22:54, Michael Lawrence 
<lawrence.mich...@gene.com> wrote:

Just a note that in this particular case, selfmatch(annotatedsrf) would be a 
fast way to generate a grouping vector, like plyranges::group_by(annotatedsrf, 
selfmatch(annotatedsrf)).
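
A minimal sketch of the selfmatch approach on toy data, rather than the SRF ranges from the thread. The toy GRanges, the `grp` variable and the `range_id` column are made up for illustration; storing the index in a column with mutate() before calling group_by() is one way to apply the suggestion, not necessarily how plyranges handles it internally.

```r
library(GenomicRanges)
library(plyranges)

# Toy stand-in for annotatedsrf: rows 1 and 2 are the same range,
# annotated with two different genes (hypothetical data).
gr <- GRanges(
    seqnames = c("chr1", "chr1", "chr2", "chr1"),
    ranges   = IRanges(start = c(100, 100, 50, 300),
                       end   = c(200, 200, 80, 400)),
    strand   = c("+", "+", "-", "+"),
    gene_id  = c("geneA", "geneB", "geneC", NA)
)

# selfmatch() gives, for each range, the index of its first identical
# occurrence (same seqnames, start, end and strand).
grp <- selfmatch(gr)

# Store the index as a metadata column and group on that single column
# instead of on seqnames, start, end and strand.
grouped <- gr %>%
    mutate(range_id = grp) %>%
    group_by(range_id)
```

Here `grp` is `1 1 3 4`, so the two identical chr1 ranges fall into one group and every other range forms its own group.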

Michael

On Wed, Oct 16, 2019 at 2:48 AM Bhagwat, Aditya 
<aditya.bhag...@mpi-bn.mpg.de> wrote:
Hi Stuart, Michael,

Your plyranges package is really cool - now I am using it for left joining 
GRanges (I am facing a minor issue there, see 
https://support.bioconductor.org/p/125623/, but that is not the topic of 
this email - I have been asked by Lori not to double-post :-)).

This email is about the plyranges functionality for grouping GRanges.
That is cool too, but I found that it does not perform well for large numbers 
of ranges.
My R session hangs when I do:

# Import the SRF ranges and annotate them with overlapping mm10 genes
bedfile <- paste0('https://gitlab.gwdg.de/loosolab/software/multicrispr/wikis',
                  '/uploads/a51e98516c1e6b71441f5b5a5f741fa1/SRF.bed')
srfranges <- rtracklayer::import.bed(bedfile, genome = 'mm10')
txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene::TxDb.Mmusculus.UCSC.mm10.ensGene
generanges <- GenomicFeatures::genes(txdb)
annotatedsrf <- plyranges::join_overlap_left(srfranges, generanges)
# Grouping by range coordinates is the step that hangs
plyranges::group_by(annotatedsrf, seqnames, start, end, strand)

For my purposes, I worked around it by performing a groupby in data.table:

data.table::as.data.table(annotatedsrf)[
    !is.na(gene_id),
    gene_id := paste0(gene_id, collapse = ';'),
    by = c('seqnames', 'start', 'end', 'strand')]

I was wondering, in general, whether it would be useful to have a 
data.table-based backend for plyranges::group_by(),
and whether all of this is actually a non-issue caused by my improper use of 
plyranges::group_by.

Thank you for your feedback :-)

Aditya




--
Michael Lawrence
Scientist, Bioinformatics and Computational Biology
Genentech, A Member of the Roche Group
Office +1 (650) 225-7760
micha...@gene.com


_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
