I replied on the support site. Let's move the discussion there.

On Thu, Oct 17, 2019 at 1:24 AM Bhagwat, Aditya <aditya.bhag...@mpi-bn.mpg.de> wrote:

> Thank you Stuart and Michael for your feedback.
>
> Stuart, in response to your request for more context regarding my use
> case, I have updated my recent BioC support post
> <https://support.bioconductor.org/p/125623/>, now providing all use-case
> details.
>
> Michael, I haven't tried selfmatch yet, but Stuart's reply seems to suggest
> that it would not get the data.table performance (which is literally
> instantaneous).
>
> As a general question, do you think it would be useful to add a
> data.table-based split-apply-combine functionality to plyranges (such that
> end user operations remain on GRanges-only)? I wouldn't mind writing a
> function to do that (in github), but first need your feedback as to whether
> you think that would be useful :-)
>
> Aditya
>
>
> ------------------------------
> *From:* Stuart Lee [le...@wehi.edu.au]
> *Sent:* Thursday, October 17, 2019 3:01 AM
> *To:* Michael Lawrence
> *Cc:* Bhagwat, Aditya; bioc-devel@r-project.org
> *Subject:* Re: plyranges group_by
>
> Currently, the way grouping indices are generated is pretty slow if you’re
> doing stuff rowwise. Michael’s suggestion for using selfmatch should speed
> things up a bit. What are you planning to do after grouping? I’ve found
> there’s usually a way to do stuff without rowwise grouping, but it really
> depends on what you’re after. Re your other issue, would you mind putting
> it on as a GitHub issue?
> --
> Stuart Lee
> Visiting PhD Student - Ritchie Lab
>
>
>
> On 16 Oct 2019, at 22:54, Michael Lawrence <lawrence.mich...@gene.com>
> wrote:
>
> Just a note that in this particular case, selfmatch(annotatedsrf) would
> be a fast way to generate a grouping vector, like
> plyranges::group_by(annotatedsrf, selfmatch(annotatedsrf)).
>
> Michael
>
> On Wed, Oct 16, 2019 at 2:48 AM Bhagwat, Aditya <aditya.bhag...@mpi-bn.mpg.de> wrote:
>
>> Hi Stuart, Michael,
>>
>> Your plyranges package is really cool - now I am using it for left
>> joining GRanges (I am facing a minor issue there
>> <https://support.bioconductor.org/p/125623/>, but that is not the topic
>> of this email - I have been asked by Lori not to double-post :-)).
>>
>> This email is about the plyranges functionality for grouping GRanges.
>> That is cool, but I found it to be not so performant for large numbers of
>> ranges.
>> My R session hangs when I do:
>>
>>     bedfile <- paste0(
>>         'https://gitlab.gwdg.de/loosolab/software/multicrispr/wikis',
>>         '/uploads/a51e98516c1e6b71441f5b5a5f741fa1/SRF.bed')
>>     srfranges <- rtracklayer::import.bed(bedfile, genome = 'mm10')
>>     txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene::TxDb.Mmusculus.UCSC.mm10.ensGene
>>     generanges <- GenomicFeatures::genes(txdb)
>>     annotatedsrf <- plyranges::join_overlap_left(srfranges, generanges)
>>     plyranges::group_by(annotatedsrf, seqnames, start, end, strand)
>>
>> For my purposes, I worked around it by performing a group-by in data.table:
>>
>>     data.table::as.data.table(annotatedsrf)[
>>         !is.na(gene_id),
>>         gene_id := paste0(gene_id, collapse = ';'),
>>         by = c('seqnames', 'start', 'end', 'strand')]
>>
>> And was wondering, in general, whether it would be useful to have a
>> data.table-based backend for plyranges::group_by().
>> And, whether all of this is actually a non-issue due to my improper use
>> of plyranges::group_by.
>>
>> Thank you for your feedback :-)
>>
>> Aditya
>>
>
> --
> Michael Lawrence
> Scientist, Bioinformatics and Computational Biology
> Genentech, A Member of the Roche Group
> Office +1 (650) 225-7760
> micha...@gene.com

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
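
The two grouping approaches discussed in this thread can be tried side by side on a toy example. This is only a minimal sketch, not code from any of the posters: the three-range `gr` object and its `gene_id` values are made up for illustration, and it assumes the GenomicRanges and data.table packages are installed. It first builds a grouping vector with selfmatch(), as Michael suggests, and then collapses gene_id per identical range with data.table, along the lines of Aditya's workaround.

```r
library(GenomicRanges)  # depends on S4Vectors, which provides selfmatch()
library(data.table)

# Hypothetical toy data: two identical ranges annotated with different genes,
# plus one range that has no overlapping gene.
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges   = IRanges(start = c(100, 100, 500), end = c(200, 200, 600)),
  strand   = c("+", "+", "-"),
  gene_id  = c("geneA", "geneB", NA)
)

# Approach 1: selfmatch() maps each range to the index of its first identical
# occurrence, so identical ranges end up sharing one group id.
grp <- S4Vectors::selfmatch(gr)

# Approach 2: group by coordinates in data.table and collapse the gene ids.
dt <- as.data.table(as.data.frame(gr))
dt[!is.na(gene_id),
   gene_id := paste0(gene_id, collapse = ";"),
   by = c("seqnames", "start", "end", "strand")]

# Drop the now-duplicated rows and convert back to a GRanges object.
collapsed <- makeGRangesFromDataFrame(as.data.frame(unique(dt)),
                                      keep.extra.columns = TRUE)
```

The round trip through a plain data frame keeps the user-facing objects as GRanges, which is the design Aditya asks about; whether plyranges should do this internally is the open question of the thread.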