Thank you Michael. I got Rbowtie working and am now wrapping it in functions for use within multicrispr. I noticed that in QuasR you actually create a package with Bowtie indices, which you then reuse in later runs. Interesting workflow; I think I will make use of that functionality.
Thank you Herve.

Yes, parallelizing would speed things up. I use `vcountPDict` because I want to do the off-target analysis for a set of 23-bp Cas9 sites. I assumed vcountPDict would be more efficient than looping over the sites, but perhaps only marginally so: I noticed there is an sapply underlying vcountPDict. Is there a BSgenome way to parallelize, such as a parallel bsapply?

As for Rsubread, I concluded that it really is limited to a small number of co-alignments, and so is not suited for off-target analysis.

Cheers,

Aditya

________________________________________
From: Pages, Herve [hpa...@fredhutch.org]
Sent: Friday, November 08, 2019 7:19 PM
To: Bhagwat, Aditya; firstname.lastname@example.org
Cc: Wei Shi (s...@wehi.edu.au); Michael Stadler (michael.stad...@fmi.ch)
Subject: Re: From Biostring matching to short read mapping

Hi Aditya,

Should not be too hard to parallelize. With some gotchas: using one worker per chromosome (which is the easy way to go) wouldn't be optimal because of the size differences between the chromosomes. So a better approach is to try to give each worker the same amount of work by splitting the set of chromosomes into groups of more or less equal sizes. The split can either preserve full chromosomes or break them into smaller pieces. The latter will allow using a lot more workers than the former.

I'll try to come up with some code that I'll share here.

BTW the *PDict() family in Biostrings is for finding the matches of a collection of patterns. You say you want to find "all genomic (mis)matches of a 23-bp candidate Cas9 sequence". Any reason you're not using vmatchPattern() (or vcountPattern()) for that?

Cheers,
H.

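A minimal sketch of the load-balancing idea described above, written in Python only because the logic is language-agnostic; it is not the code Hervé offers to share. The function names, the chromosome names and the sizes are all hypothetical. The first function keeps chromosomes whole and assigns them greedily, largest first, to the currently lightest group. The second breaks one chromosome into windows that overlap by (pattern length - 1) bases, on the assumption that each 23-bp match should then fall entirely inside exactly one window.

```python
import heapq


def balance_chromosomes(sizes, n_workers):
    """Greedily split whole chromosomes into n_workers groups of similar total size.

    sizes: dict mapping chromosome name -> length (hypothetical input format).
    Returns a list of n_workers lists of chromosome names.
    """
    # Min-heap of (total size assigned so far, group index).
    heap = [(0, i) for i in range(n_workers)]
    heapq.heapify(heap)
    groups = [[] for _ in range(n_workers)]
    # Largest chromosomes first, each going to the currently lightest group.
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        total, i = heapq.heappop(heap)
        groups[i].append(name)
        heapq.heappush(heap, (total + size, i))
    return groups


def chunk_chromosome(name, length, chunk_size, overlap):
    """Yield (name, start, end) windows, 1-based and inclusive.

    Consecutive windows share `overlap` bases; for patterns of width w,
    overlap = w - 1 is the assumption made here.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    start = 1
    while True:
        end = min(start + chunk_size - 1, length)
        yield (name, start, end)
        if end == length:
            break
        start = end - overlap + 1


if __name__ == "__main__":
    # Made-up sizes (in Mb), for illustration only.
    sizes = {"chrA": 250, "chrB": 240, "chrC": 200, "chrD": 190,
             "chrE": 180, "chrF": 60, "chrG": 50}
    for group in balance_chromosomes(sizes, n_workers=3):
        print(group, sum(sizes[name] for name in group))
    # Windows for 23-bp patterns (overlap of 22) on a made-up 100-base sequence.
    print(list(chunk_chromosome("chrG", length=100, chunk_size=40, overlap=22)))
```

With whole chromosomes the number of useful workers is capped by the number of chromosomes, and the largest chromosome sets a floor on the slowest worker's load; windows remove both limits at the cost of some bookkeeping at the boundaries.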
On 11/7/19 02:11, Bhagwat, Aditya wrote:
> Dear bioc-devel,
>
> multicrispr
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__gitlab.gwdg.de_loosolab_software_multicrispr&d=DwMFAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=B3ZdDoy-Ur4VIfZr68ORA8dplv90DuCcehJEWpkwWUU&s=UsUGsKc2SVyrBHDWnEJS0FVy1wIhoeq2WA4nlLmtmfo&e=>
> provides functions for CRISPR/Cas9 gRNA design (and is being prepared
> for BioC). One task involves finding all genomic (mis)matches of a 23-bp
> candidate Cas9 sequence. Currently this is done with
> `Biostrings::vcountPDict`, an approach that is successful, though not
> fast. An alternative would be to switch to short read mapping rather
> than (Bio)string matching, which involves a one-time indexing effort,
> but subsequent fast alignment.
>
> `Rsubread::align` seems to be limited to max. 16 `nBestLocations`,
> whereas I know from vcountPDict that some Cas9 candidates have hundreds
> of genomic matches.
>
> `QuasR::qAlign` (connecting to Bowtie) does not mention an upper limit
> on `maxHits`.
>
> Feedback request…
>
> Michael, would QuasR/(R)bowtie be a good approach to do this?
>
> Wei, did I overlook a way to do this with Rsubread?
>
> Herve, is there an elegant way to speed up vcountPDict (parallelize?)
>
> Thank you :-)
>
> Aditya
>

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319

_______________________________________________
Bioc-devel mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel