In general, I am trying to get a handle on the capabilities of BioStrings and related packages.

My particular question is whether there is a slick (and efficient) way to construct a single object - along the lines of the XStringsViews class - that contains info about restriction sites and N-gaps using functions in Biostrings or other bioC packages.

Some background:

I am considering retooling a pipeline I have for processing collections of retroviral integration sites, and I am weighing using bioConductor packages in the new version.

An early step in this pipeline involves finding the nearest restriction site given the direction from which a fragment of DNA was sequenced (which depends on the orientation/strand of the retroviral construct) and given a genomic location( Chromo, Position ).

Currently this is done by putting together a list of the locations of restriction sites for an enzyme (by piping the FASTA to a simple filter written in C) importing the list of gaps in the FASTA from UCSC's annotation database, then doing lookups (ala findInterval) on the lists and returning info on whether the upstream site could be found (i.e. there was no intervening gap in the FASTA) and where it was.

I see a way to do this with matchPattern(), but it requires a bunch of calls and some cleanup afterwards. Specifically, I would do this something like this: I'd load up BSgenome.Hsapiens.UCSC.hg18, then use matchPattern( "TTAA", Hsapiens[[ chromo.i ]] ) to find MSEI restriction sites (for example), then matchPattern("AN", Hsapiens[[ chromo.i ]] ) to find all the gaps that follow an 'A', matchPattern("NA", Hsapiens[[ chromo.i ]] ) to find all gaps that are terminated by an 'A', and so on. Then using start() on each of the objects created, I can put together a data structure like the one I described in the previous paragraph and then use findInterval to do lookups.

I'd be interested to know if there is a more _direct_ way to construct the list of gaps or a unified list of sites and gaps with an annotation to say which is which.

Also, I would be interested to know if there is an efficient way to do the lookup directly (for a given Chromo, Position, and Strand return the restriction site) without the intermediate step of constructing the list of all restriction sites. Typically, runs involve from the low thousands to tens of thousands of retroviral integration events (read: genomic loci), but with higher throughput sequencing just around the corner, I expect this will rise. So, a one-liner that takes a couple of seconds to run for each event is impractical.

Can I count on start( my.xstringsviews.object ) being unique and in order?


TIA,

Chuck


Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:[EMAIL PROTECTED]                  UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to