[Bioc-sig-seq] Finding Restriction sites and gaps

Charles C. Berry Sun, 24 Aug 2008 16:42:59 -0700

In general, I am trying to get a handle on the capabilities of Biostrings and related packages.

My particular question is whether there is a slick (and efficient) way to construct a single object - along the lines of the XStringViews class - that contains info about restriction sites and N-gaps using functions in Biostrings or other BioC packages.


Some background:

I am considering retooling a pipeline I have for processing collections of retroviral integration sites, and I am weighing using Bioconductor packages in the new version.

An early step in this pipeline involves finding the nearest restriction site given the direction from which a fragment of DNA was sequenced (which depends on the orientation/strand of the retroviral construct) and given a genomic location (Chromo, Position).

Currently this is done by putting together a list of the locations of restriction sites for an enzyme (by piping the FASTA to a simple filter written in C), importing the list of gaps in the FASTA from UCSC's annotation database, then doing lookups (a la findInterval) on the lists and returning info on whether the upstream site could be found (i.e. there was no intervening gap in the FASTA) and where it was.
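To make the lookup concrete, here is a minimal sketch in base R of the kind of findInterval-based search described above. The coordinates, the `gaps` data frame, and the function name `nearest_upstream_site` are all made up for illustration; it only searches toward lower coordinates (one strand) and assumes the site starts and gap intervals are sorted and the gaps do not overlap.

```r
# Hypothetical toy inputs for one chromosome: sorted restriction-site
# starts and a table of N-gap intervals (both made up for illustration).
site_starts <- c(100L, 450L, 900L, 1500L)
gaps <- data.frame(start = 500L, end = 700L)

# For each position, find the closest site at or below it and report
# whether a gap lies between that site and the position.
nearest_upstream_site <- function(pos, site_starts, gaps) {
  idx  <- findInterval(pos, site_starts)          # 0 means no site at or below pos
  site <- ifelse(idx > 0, site_starts[pmax(idx, 1L)], NA_integer_)

  g       <- findInterval(pos, gaps$start)        # last gap starting at or below pos
  gap_end <- ifelse(g > 0, gaps$end[pmax(g, 1L)], NA_integer_)

  # The site is unusable if that gap ends at or after the site,
  # i.e. the gap sits between the site and the query position.
  blocked <- !is.na(site) & !is.na(gap_end) & gap_end >= site

  data.frame(pos   = pos,
             site  = ifelse(blocked, NA, site),
             found = !is.na(site) & !blocked)
}

res <- nearest_upstream_site(c(50, 460, 800, 950, 1600), site_starts, gaps)
print(res)
```

The search in the other direction would be the mirror image (first site at or above the position, first gap ending at or above it).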

I see a way to do this with matchPattern(), but it requires a bunch of calls and some cleanup afterwards. Specifically, I would do something like this: I'd load up BSgenome.Hsapiens.UCSC.hg18, then use matchPattern("TTAA", Hsapiens[[ chromo.i ]] ) to find MseI restriction sites (for example), then matchPattern("AN", Hsapiens[[ chromo.i ]] ) to find all the gaps that follow an 'A', matchPattern("NA", Hsapiens[[ chromo.i ]] ) to find all gaps that are terminated by an 'A', and so on. Then using start() on each of the objects created, I can put together a data structure like the one I described in the previous paragraph and then use findInterval to do lookups.
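Roughly, the matchPattern() approach would look like the sketch below. To keep it small it runs on a short made-up DNAString rather than Hsapiens[[ chromo.i ]], and it relies on the default fixed=TRUE behaviour, under which an 'N' in the pattern matches only a literal 'N' in the subject. A gap at the very start or end of the sequence has no flanking base and would be missed by this trick.

```r
library(Biostrings)

# Made-up toy sequence standing in for one chromosome:
# two TTAA sites (positions 4 and 15) and one N-gap (positions 9-12).
subject <- DNAString("ACGTTAACNNNNGATTAAC")

# Restriction sites: start positions of every TTAA match.
site_starts <- start(matchPattern("TTAA", subject))

# Gap boundaries: one matchPattern() call per flanking base.
bases <- c("A", "C", "G", "T")
gap_starts <- sort(unlist(lapply(bases, function(b) {
  start(matchPattern(paste0(b, "N"), subject)) + 1L   # first N after base b
})))
gap_ends <- sort(unlist(lapply(bases, function(b) {
  start(matchPattern(paste0("N", b), subject))        # last N before base b
})))

# The cleanup step: one table of positions with a label saying which is which.
features <- data.frame(
  pos  = c(site_starts, gap_starts, gap_ends),
  type = rep(c("site", "gap_start", "gap_end"),
             c(length(site_starts), length(gap_starts), length(gap_ends)))
)
features <- features[order(features$pos), ]
print(features)
```

That is nine matchPattern() calls per chromosome plus the merging, which is the "bunch of calls and some cleanup" I would like to avoid.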

I'd be interested to know if there is a more _direct_ way to construct the list of gaps or a unified list of sites and gaps with an annotation to say which is which.

Also, I would be interested to know if there is an efficient way to do the lookup directly (for a given Chromo, Position, and Strand return the restriction site) without the intermediate step of constructing the list of all restriction sites. Typically, runs involve from the low thousands to tens of thousands of retroviral integration events (read: genomic loci), but with higher throughput sequencing just around the corner, I expect this will rise. So, a one-liner that takes a couple of seconds to run for each event is impractical.


Can I count on start( my.xstringsviews.object ) being unique and in order?


TIA,

Chuck


Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:[EMAIL PROTECTED]                  UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
