John’s description matches how I implemented something similar using the 
HTSJDK library, and it appears to work well.

I’m not sure whether HTSlib implements all the same methods as HTSJDK, but in 
HTSJDK I open the file twice.
The first file handle is used to scan the genome to find events that match my 
criteria; the second file handle is then used with the HTSJDK queryMate 
function to retrieve the mates of the reads of interest.

If HTSlib implements similar functions, you could take the same approach.

Actually, the best option might be to combine the two ideas. From a SAM record 
you can tell how far apart the mates are. If they are within, say, 500 bases, 
you could store the read in a hash table until you meet its mate, then write 
both out to disk.
This would let you estimate the rough amount of memory needed as a function of 
the allowed insert size range and the depth of coverage of your BAM file.
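
To make the hash table idea concrete, here is a minimal sketch in plain Java. 
It does not use HTSJDK: Read is a hypothetical stand-in for the few SAM record 
fields involved, and MateBuffer and estimateBufferedReads are names invented 
for this illustration. It assumes a coordinate-sorted file with exactly one 
primary alignment per mate; a real implementation would also need to skip 
unpaired, secondary and supplementary records. The memory estimate is only a 
rough upper bound (reads starting within one window), not a measured figure.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the SAM record fields this sketch needs. */
final class Read {
    final String name;      // query name, shared by both mates
    final int refIndex;     // reference (chromosome) index of this read
    final int pos;          // alignment start of this read
    final int mateRefIndex; // reference index of the mate
    final int matePos;      // alignment start of the mate

    Read(String name, int refIndex, int pos, int mateRefIndex, int matePos) {
        this.name = name;
        this.refIndex = refIndex;
        this.pos = pos;
        this.mateRefIndex = mateRefIndex;
        this.matePos = matePos;
    }
}

/** Holds reads, keyed by query name, until their mate shows up in the stream. */
final class MateBuffer {
    private final Map<String, Read> waiting = new HashMap<>();

    /**
     * Returns {earlier mate, this read} if the mate was already buffered;
     * otherwise stores the read and returns null.
     */
    Read[] offer(Read r) {
        Read mate = waiting.remove(r.name);
        if (mate != null) {
            return new Read[] { mate, r };
        }
        waiting.put(r.name, r);
        return null;
    }

    /** Number of reads currently waiting for their mate. */
    int size() {
        return waiting.size();
    }

    /**
     * Rough upper bound on buffered reads: at the given depth, about
     * depth * maxMateDistance / readLength reads start within one window.
     */
    static long estimateBufferedReads(double depth, int maxMateDistance, int readLength) {
        return (long) Math.ceil(depth * maxMateDistance / readLength);
    }
}
```

Multiplying the estimated read count by the approximate in-memory size of one 
record gives the rough memory budget for a chosen distance limit and coverage.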

For reads whose mates map outside the range stored in the hash table (outliers, 
or mates on different chromosomes), you could then use your second file handle 
with queryMate or its equivalent to retrieve the mate. This would reduce the 
amount of memory you need for the hash table and also reduce the amount of I/O 
you need.
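
The combined strategy could be sketched as follows, again in plain Java 
without HTSJDK. Aln, MateLookup and PairCollector are hypothetical names for 
this illustration; MateLookup stands in for the second, indexed file handle 
(for example a wrapper around queryMate). The sketch assumes a 
coordinate-sorted stream in which both mates are mapped, and it looks up a 
distant mate only from the mate that comes first in sort order, so each 
distant pair costs one random-access query.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the SAM record fields this sketch needs. */
final class Aln {
    final String name;
    final int ref, pos, mateRef, matePos;

    Aln(String name, int ref, int pos, int mateRef, int matePos) {
        this.name = name;
        this.ref = ref;
        this.pos = pos;
        this.mateRef = mateRef;
        this.matePos = matePos;
    }
}

/** Hypothetical abstraction over the second file handle (random access). */
interface MateLookup {
    Aln queryMate(Aln read);
}

final class PairCollector {
    private final int maxDistance;
    private final MateLookup lookup;
    private final Map<String, Aln> waiting = new HashMap<>();
    final List<Aln[]> pairs = new ArrayList<>();
    int randomAccessQueries = 0;

    PairCollector(int maxDistance, MateLookup lookup) {
        this.maxDistance = maxDistance;
        this.lookup = lookup;
    }

    /** Mate is on the same reference and close enough to buffer in memory. */
    boolean isNear(Aln r) {
        return r.ref == r.mateRef && Math.abs(r.matePos - r.pos) <= maxDistance;
    }

    /** True if this read sorts before its mate. */
    boolean isFirstInSortOrder(Aln r) {
        return r.ref < r.mateRef || (r.ref == r.mateRef && r.pos < r.matePos);
    }

    /** Feed reads in file order from the sequential (first) handle. */
    void accept(Aln r) {
        if (isNear(r)) {
            // Nearby mate: pair it through the hash table.
            Aln mate = waiting.remove(r.name);
            if (mate != null) {
                pairs.add(new Aln[] { mate, r });
            } else {
                waiting.put(r.name, r);
            }
        } else if (isFirstInSortOrder(r)) {
            // Distant mate: fetch it through the second handle.
            randomAccessQueries++;
            Aln mate = lookup.queryMate(r);
            if (mate != null) {
                pairs.add(new Aln[] { r, mate });
            }
        }
        // Otherwise this is the later mate of a distant pair, already emitted.
    }
}
```

With a limit such as 500 bases, only the outliers trigger a seek, so the 
randomAccessQueries counter gives a direct view of how much extra I/O a 
particular limit costs on a given file.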

Chad Harland
B34, GIGA-R : Génomique animale, 
University of Liège
1 Avenue de l'Hôpital 
4000-Liège, Belgium
charl...@ulg.ac.be
GSM +32484222075


> On 8/12/2015, at 1:44 PM, John Marshall <j...@sanger.ac.uk> wrote:
> 
> On 7 Dec 2015, at 21:47, Claudio Alberti <claudio.albe...@epfl.ch> wrote:
>> I am implementing a parser that is able to read the BAM file in pairs so 
>> whenever I read a record where pos < mpos I search for the mate and I 
>> create a pair structure.
>> Once I find the mate I have to roll back to the second read and continue 
>> building the pairs.
> 
> The sensible way to do this using the existing HTSlib API would be to open 
> the BAM file twice.  Use one file handle to read sequentially, and use a 
> separate second file handle (via sam_itr_queryi()/sam_itr_next()) to search 
> for the mates.  No need to "roll back" as the first file handle is still in 
> the right place.
> 
> (This would not work for streaming from standard input, but of course your 
> mate-reader requires seeking and an index, so already does not work for 
> standard input.)
> 
> To be sure, it would be useful if HTSlib provided seek and tell functions 
> that worked at the high-level htsFile interface, so could be used alongside 
> sam_read1(), bcf_read(), etc.  This would be useful so that people could 
> build interesting new indexing structures, though in this case there would be 
> more seeking involved than in a two file handle implementation, so it would 
> be somewhat slower.  Until Ryan's (the September thread that James pointed 
> you at) and your requests, no one had expressed a desire for such functions.  
> Now they have, so such functions will be added, probably in a HTSlib 1.4 
> release.
> 
>    John
> 
> -- 
> The Wellcome Trust Sanger Institute is operated by Genome Research 
> Limited, a charity registered in England with number 1021457 and a 
> company registered in England with number 2742969, whose registered 
> office is 215 Euston Road, London, NW1 2BE. 
> 
> ------------------------------------------------------------------------------
> Go from Idea to Many App Stores Faster with Intel(R) XDK
> Give your users amazing mobile app experiences with Intel(R) XDK.
> Use one codebase in this all-in-one HTML5 development environment.
> Design, debug & build mobile apps & 2D/3D high-impact games for multiple OSs.
> http://pubads.g.doubleclick.net/gampad/clk?id=254741911&iu=/4140
> _______________________________________________
> Samtools-help mailing list
> Samtools-help@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/samtools-help


