John’s description matches how I’ve implemented something similar using the HTSJDK library, and it appears to work well.
I’m not sure whether HTSlib implements all the same methods as HTSJDK, but in HTSJDK I open the file twice. The first file handle is used to scan the genome to find events that match my criteria, then the second file handle is used with the HTSJDK queryMate method to retrieve the mate for the reads of interest. If similar functions are implemented in HTSlib you could do something similar.

Actually, the best option might be to combine the two ideas. From a SAM record you can query how far apart the mates are. If they are within, say, 500 bases, you could store the read in a hash table until you meet its mate, then write both out to disk. This would allow you to calculate the rough amount of memory needed as a function of the insert size range allowed and the depth of coverage of your BAM file.

For reads whose mates map outside the range covered by the hash table (outliers, or mates on different chromosomes) you could then use your second file handle with queryMate or its equivalent to retrieve the mate. Compared with keeping every unpaired read in memory, this reduces the memory needed for the hash table, and compared with querying for every mate, it reduces the amount of IO.

Chad Harland
B34, GIGA-R : Génomique animale, University of Liège
1 Avenue de l'Hôpital
4000-Liège, Belgium
charl...@ulg.ac.be
GSM +32484222075

> On 8/12/2015, at 1:44 PM, John Marshall <j...@sanger.ac.uk> wrote:
>
> On 7 Dec 2015, at 21:47, Claudio Alberti <claudio.albe...@epfl.ch> wrote:
>> I am implementing a parser that is able to read the BAM file in pairs so
>> whenever I read a record where pos < mpos I search for the mate and I
>> create a pair structure.
>> Once I find the mate I have to roll back to the second read and continue
>> building the pairs.
>
> The sensible way to do this using the existing HTSlib API would be to open
> the BAM file twice. Use one file handle to read sequentially, and use a
> separate second file handle (via sam_itr_queryi()/sam_itr_next()) to search
> for the mates.
> No need to "roll back" as the first file handle is still in
> the right place.
>
> (This would not work for streaming from standard input, but of course your
> mate-reader requires seeking and an index, so already does not work for
> standard input.)
>
> To be sure, it would be useful if HTSlib provided seek and tell functions
> that worked at the high-level htsFile interface, so could be used alongside
> sam_read1(), bcf_read(), etc. This would be useful so that people could
> build interesting new indexing structures, though in this case there would be
> more seeking involved than in a two file handle implementation, so it would
> be somewhat slower. Until Ryan's (the September thread that James pointed
> you at) and your requests, noone had expressed a desire for such functions.
> Now they have, so such functions will be added, probably in a HTSlib 1.4
> release.
>
> John
>
> --
> The Wellcome Trust Sanger Institute is operated by Genome Research
> Limited, a charity registered in England with number 1021457 and a
> company registered in England with number 2742969, whose registered
> office is 215 Euston Road, London, NW1 2BE.
>
> _______________________________________________
> Samtools-help mailing list
> Samtools-help@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/samtools-help