On Thu, Feb 25, 2010 at 10:08 AM, Charles Imbusch <[email protected]> wrote:
>
> Dear Peter,
>
> thanks for your mail. I will try to make use of that index
> to speed things up when I have time available.
>
> Cheers,
> Charles

Hi Charles,

I've found that when you want random access to the reads, loading the
provided .mft or .srt index is MUCH faster than scanning the whole file
to build the index manually. So this really is worth the effort.

I hope the comments in my code are reasonably clear, but to recap, the
key idea of the index block is that you get chunks of data of varying
length (although typically all the same length, since by default all
the Roche read names have the same length) like this:

    name, null char, four character offset, terminator char of 0xFF

You divide the index block into entries for each read by finding the
0xFF terminators. Because 0xFF (decimal 255) is used in this way, it
cannot be used to encode the offsets, which must only use 0x00 to 0xFE
(decimal 0 to 254). The offset therefore uses base 255 instead of
base 256.

Note that this means the largest offset the current Roche index blocks
can hold is just below 255^4 = 4,228,250,625 bytes, or a little under
4GB. If you use the Roche tools to try to merge SFF files to make an
example SFF file over 4GB, you get a warning that there will be no
index (and no manifest).

The index holds the reads sorted alphabetically by name. We don't take
advantage of this in Biopython, since I use a Python dictionary (like a
Perl hash) to store the offsets.

In case you missed them, I'd like to draw your attention to the SFF
files I am using in the Biopython unit tests:
http://github.com/biopython/biopython/tree/master/Tests/Roche/

Regards,

Peter
_______________________________________________
Biojava-l mailing list - [email protected]
http://lists.open-bio.org/mailman/listinfo/biojava-l
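
To make the entry layout described above concrete, here is a minimal
Python sketch of splitting an index block on the 0xFF terminators and
decoding the base 255 offsets. It is not the Biopython code. The
function names (parse_roche_index, encode_offset) are made up for
illustration, and it assumes that the four offset characters are stored
most significant digit first and that the block passed in holds only
the entries (any header or trailing padding already removed). Neither
assumption is stated in the message above, so check them against a
real file.

```python
def encode_offset(offset):
    """Encode an offset as four base 255 digits (hypothetical helper).

    Assumption: most significant digit first. Each digit is 0..254,
    so the byte 0xFF never appears in the result.
    """
    if not 0 <= offset < 255 ** 4:
        raise ValueError("offset does not fit in four base 255 digits")
    digits = []
    for _ in range(4):
        offset, digit = divmod(offset, 255)
        digits.append(digit)
    return bytes(reversed(digits))


def parse_roche_index(block):
    """Return a dict mapping read name to offset (hypothetical helper).

    Each entry is assumed to be: name, 0x00, four base 255 digits,
    followed by a 0xFF terminator.
    """
    offsets = {}
    for entry in block.split(b"\xff"):
        if not entry:
            continue  # empty piece after the final terminator
        # Names contain no null byte, so the first null ends the name.
        name, null, digits = entry.partition(b"\x00")
        if not null or len(digits) != 4:
            raise ValueError("malformed index entry: %r" % entry)
        offset = 0
        for digit in digits:  # iterating bytes yields ints in Python 3
            offset = offset * 255 + digit
        offsets[name.decode("ascii")] = offset
    return offsets


# Two made-up entries, the second holding the largest possible offset.
example = (
    b"READ1\x00" + encode_offset(1000) + b"\xff"
    + b"READ2\x00" + encode_offset(255 ** 4 - 1) + b"\xff"
)
index = parse_roche_index(example)
```

Storing the result in a dictionary mirrors the approach mentioned
above: lookups by read name do not depend on the entries being sorted.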
