Richard HOLLAND wrote:

> What is required for files this size is a SeqIOTools parser that reads
> sequence objects _on demand_ as requested by the iterator, rather than
> reading the whole lot at once.

This brings up a related issue that I'm grappling with at the moment. I would like to have BioJava parse a large sequence file and then periodically extract arbitrary subsequences. As currently implemented, it seems that extracting a subsequence requires loading the entire sequence entry from the GenBank/FastA/whatever file into memory. This becomes a problem when dealing with large chromosomal data sets of the type displayed in the Mauve alignment viewer. Yes, I'm aware of the PackedSymbolList. Unfortunately, mammalian genomes are around 3 gigabases, which even at 2 bits per base comes to around 700MB each.

Given that it won't be practical to store the entire sequence in memory, the next best solution would be to keep an in-memory index of relevant sequence file offsets. Enter BioJava's IndexStore. Unless I've misunderstood the documentation, the IndexStore family of classes indexes sequence files on a per-contig/per-entry basis. Such a scheme produces a rather sparse index when a single entry is a chromosome that can be > 100MB in length: one offset per entry says nothing about where a given position inside that entry sits in the file. What seems ideal would be an implementation of SeqIOTools that could read a GenBank/FastA file and construct a Sequence-derivative object with lazy references to the data. The Sequence-derived class would also need mappings of sequence coordinates to file offsets so that reading a 10 character subsequence n...n+9 doesn't require also reading subsequence 1...n-1. I implemented a similar scheme in a small C++ library called libGenome years ago and it makes manipulating large data sets a breeze.
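
To make the coordinate-to-offset idea concrete, here is a rough sketch in plain Java. It is not BioJava code and does not use SeqIOTools or IndexStore; the class name LazyFastaSequence, the method subStr, and the "step" parameter are all made up for illustration. It assumes a FastA file whose first record is the one of interest, ASCII content, and 1-based coordinates. One pass over the file records the file offset of every step-th base; a later read seeks to the nearest earlier checkpoint and skips at most step-1 bases.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: lazy subsequence access to the first record of a FastA file. */
final class LazyFastaSequence {
    private final String path;
    private final int step;        // number of bases between checkpoints
    private final long[] offsets;  // offsets[i] = file offset of base i*step (0-based)
    private final long length;     // total number of bases in the record

    LazyFastaSequence(String path, int step) throws IOException {
        if (step < 1) throw new IllegalArgumentException("step must be >= 1");
        this.path = path;
        this.step = step;
        List<Long> marks = new ArrayList<>();
        long filePos = 0, bases = 0;
        boolean inHeader = false, seenRecord = false;
        // Single sequential scan: only the checkpoints are kept in memory.
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            int b;
            while ((b = in.read()) != -1) {
                if (inHeader) {
                    if (b == '\n') inHeader = false;   // header line ends here
                } else if (b == '>') {
                    if (seenRecord) break;             // stop at the second record
                    seenRecord = true;
                    inHeader = true;
                } else if (!Character.isWhitespace(b)) {
                    if (bases % step == 0) marks.add(filePos);
                    bases++;
                }
                filePos++;
            }
        }
        this.length = bases;
        this.offsets = new long[marks.size()];
        for (int i = 0; i < offsets.length; i++) offsets[i] = marks.get(i);
    }

    long length() {
        return length;
    }

    /** Returns len bases starting at 1-based position start, reading only what is needed. */
    String subStr(long start, int len) throws IOException {
        if (start < 1 || len < 0 || start - 1 + len > length) {
            throw new IndexOutOfBoundsException("start=" + start + ", len=" + len);
        }
        if (len == 0) return "";
        long first = start - 1;                       // 0-based index of the first base
        int mark = (int) (first / step);              // nearest checkpoint at or before it
        long toSkip = first - (long) mark * step;     // bases to discard after seeking
        StringBuilder sb = new StringBuilder(len);
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(offsets[mark]);
            byte[] buf = new byte[8192];
            int n;
            while (sb.length() < len && (n = raf.read(buf)) > 0) {
                for (int i = 0; i < n && sb.length() < len; i++) {
                    int c = buf[i] & 0xFF;
                    if (Character.isWhitespace(c)) continue;  // line breaks are not bases
                    if (toSkip > 0) toSkip--;
                    else sb.append((char) c);
                }
            }
        }
        return sb.toString();
    }
}
```

With a step of 1 << 20, a 3 gigabase sequence needs only a few thousand long offsets in memory, at the cost of skipping up to about a megabase of file data per read. A smaller step trades memory for faster seeks. A real implementation would presumably wrap this behind a SymbolList/Sequence interface and handle GenBank as well, which this sketch does not attempt.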

Echoing Richard's question for this slightly different problem:

> Can someone clarify if a lazy-loading parser/database implementation
> already exists for situations like this, or does one need to be written?

Thanks for BioJava, and thanks for any feedback.
-Aaron

BTW: I also brought this up at the BOSC BioJava BOF, but we were rather abruptly ushered out of the meeting room by an anxious hotel staffer before reaching a conclusion.
_______________________________________________
Biojava-l mailing list  -  Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l