Richard HOLLAND wrote:
> What is required for files this size is a SeqIOTools parser that reads
> sequence objects _on demand_ as requested by the iterator, rather than
> reading the whole lot at once.

This brings up a related issue that I'm grappling with at the moment...
I would like to have BioJava parse a large sequence file and then
periodically extract arbitrary subsequences. As currently implemented,
it seems that in order to extract a subsequence, the entire sequence
entry must be loaded from the GenBank/FastA/whatever file into memory.
This becomes a problem when dealing with large chromosomal data sets of
the type displayed in the Mauve alignment viewer. Yes, I'm aware of the
PackedSymbolList. Unfortunately, mammalian genomes are around 3
gigabases, requiring around 700MB each even with a 2-bit-per-base encoding.
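
As a quick check of that figure, here is the arithmetic as a small Java sketch. The class and method names are made up for illustration, and the 3 gigabase count is the same round number used above, not a measured genome size.

```java
/** Illustrative only: estimates the storage needed for a packed sequence. */
final class PackedSize {
    /** Bytes needed to hold `bases` symbols at `bitsPerBase` bits each, rounded up. */
    static long packedBytes(long bases, int bitsPerBase) {
        return (bases * bitsPerBase + 7) / 8;
    }

    public static void main(String[] args) {
        long bases = 3_000_000_000L;          // assumed round figure for a mammalian genome
        long bytes = packedBytes(bases, 2);   // 2 bits per base
        System.out.println(bytes + " bytes"); // prints 750000000 bytes
        System.out.printf("%.0f MiB%n", bytes / (1024.0 * 1024.0)); // prints 715 MiB
    }
}
```

That is 750 million bytes, or about 715 MiB, per genome before any annotation or object overhead is counted.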
Given that it won't be practical to store the entire sequence in memory,
the next best solution would be keeping an in-memory index of relevant
sequence file offsets. Enter BioJava's IndexStore. Unless I've
misunderstood the documentation, the IndexStore family of classes indexes
sequence files on a per-contig/per-entry basis. Such a scheme creates
a rather sparse index for chromosomes that can be > 100MB in length,
since a single entry per chromosome gives no way to seek within it.
What seems ideal would be an implementation of SeqIOTools that could
read a GenBank/FastA file and construct a Sequence-derivative object
with lazy references to the data. The Sequence-derived class would also
need mappings of sequence coordinates to file offsets so that reading a
10-character subsequence n...n+9 doesn't also require reading
subsequence 1...n-1. I implemented a similar scheme in a small C++
library called libGenome years ago and it makes manipulating large data
sets a breeze.
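
To make the coordinate-to-offset idea concrete, here is a minimal Java sketch. It is not BioJava API and not how libGenome does it: the class LazyFastaSequence and its subStr method are hypothetical, and it assumes a single-record FastA file in which every sequence line except the last has the same width, so a file offset can be computed arithmetically instead of being looked up in a stored table. Coordinates are 1-based and inclusive.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of BioJava): reads subsequences of a FastA file on demand. */
final class LazyFastaSequence implements AutoCloseable {
    private final RandomAccessFile file;
    private final long dataStart;   // file offset of the first base
    private final int basesPerLine; // residues on each full sequence line
    private final int bytesPerLine; // basesPerLine plus the line terminator bytes

    LazyFastaSequence(Path fasta) throws IOException {
        file = new RandomAccessFile(fasta.toFile(), "r");
        String header = file.readLine();      // skip the ">..." description line
        dataStart = file.getFilePointer();
        String firstLine = file.readLine();   // used only to measure the line layout
        if (header == null || firstLine == null || firstLine.isEmpty()) {
            file.close();
            throw new IOException("no sequence data found in " + fasta);
        }
        basesPerLine = firstLine.length();
        bytesPerLine = (int) (file.getFilePointer() - dataStart);
    }

    /** Maps a 0-based sequence position to its byte offset in the file. */
    private long offsetOf(long pos) {
        return dataStart + (pos / basesPerLine) * bytesPerLine + pos % basesPerLine;
    }

    /** Returns bases start..end (1-based, inclusive), reading only the bytes that span them. */
    String subStr(long start, long end) throws IOException {
        if (start < 1 || end < start) {
            throw new IllegalArgumentException("bad range " + start + ".." + end);
        }
        long first = offsetOf(start - 1);
        long last = offsetOf(end - 1);
        byte[] raw = new byte[(int) (last - first + 1)]; // sketch: assumes the range fits in an array
        file.seek(first);
        file.readFully(raw);                  // throws EOFException if the range runs past the file
        StringBuilder bases = new StringBuilder(raw.length);
        for (byte b : raw) {
            if (b != '\n' && b != '\r') {     // drop line terminators inside the range
                bases.append((char) b);
            }
        }
        return bases.toString();
    }

    @Override
    public void close() throws IOException {
        file.close();
    }

    public static void main(String[] args) throws IOException {
        Path demo = Files.createTempFile("demo", ".fa");
        Files.write(demo, ">chr\nACGTACGTAC\nGGGGCCCCAA\nTT\n".getBytes(StandardCharsets.US_ASCII));
        try (LazyFastaSequence seq = new LazyFastaSequence(demo)) {
            System.out.println(seq.subStr(9, 12)); // prints ACGG
        } finally {
            Files.deleteIfExists(demo);
        }
    }
}
```

GenBank files, multi-record files, or files with irregular line widths would need a stored table of checkpoints (sequence coordinate, file offset) and a binary search in place of the arithmetic in offsetOf.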
Echoing Richard's question for this slightly different problem:
Can someone clarify if a lazy-loading parser/database implementation
already exists for situations like this, or does one need to be written?
Thanks for BioJava, and thanks for any feedback.
-Aaron
btw: I also brought this up at the BOSC BioJava BoF but we were rather
abruptly ushered out of the meeting room by an anxious hotel staffer
prior to reaching a conclusion.
_______________________________________________
Biojava-l mailing list - Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l