I think this would be easily doable with BioJava. It would require a custom implementation of Sequence and, thanks to the beauty of interfaces, you probably wouldn't even know you were dealing with an assembly (except that it might sometimes be a bit slow while collecting data).
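
To illustrate why a caller would not notice the difference, here is a minimal sketch of a lazily loaded sequence hidden behind an interface. It deliberately uses a tiny hypothetical interface, `Bases`, rather than BioJava's real Sequence interface, which has many more methods; `InMemoryBases`, `LazyBases` and the `Supplier`-based loader are likewise assumptions made purely for illustration.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a sequence interface (NOT BioJava's Sequence). */
interface Bases {
    int length();
    String subStr(int start, int end); // 1-based, inclusive coordinates
}

/** Holds the whole sequence in memory. */
final class InMemoryBases implements Bases {
    private final String data;
    InMemoryBases(String data) { this.data = data; }
    public int length() { return data.length(); }
    public String subStr(int start, int end) { return data.substring(start - 1, end); }
}

/** Defers loading until the first call; callers only ever see Bases. */
final class LazyBases implements Bases {
    private final Supplier<Bases> loader; // assumed to read a file, query a server, etc.
    private Bases loaded;                 // null until first use

    LazyBases(Supplier<Bases> loader) { this.loader = loader; }

    private Bases delegate() {
        if (loaded == null) loaded = loader.get(); // the slow part happens here, once
        return loaded;
    }
    public int length() { return delegate().length(); }
    public String subStr(int start, int end) { return delegate().subStr(start, end); }
}
```

Code written against `Bases` behaves the same with either implementation; only the first call on a `LazyBases` pays the loading cost. The sketch is not thread-safe and loads everything at once, so a real assembly-backed version would presumably fetch pieces on demand instead.
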
Like you say, you could use IndexStore. It might also be worth looking at how Dazzle deals with DAS to see if you can steal anything from there.

Ideally the SequenceBuilders called (eventually) by SeqIOTools should decide what kind of Sequence implementation you get back. For example, small sequences get SimpleSequence, mid-sized ones get PackedSymbolList, and really large ones get some kind of lazy-loaded sequence.

Before diving in, it would be interesting to know whether it is the big sequence or the thousands of features that make large sequences problematic. If it's the features, you would need to lazy load those as well (which could be problematic).

- Mark

Aaron Darling <[EMAIL PROTECTED]>
Sent by: [EMAIL PROTECTED]
07/04/2005 02:35 PM

To: biojava-l@biojava.org, Paul Infield-Harm <[EMAIL PROTECTED]>
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: Re: [Biojava-l] Dealing with huge sequences (was: "memory leak while reading nr.fasta")

Richard HOLLAND wrote:

>What is required for files this size is a SeqIOTools parser that reads
>sequence objects _on demand_ as requested by the iterator, rather than
>reading the whole lot at once.
>

This brings up a related issue that I'm grappling with at the moment... I would like to have biojava parse a large sequence file and then periodically extract arbitrary subsequences. As currently implemented, it seems that in order to extract a subsequence, the entire sequence entry must be loaded from the GenBank/FastA/whatever file into memory. This becomes a problem when dealing with large chromosomal data sets of the type displayed in the Mauve alignment viewer.

Yes, I'm aware of the PackedSymbolList. Unfortunately, mammalian genomes are around 3 gigabases, requiring around 700MB each using a 2 bits per base encoding. Given that it won't be practical to store the entire sequence in memory, the next best solution would be keeping an in-memory index of relevant sequence file offsets. Enter BioJava's IndexStore.
Unless I've misunderstood the documentation, the IndexStore family of classes index sequence files on a per-contig/per-entry basis. Such a scheme creates rather sparse indexes for chromosomes that can be > 100MB in length.

What seems ideal would be an implementation of SeqIOTools that could read a GenBank/FastA file and construct a Sequence-derivative object with lazy references to the data. The Sequence-derived class would also need mappings of sequence coordinates to file offsets so that reading a 10 character subsequence n...n+10 doesn't require also reading subsequence 1...n-1. I implemented a similar scheme in a small C++ library called libGenome years ago and it makes manipulating large data sets a breeze.

Echoing Richard's question for this slightly different problem:

>Can someone clarify if a lazy-loading parser/database implementation
>already exists for situations like this, or does one need to be written?
>

Thanks for Biojava, and thanks for any feedback
-Aaron

btw: I also brought this up at the BOSC biojava BOF but we were rather abruptly ushered out of the meeting room by an anxious hotel staffer prior to reaching a conclusion.

_______________________________________________
Biojava-l mailing list - Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l
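
The mapping of sequence coordinates to file offsets described above can be sketched for the simplest case: a single-record FASTA file in which every sequence line has the same width (the last line may be shorter). This is only a sketch under those assumptions and is not BioJava code; the class name `FastaWindow` and its methods are hypothetical, line endings are assumed to be a single `\n` byte, and coordinates are taken to be 1-based and inclusive. Files with irregular line widths would need an explicit table of line offsets instead of the arithmetic used here.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical helper, not part of BioJava: random access into a one-record FASTA file. */
final class FastaWindow {
    private final long dataStart;   // file offset of the first base (just past the header line)
    private final int basesPerLine; // bases on each full sequence line
    private final int bytesPerLine; // basesPerLine plus one byte for "\n" (assumed)

    FastaWindow(long dataStart, int basesPerLine) {
        this.dataStart = dataStart;
        this.basesPerLine = basesPerLine;
        this.bytesPerLine = basesPerLine + 1;
    }

    /** Maps a 1-based sequence coordinate to a byte offset in the file. */
    long offsetOf(long pos) {
        long zero = pos - 1;
        return dataStart + (zero / basesPerLine) * bytesPerLine + (zero % basesPerLine);
    }

    /** Reads bases start..end (1-based, inclusive) without reading anything before start. */
    String read(RandomAccessFile file, long start, long end) throws IOException {
        long first = offsetOf(start);
        long last = offsetOf(end);
        byte[] raw = new byte[(int) (last - first + 1)];
        file.seek(first);
        file.readFully(raw);
        StringBuilder bases = new StringBuilder(raw.length);
        for (byte b : raw) {
            if (b != '\n') bases.append((char) b); // skip line breaks inside the window
        }
        return bases.toString();
    }
}
```

With this layout the only state kept in memory is two numbers per record, so the cost of a read depends on the length of the requested window rather than on its position in the chromosome.
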