BioJava does already do some compression on large sequences (or at least it used to). As you say, you can bit-pack a lot. Ambiguity causes problems, though, as you can have more than four symbols for DNA (including n, y, r etc).
Does Jim Kent's schema offer better compression? Even if it doesn't, the use of a ByteBuffer will probably increase the speed of the current implementations.

- Mark

"Richard HOLLAND" <[EMAIL PROTECTED]>
01/24/2005 04:47 PM

To: Mark Schreiber/GP/[EMAIL PROTECTED], "Thomas Down" <[EMAIL PROTECTED]>
cc: "biojava-list List" <biojava-l@biojava.org>, <[EMAIL PROTECTED]>
Subject: RE: [Biojava-l] reading nib sequence files

I think storing sequences internally as compressed binary would be a good idea regardless, for any symbol list.

Currently each Symbol in a SymbolList requires one word of memory (the size of a memory pointer to the singleton Symbol instances). Therefore any SymbolList of length X containing symbols from an n-ary alphabet would require X words of memory to store it, plus the overhead of the SymbolList and n Symbol singleton instances (admittedly shared between all SymbolLists currently in memory).

If you used a compressed binary format internally, doing away with explicit Symbol references and representing each symbol in a ByteBuffer as binary values (00 for A, 01 for T, 10 for C, 11 for G etc.), you would require much less space than even the singleton model above. This way you could fit four DNA symbols into a single byte of memory, as opposed to four words of memory.

The number of bits required for a symbol in any given alphabet is merely log base 2 of the size of the alphabet, rounded up to the nearest whole number. E.g. for the English alphabet of 26 letters only, you would need 5 bits, or in terms of whole bytes, you would be able to fit 8 symbols into 5 bytes.
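
A minimal sketch of the bit packing described above, for illustration only. The class and method names (PackedDnaSketch, bitsPerSymbol, pack, symbolAt) are hypothetical and are not part of BioJava; the 2-bit codes follow the A/T/C/G example in the message, and ambiguity symbols are simply rejected.

```java
import java.nio.ByteBuffer;

/** Illustrative only: packs unambiguous DNA into 2 bits per symbol. Not a BioJava class. */
final class PackedDnaSketch {
    // Position in this string is the 2-bit code: A=00, T=01, C=10, G=11 (as in the message).
    private static final String CODES = "ATCG";

    /** Bits needed per symbol: log base 2 of the alphabet size, rounded up (at least 1). */
    static int bitsPerSymbol(int alphabetSize) {
        if (alphabetSize < 1) {
            throw new IllegalArgumentException("alphabet must not be empty");
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(alphabetSize - 1));
    }

    /** Packs four symbols per byte, first symbol in the two highest bits. */
    static ByteBuffer pack(String dna) {
        ByteBuffer buf = ByteBuffer.allocate((dna.length() + 3) / 4); // zero-filled
        for (int i = 0; i < dna.length(); i++) {
            int code = CODES.indexOf(Character.toUpperCase(dna.charAt(i)));
            if (code < 0) {
                throw new IllegalArgumentException("not A, T, C or G: " + dna.charAt(i));
            }
            int shift = 6 - 2 * (i % 4);
            buf.put(i / 4, (byte) (buf.get(i / 4) | (code << shift)));
        }
        return buf;
    }

    /** Decodes one symbol; the caller is assumed to know the true symbol count. */
    static char symbolAt(ByteBuffer buf, int index) {
        int shift = 6 - 2 * (index % 4);
        return CODES.charAt((buf.get(index / 4) >> shift) & 0x3);
    }
}
```

With this layout, pack("ATCGA") occupies two bytes instead of five object references, and bitsPerSymbol(26) gives the 5 bits quoted for the English alphabet.
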
To do this you would need to define a 'bits' parameter on the alphabet which is calculated from the number of symbols in the alphabet, a 'bitMap' parameter on the alphabet which maps symbols to bit values (and vice versa with 'inverseBitMap'), and keep a separate 'length' parameter in the SymbolList which would be used to tell the binary decoder when to stop parsing the sequence (as you can only store whole bytes, there will often be trailing zeroes in the buffer which could be misleading without this extra parameter).

You could always return singleton Symbol objects if requested, by decoding the binary sequence on the fly, but you would no longer need to store the sequence using them.

Is this worth considering for the big BioJava rewrite?

Richard Holland
Bioinformatics Specialist
GIS extension 8199
---------------------------------------------
This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its content to any other person. Thank you.
---------------------------------------------

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]
> Sent: Monday, January 24, 2005 4:37 PM
> To: Thomas Down
> Cc: biojava-list List; Richard HOLLAND;
> "<[EMAIL PROTECTED]"@novartis.com
> Subject: Re: [Biojava-l] reading nib sequence files
>
> I'd need to brush up on my nio, and my c !
>
> Thomas Down <[EMAIL PROTECTED]>
> 01/24/2005 04:34 PM
>
> To: "Richard HOLLAND" <[EMAIL PROTECTED]>
> cc: "<[EMAIL PROTECTED]>", biojava-list List
> <biojava-l@biojava.org>, Mark
> Schreiber/GP/[EMAIL PROTECTED]
> Subject: Re: [Biojava-l] reading nib sequence files
>
> On 24 Jan 2005, at 02:48, Richard HOLLAND wrote:
>
> > It's a compressed binary format. I doubt BioJava would be able to read
> > it without a lot of effort as the current parser framework is set up
> > for text input only.
>
> Nib support probably wouldn't fit into the text-oriented parsing
> framework, but I'm sure it could be supported somehow if there was
> demand. A quick google doesn't turn up any format documentation, but
> Jim Kent's IO code is at:
>
> http://www.soe.ucsc.edu/~kent/src/unzipped/lib/nib.c
>
> One interesting way to handle this might be to open the nib file as a
> MappedByteBuffer, and back a SymbolList directly using that --
> potentially giving us an efficient way of working with huge
> sequences. Any interest in that?
>
> Thomas.

_______________________________________________
Biojava-l mailing list - Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l
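
For readers of the archive, a rough sketch of the MappedByteBuffer idea raised in the thread. It does not parse the real nib layout (nib.c is the reference for that). It assumes a hypothetical file made of a 4-byte big-endian symbol count followed by 2-bit codes (A=00, T=01, C=10, G=11), and the class name MappedPackedSequenceSketch is invented for illustration. The stored count plays the role of the separate 'length' parameter, so trailing padding bits are never decoded.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustrative only: random access to a packed sequence through a memory-mapped file. */
final class MappedPackedSequenceSketch {
    private static final String CODES = "ATCG"; // assumed 2-bit codes, not the nib encoding
    private static final int HEADER_BYTES = 4;  // assumed header: symbol count as a big-endian int

    private final MappedByteBuffer buf;
    private final int length;

    private MappedPackedSequenceSketch(MappedByteBuffer buf) {
        this.buf = buf;
        this.length = buf.getInt(0); // ByteBuffer defaults to big-endian
    }

    /** Maps the whole file read-only; the mapping stays valid after the channel is closed. */
    static MappedPackedSequenceSketch open(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return new MappedPackedSequenceSketch(
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()));
        }
    }

    int length() {
        return length;
    }

    /** Decodes a single symbol on demand, without loading the sequence onto the heap. */
    char symbolAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int shift = 6 - 2 * (index % 4);
        return CODES.charAt((buf.get(HEADER_BYTES + index / 4) >> shift) & 0x3);
    }
}
```

A SymbolList backed this way would translate each decoded character to the matching singleton Symbol on the fly, leaving paging of the file contents to the operating system.
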