If you want to use Java, I think the nio packages are the way to go. Just my $0.02.
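
For what it's worth, here is a minimal sketch of the kind of thing I mean. It has not been tried on real nib files. It assumes a file that stores two bases per byte (4 bits each, as Richard describes further down), high nibble first, after a fixed-size header that is skipped. The class name PackedSeqReader, the headerBytes parameter and the T/C/A/G/N code table are all my own assumptions and would need checking against nib.c. It uses a positional FileChannel read, so only the bytes covering the requested range are fetched.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper for illustration only; not part of BioJava. */
final class PackedSeqReader {
    // Assumed code table (nibble value -> base); verify against nib.c before use.
    private static final char[] CODES = {'T', 'C', 'A', 'G', 'N'};

    /** Reads bases [start, start + length) from a file holding two bases per byte. */
    static String read(Path file, long headerBytes, long start, int length) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long firstByte = headerBytes + start / 2;
            long lastByte = headerBytes + (start + length - 1) / 2;
            ByteBuffer buf = ByteBuffer.allocate((int) (lastByte - firstByte + 1));
            // Positional reads: nothing before firstByte is touched.
            while (buf.hasRemaining() && ch.read(buf, firstByte + buf.position()) > 0) {
                // keep reading until the buffer is full or the file ends
            }
            buf.flip();
            StringBuilder sb = new StringBuilder(length);
            for (long i = start; i < start + length; i++) {
                int b = buf.get((int) (i / 2 - start / 2)) & 0xFF;
                // Assumed layout: even positions in the high nibble, odd in the low nibble.
                int nibble = (i % 2 == 0) ? (b >> 4) : (b & 0x0F);
                sb.append(nibble < CODES.length ? CODES[nibble] : '?');
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        Path demo = Files.createTempFile("packed", ".bin");
        Files.write(demo, new byte[] {0x01, 0x23, 0x40}); // nibbles 0,1,2,3,4,0
        System.out.println(read(demo, 0, 1, 4));          // prints CAGN
        Files.delete(demo);
    }
}
```

The cost of a read depends on the length of the range requested, not on where it sits in the file, which is the property Dan is asking about.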

Dan Baggott <[EMAIL PROTECTED]>
Sent by: [EMAIL PROTECTED]
01/28/2005 07:01 AM
Please respond to baggott2

To: biojava-list List <biojava-l@biojava.org>
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: Re: [Biojava-l] reading nib sequence files

That question started off a flurry... Thanks for the input!

So, from my narrow and selfish perspective, the short of this thread is
that there isn't any "ready to go" nib i/o code and that the existing
BioJava parsing framework is not designed to deal with binary files, so
it would be less than trivial to adapt it.

I don't have much experience with reading from large files (binary or
otherwise). Is there a general consensus on the path of least
resistance for implementing fast random access to large-ish nucleotide
sequences (i.e. on the order of human chromosome sized)? I'm not so
concerned about the size of the sequence files, just speed of access.

I mentioned the nib format in the first place because I was impressed
with the speed at which Jim Kent's nibFrag utility extracts sequence --
pretty much immediately from the human perspective.

Dan

On Tue, 25 Jan 2005 08:29:37 +1300, Smithies, Russell <[EMAIL PROTECTED]> wrote:
> You don't need to extract the whole file with ZipInputStream first.
> I managed to get the part I wanted by setting the offset to the start of
> the sequence (was using zipped chromosomes in fasta format) and the
> buffer to the length I wanted.
> It was a year or 2 ago and I probably don't have the code anymore but it
> is possible ;-)
>
> Russell Smithies
>
> Bioinformatics Software Developer
> AgResearch Invermay
> Private Bag 50034
> Puddle Alley
> Mosgiel
> New Zealand
>
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Richard HOLLAND
> Sent: Monday, 24 January 2005 10:19 p.m.
> To: VERHOEF Frans; [EMAIL PROTECTED]
> Cc: biojava-list List; Thomas Down
> Subject: RE: [Biojava-l] reading nib sequence files
>
> The trouble with ZIP is that to do random-access reads of the sequence
> (eg. give me all bases from X to Y) you have to unzip the whole sequence
> each time. That makes it quite a bit slower. The solution needs to be a
> compression algorithm of some kind which allows instant random access
> without slowing down the create/update process too much either. Hence a
> custom fixed-width binary solution would be the first thing that comes
> to mind, but it may not be the only one.
>
> Richard Holland
> Bioinformatics Specialist
> GIS extension 8199
>
> ---------------------------------------------
> This email is confidential and may be privileged. If you are not the
> intended recipient, please delete it and notify us immediately. Please
> do not copy or use it for any purpose, or disclose its content to any
> other person. Thank you.
> ---------------------------------------------
>
> > -----Original Message-----
> > From: VERHOEF Frans
> > Sent: Monday, January 24, 2005 5:16 PM
> > To: Richard HOLLAND; [EMAIL PROTECTED]
> > Cc: Thomas Down; biojava-list List
> > Subject: RE: [Biojava-l] reading nib sequence files
> >
> > You could always ZIPStream it out for even more compression.
> >
> > Frans
> >
> > -----Original Message-----
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED] On Behalf Of Richard HOLLAND
> > Sent: Monday, January 24, 2005 04:59 PM
> > To: [EMAIL PROTECTED]
> > Cc: Thomas Down; biojava-list List
> > Subject: RE: [Biojava-l] reading nib sequence files
> >
> > NIB files store one base per 4 bits, non-variable, giving a
> > 50% compression rate and a maximum arity of 16 different base
> > values per position.
> >
> > Richard Holland
> > Bioinformatics Specialist
> > GIS extension 8199
> >
> > > -----Original Message-----
> > > From: [EMAIL PROTECTED]
> > > [mailto:[EMAIL PROTECTED]
> > > Sent: Monday, January 24, 2005 4:53 PM
> > > To: Richard HOLLAND
> > > Cc: [EMAIL PROTECTED]; biojava-list List; Thomas Down
> > > Subject: RE: [Biojava-l] reading nib sequence files
> > >
> > > BioJava does already do some compression on large sequences (or at
> > > least it used to). Like you say you can bit pack a lot. Ambiguity
> > > causes problems as you can have more than four symbols for DNA
> > > (including n, y, r etc).
> > >
> > > Does Jim Kent's schema offer better compression? Even if it doesn't,
> > > the use of a ByteBuffer will probably increase the speed of the
> > > current implementations.
> > >
> > > - Mark
> > >
> > > "Richard HOLLAND" <[EMAIL PROTECTED]>
> > > 01/24/2005 04:47 PM
> > >
> > > To: Mark Schreiber/GP/[EMAIL PROTECTED], "Thomas Down"
> > > <[EMAIL PROTECTED]>
> > > cc: "biojava-list List" <biojava-l@biojava.org>,
> > > <[EMAIL PROTECTED]>
> > > Subject: RE: [Biojava-l] reading nib sequence files
> > >
> > > I think the idea of storing sequences internally as compressed binary
> > > sequence would be a good idea regardless, for any symbol list.
> > > Currently each Symbol in a SymbolList requires one word of memory (the
> > > size of a memory pointer to the singleton Symbol instances). Therefore
> > > any SymbolList of length X containing symbols from an n-ary alphabet
> > > would require X words of memory to store it, plus the overhead of the
> > > SymbolList and n Symbol singleton instances (admittedly shared between
> > > all SymbolLists currently in memory).
> > >
> > > If you used a compressed binary format internally, doing away with
> > > explicit Symbol references and representing each symbol in a
> > > ByteBuffer as binary values (00 for A, 01 for T, 10 for C, 11 for G
> > > etc.), you would require much less space than even the singleton model
> > > above. This way you could fit four DNA symbols into a single byte of
> > > memory, as opposed to four words of memory. The number of bits
> > > required for a symbol in any given alphabet is merely log base 2 of
> > > the size of the alphabet, rounded up to the nearest whole number. eg.
> > > for the English alphabet of 26 letters only, you would need 5 bits, or
> > > in terms of whole bytes, you would be able to fit 8 symbols into 5
> > > bytes.
> > >
> > > To do this you would need to define a 'bits' parameter on the alphabet
> > > which is calculated from the number of symbols in the alphabet, a
> > > 'bitMap' parameter on the alphabet which maps symbols to bit values
> > > (and vice versa with 'inverseBitMap'), and keep a separate 'length'
> > > parameter in the SymbolList which would be used to tell the binary
> > > decoder when to stop parsing the sequence (as you can only store whole
> > > bytes, there will often be trailing zeroes in the buffer which could
> > > be misleading without this extra parameter).
> > >
> > > You could always return singleton Symbol objects if requested, by
> > > decoding the binary sequence on the fly, but you would no longer need
> > > to store the sequence using them.
> > >
> > > Is this worth considering for the big BioJava rewrite?
> > >
> > > Richard Holland
> > > Bioinformatics Specialist
> > > GIS extension 8199
> > >
> > > > -----Original Message-----
> > > > From: [EMAIL PROTECTED]
> > > > [mailto:[EMAIL PROTECTED]
> > > > Sent: Monday, January 24, 2005 4:37 PM
> > > > To: Thomas Down
> > > > Cc: biojava-list List; Richard HOLLAND;
> > > > "<[EMAIL PROTECTED]"@novartis.com
> > > > Subject: Re: [Biojava-l] reading nib sequence files
> > > >
> > > > I'd need to brush up on my nio, and my c !
> > > >
> > > > Thomas Down <[EMAIL PROTECTED]>
> > > > 01/24/2005 04:34 PM
> > > >
> > > > To: "Richard HOLLAND" <[EMAIL PROTECTED]>
> > > > cc: "<[EMAIL PROTECTED]>", biojava-list List
> > > > <biojava-l@biojava.org>, Mark Schreiber/GP/[EMAIL PROTECTED]
> > > > Subject: Re: [Biojava-l] reading nib sequence files
> > > >
> > > > On 24 Jan 2005, at 02:48, Richard HOLLAND wrote:
> > > >
> > > > > It's a compressed binary format. I doubt BioJava would be able to
> > > > > read it without a lot of effort as the current parser framework is
> > > > > set up for text input only.
> > > >
> > > > Nib support probably wouldn't fit into the text-oriented parsing
> > > > framework, but I'm sure it could be supported somehow if there was
> > > > demand. A quick google doesn't turn up any format documentation, but
> > > > Jim Kent's IO code is at:
> > > >
> > > > http://www.soe.ucsc.edu/~kent/src/unzipped/lib/nib.c
> > > >
> > > > One interesting way to handle this might be to open the nib file as a
> > > > MappedByteBuffer, and back a SymbolList directly using that --
> > > > potentially giving us an efficient way of working with huge
> > > > sequences..
> > > > Any interest in that?
> > > >
> > > > Thomas.
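
A rough sketch of the MappedByteBuffer idea Thomas describes above, in case it helps. It is not a BioJava SymbolList, only a read-only java.util.List<Character> view over a memory-mapped file. The class name MappedBaseList, the headerBytes argument, the high-nibble-first layout and the T/C/A/G/N code table are my assumptions, not something taken from nib.c. The caller passes the sequence length explicitly, along the lines of the separate 'length' parameter Richard suggests, so a trailing padding nibble is never decoded.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.AbstractList;

/** Hypothetical read-only list of bases backed by a memory-mapped file. */
final class MappedBaseList extends AbstractList<Character> {
    // Assumed code table (nibble value -> base); verify against nib.c before use.
    private static final char[] CODES = {'T', 'C', 'A', 'G', 'N'};
    private final MappedByteBuffer buf;
    private final int length;

    private MappedBaseList(MappedByteBuffer buf, int length) {
        this.buf = buf;
        this.length = length;
    }

    /** Maps everything after the header; length is the number of bases stored. */
    static MappedBaseList open(Path file, long headerBytes, int length) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            MappedByteBuffer buf =
                    ch.map(FileChannel.MapMode.READ_ONLY, headerBytes, ch.size() - headerBytes);
            if ((length + 1L) / 2 > buf.capacity()) {
                throw new IllegalArgumentException("file too short for " + length + " bases");
            }
            return new MappedBaseList(buf, length);
        }
    }

    @Override
    public Character get(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index + ", length " + length);
        }
        int b = buf.get(index / 2) & 0xFF; // absolute get, no shared position state
        // Assumed layout: even positions in the high nibble, odd in the low nibble.
        int nibble = (index % 2 == 0) ? (b >> 4) : (b & 0x0F);
        return nibble < CODES.length ? CODES[nibble] : '?';
    }

    @Override
    public int size() {
        return length;
    }

    public static void main(String[] args) throws IOException {
        Path demo = Files.createTempFile("mapped", ".bin");
        Files.write(demo, new byte[] {0x01, 0x23, 0x40}); // nibbles 0,1,2,3,4 plus padding
        System.out.println(open(demo, 0, 5).subList(1, 4)); // prints [C, A, G]
    }
}
```

A single mapping is limited to Integer.MAX_VALUE bytes, which is plenty for a human chromosome at two bases per byte. The operating system pages data in on demand, so opening the list is cheap and only the regions actually read are loaded.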
> > _______________________________________________
> > Biojava-l mailing list - Biojava-l@biojava.org
> > http://biojava.org/mailman/listinfo/biojava-l
>
> =======================================================================
> Attention: The information contained in this message and/or attachments
> from AgResearch Limited is intended only for the persons or entities
> to which it is addressed and may contain confidential and/or privileged
> material. Any review, retransmission, dissemination or other use of, or
> taking of any action in reliance upon, this information by persons or
> entities other than the intended recipients is prohibited by AgResearch
> Limited. If you have received this message in error, please notify the
> sender immediately.
> =======================================================================