Great. Thanks Simon. I hate tracking down these reference leaks.

Matthew

Simon Brocklehurst wrote:
Re: Parsing massive Blast output

In regard to recent mail to the list (and mail from up to a couple of years ago):

Until now, when attempting to parse *very* large blast outputs consisting of many (thousands of) separate reports concatenated together, the Java Virtual Machine could sometimes run out of memory. The workaround people have been using was to split the output into smaller chunks that the parser could deal with.

This parsing problem was due to a small bug, which we've now (I think/hope) fixed in the biojava cvs (biojava-live).

The parser should now deal successfully with arbitrarily large amounts of data, without any need for chunking the output.

After applying this fix, the "BlastLike" SAX parser was tested for scalability in terms of handling large numbers of concatenated blast reports as follows:

Size measures of typical test input files:

o Tens of thousands of concatenated blast-like reports

o Tens of millions of individual lines of blast-like pairwise output data

o Gigabytes in size

Tests were run using JDK 1.4.1 on Solaris 9. Input data was parsed in such a way as to process all SAX events generated by the underlying SAX driver.

o For each test, the outputs from the parser were XML documents, each on the order of hundreds of millions of lines in size.

o Memory footprint remained both small and constant throughout the parsing process, with a typical memory footprint under 14 MB in size.
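The flat memory footprint is what you'd expect from the event-driven SAX model: the application handles each event as it streams past and retains nothing, rather than building an in-memory document tree. As a minimal sketch of that pattern using only the JDK's built-in SAX API (not the BioJava BlastLikeSAXParser itself; the <hit> element here is purely illustrative):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Sketch of constant-memory, event-driven parsing: the handler reacts to
// each start-element event and discards it, so memory use stays flat no
// matter how many concatenated records stream through.
public class StreamingCount {

    public static int countElements(String xml, String name) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if (qName.equals(name)) {
                    count[0]++; // process and forget -- no DOM is retained
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<reports><hit/><hit/><hit/></reports>";
        System.out.println(countElements(xml, "hit")); // prints 3
    }
}
```

The same pattern scales to gigabyte inputs if the InputSource wraps a file stream instead of a String, since at no point does the handler accumulate the parsed content.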


Simon

--
BioJava Consulting LTD - Support and training for BioJava
http://www.biojava.co.uk

_______________________________________________
Biojava-l mailing list  -  [EMAIL PROTECTED]
http://biojava.org/mailman/listinfo/biojava-l
