Re: [Biojava-l] How to parse large Genbank files?

Richard Holland Tue, 28 Jul 2009 07:32:46 -0700


Btw: Should we move this to Biojava-dev?


probably, yes! :)

And where do I sign up for BioJava3 development? ;-)

Andreas Prlic has the keys to the project these days. BJ3 does already have some new code in place for handling sequences as strings, but it's in an out-of-the-way bit of the repository and is not part of the main roadmap for the project at present. The current focus is on modularising the existing bits, so that individual components can be refactored to behave better at a future date.

If you want to explore my ideas for a replacement Sequence model, the code and docs are here (sequence handling is in the 'core' module with DNA-specifics in the 'dna' module):


http://biojava.org/wiki/BioJava3:HowTo
http://www.biojava.org/wiki/BioJava3_project

(Methods such as file parsers would request Strings (or ideally CharSequence, which is more flexible and which String implements) as parameters whenever they don't care about content. If they care about content but don't care in advance about size or random access, then they should request Iterator<Symbol>, which can be used to wrap a String and parse on demand. If they need full functionality, then they should request List<Symbol>; the default implementation uses ArrayLists, but there's no reason a String-backed one couldn't be written as well.)
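
To make the "wrap a String and parse on demand" idea concrete, here is a minimal sketch of an iterator backed by a CharSequence. The Nucleotide enum is a hypothetical stand-in for a symbol type and the class names are invented for illustration; none of this is the actual BioJava3 API.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a symbol type; not the real BioJava3 API. */
enum Nucleotide {
    A, C, G, T, N;

    static Nucleotide fromChar(char c) {
        switch (Character.toUpperCase(c)) {
            case 'A': return A;
            case 'C': return C;
            case 'G': return G;
            case 'T': return T;
            default:  return N; // anything unrecognised is treated as "unknown"
        }
    }
}

/** Maps each character to a symbol only when asked; the text is never copied. */
final class CharSequenceSymbolIterator implements Iterator<Nucleotide> {
    private final CharSequence source;
    private int position = 0;

    CharSequenceSymbolIterator(CharSequence source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        return position < source.length();
    }

    @Override
    public Nucleotide next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return Nucleotide.fromChar(source.charAt(position++));
    }
}

final class SymbolIteratorDemo {
    /** A consumer that cares about content but not about size or random access. */
    static int countGc(Iterator<Nucleotide> symbols) {
        int gc = 0;
        while (symbols.hasNext()) {
            Nucleotide n = symbols.next();
            if (n == Nucleotide.G || n == Nucleotide.C) {
                gc++;
            }
        }
        return gc;
    }

    public static void main(String[] args) {
        // Any CharSequence works: String, StringBuilder, CharBuffer, ...
        CharSequence dna = new StringBuilder("gattaca").append("GC");
        System.out.println(countGc(new CharSequenceSymbolIterator(dna)));
    }
}
```

The consumer only ever holds one symbol at a time, so the memory cost is that of the backing CharSequence plus a single int for the position.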


cheers,
Richard

- Florian
On Mon, Jul 27, 2009 at 8:16 PM, Florian Mittag<[email protected]> wrote:
Hi Mark!

On Saturday, 25. July 2009 04:20, Mark Schreiber wrote:
I don't think anyone has done much or anything to optimize these
parsers. The process you outline sounds extremely inefficient. It is
also likely to lead to memory leaks due to the number of copy
operations.
I wouldn't necessarily say that it leads to memory leaks, but it
definitely leads to high memory consumption (2GB are not enough for a
200MB file). Also, my outline of the process is based on only 2 hours of
viewing the code, so actually I expected to be corrected on this.
Unfortunately, it seems like I did get the right idea and it IS extremely
inefficient.
I mean, I understand that this is a high level of abstraction that might
come in handy in many situations, but it certainly is more of an obstacle
in my specific case.
As always with Java, don't try and optimize without a profiler, which
will tell you which methods are taking a long time and which objects
take the most memory.
I think we should continue this discussion on the biojava-dev list or in
a private conversation, as it will probably get very detailed and
technical.


My question to this list again:
Is there a way to achieve my goal of parsing a 200MB Genbank file with
the current biojava version without code changes?


- Florian
On 25 Jul 2009, 1:33 AM, "Florian Mittag"
<[email protected]> wrote:

Hi!
I think this is a problem worthy of its own thread, so I'll start one:
I want to store all human chromosomes in a BioSQL database after loading
the information from .gbk files. I get the files from NCBI with the
following URIs, where the id ranges from nc_000001 to nc_000024 plus
nc_001804:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=nc_000023&rettype=gbwithparts&retmode=text
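
For reference, the full set of URIs can be generated rather than typed by hand. The following is only a sketch: the class and method names are hypothetical, and the query parameters are simply copied from the URI above.

```java
import java.util.ArrayList;
import java.util.List;

final class EfetchUrls {
    private static final String BASE =
        "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi";

    /** Builds the efetch URI for one accession, e.g. "nc_000023". */
    static String urlFor(String accession) {
        return BASE + "?db=nuccore&id=" + accession
            + "&rettype=gbwithparts&retmode=text";
    }

    /** Accessions nc_000001 to nc_000024 plus nc_001804, as listed in the post. */
    static List<String> accessions() {
        List<String> ids = new ArrayList<>();
        for (int i = 1; i <= 24; i++) {
            ids.add(String.format("nc_%06d", i)); // zero-pad to six digits
        }
        ids.add("nc_001804");
        return ids;
    }

    public static void main(String[] args) {
        for (String id : accessions()) {
            System.out.println(urlFor(id));
        }
    }
}
```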

I then try to parse the files as described in
http://biojava.org/wiki/BioJava:BioJavaXDocs#Tools_for_reading.2Fwriting_files
but it won't work. While there are no problems parsing 1804 and 24,
chromosome 23 leads to an OutOfMemoryError although I gave it 2GB of heap
space.
Here is a stack trace (the line numbers might differ, because I already
tried to improve GenbankFormat.java in memory efficiency):
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
       at org.biojava.bio.seq.io.ChunkedSymbolListFactory.addSymbols(ChunkedSymbolListFactory.java:222)
       at org.biojavax.bio.seq.io.SimpleRichSequenceBuilder.addSymbols(SimpleRichSequenceBuilder.java:256)
       at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:535)
       at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
       at org.prodge.sequence_viewer.db.UpdateDB_Main.updateChromosome(UpdateDB_Main.java:537)
       at org.prodge.sequence_viewer.db.UpdateDB_Main.newGenome(UpdateDB_Main.java:468)
       at org.prodge.sequence_viewer.db.UpdateDB_Main.main(UpdateDB_Main.java:164)
The line in GenbankFormat.java is:

rlistener.addSymbols(
       symParser.getAlphabet(),
       (Symbol[])(sl.toList().toArray(new Symbol[0])),
       0, sl.length());
Sometimes it fails at the sl.toList().toArray() part, sometimes it fails
later inside the addSymbols method, but it always fails.

How can this be? I mean, the file is only 190MB in size, so 2GB of
memory should be more than enough. Browsing through the source code, I
discovered what I think of as very inefficient handling of sequences:
1) the sequence string is read from file into a StringBuffer
2) it is converted to a string (with whitespaces removed)
3) a SimpleSymbolList is created out of the string
4) the SymbolList is converted to a List of Symbols
5) the List is converted to an array of Symbols
6) the array is passed to addSymbols
7) there it is added to a ChunkedSymbolListFactory
8) if at some point the sequence is requested, a SymbolList is created
and then converted to a string.
You see, there is a lot of copying and converting, but in the end I have
the same string I started with. Well, I would have the string if it ever
reached the end, but it crashes before completing this process.
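
A rough calculation suggests why 2GB can run out. The sketch below is an estimate under stated assumptions, not a measurement. It treats the whole 190MB file as sequence (an upper bound) and assumes two character-based copies (steps 1 and 2) and three structures holding one object reference per symbol (steps 3 to 5), all live at the same time. It also assumes 4-byte references and 2 bytes per char, which holds for Strings and StringBuffers on JVMs of that era. The internal layout of SimpleSymbolList has not been checked, and the class and method names are invented for illustration.

```java
/** Back-of-the-envelope estimate; every figure is an assumption, not a measurement. */
final class CopyChainEstimate {
    static final int BYTES_PER_CHAR = 2; // a Java char is 16 bits

    /**
     * @param symbols           number of sequence symbols
     * @param charCopies        live character-based copies (e.g. StringBuffer, String)
     * @param referenceCopies   live structures holding one object reference per symbol
     * @param bytesPerReference 4 or 8, depending on the JVM
     * @return estimated bytes, ignoring object headers and spare buffer capacity
     */
    static long estimateBytes(long symbols, int charCopies,
                              int referenceCopies, int bytesPerReference) {
        long charBytes = symbols * BYTES_PER_CHAR * charCopies;
        long referenceBytes = symbols * bytesPerReference * referenceCopies;
        return charBytes + referenceBytes;
    }

    public static void main(String[] args) {
        long symbols = 190_000_000L; // upper bound: treat the whole file as sequence
        long bytes = estimateBytes(symbols, 2, 3, 4);
        System.out.println(bytes / (1024 * 1024) + " MB");
    }
}
```

Under these assumptions the chain needs 16 bytes per symbol, or about 3.0 billion bytes in total, which is already above a 2GB heap. With 8-byte references the figure would be higher still. A profiler or heap dump would be needed to confirm which copies are actually live together.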
Am I doing something wrong, or is there great potential for improving the
parsing of Genbank files?


Regards,
  Florian
_______________________________________________
Biojava-l mailing list  -  [email protected]
http://lists.open-bio.org/mailman/listinfo/biojava-l
--
Dipl. Inf. Florian Mittag
Universität Tuebingen
WSI-RA, Sand 1
72076 Tuebingen, Germany
Phone: +49 7071 / 29 78985  Fax: +49 7071 / 29 5091


