Btw: Should we move this to Biojava-dev?

probably, yes! :)

And where do I sign up for BioJava3 development? ;-)

Andreas Prlic has the keys to the project these days. BJ3 does already have some new code in place for handling sequences as strings but it's in an out-of-the-way bit of the repository and is not part of the main roadmap for the project at present. The current focus is on modularising the existing bits, so that individual components can be refactored to behave better at a future date.

If you want to explore my ideas for a replacement Sequence model, the code and docs are here (sequence handling is in the 'core' module with DNA-specifics in the 'dna' module):

http://biojava.org/wiki/BioJava3:HowTo
http://www.biojava.org/wiki/BioJava3_project

(Methods such as file parsers would request Strings (or ideally CharSequence, which is more flexible and which String implements) as parameters whenever they don't care about content. If they care about content but don't care in advance about size or random access, then they should request Iterator<Symbol>, which can be used to wrap a String and parse on demand. If they need full functionality, then they should request List<Symbol>; the default implementation uses ArrayLists, but there's no reason a String-backed one couldn't be written as well.)
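
To make the three levels concrete, here is a minimal sketch of a lazily parsing Iterator and a String-backed List. It is only an illustration: the Nucleotide enum and all class names are hypothetical stand-ins and are not taken from the BioJava3 code base.

```java
import java.util.AbstractList;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a symbol type; NOT the BioJava3 API. */
enum Nucleotide {
    A, C, G, T, N;

    static Nucleotide fromChar(char c) {
        switch (Character.toUpperCase(c)) {
            case 'A': return A;
            case 'C': return C;
            case 'G': return G;
            case 'T': return T;
            default:  return N; // anything unrecognised is treated as "unknown"
        }
    }
}

/** Converts characters to symbols on demand; the sequence is never copied. */
final class CharSequenceSymbolIterator implements Iterator<Nucleotide> {
    private final CharSequence chars;
    private int pos = 0;

    CharSequenceSymbolIterator(CharSequence chars) { this.chars = chars; }

    @Override public boolean hasNext() { return pos < chars.length(); }

    @Override public Nucleotide next() {
        if (!hasNext()) throw new NoSuchElementException();
        return Nucleotide.fromChar(chars.charAt(pos++));
    }

    @Override public void remove() { throw new UnsupportedOperationException(); }
}

/** Read-only List view backed directly by the characters (random access). */
final class CharSequenceSymbolList extends AbstractList<Nucleotide> {
    private final CharSequence chars;

    CharSequenceSymbolList(CharSequence chars) { this.chars = chars; }

    @Override public Nucleotide get(int index) {
        return Nucleotide.fromChar(chars.charAt(index));
    }

    @Override public int size() { return chars.length(); }
}

final class LazySymbolDemo {
    /** A consumer that cares about content but not about size or random access. */
    static int countG(Iterator<Nucleotide> symbols) {
        int n = 0;
        while (symbols.hasNext()) {
            if (symbols.next() == Nucleotide.G) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        CharSequence raw = "gattaca-ggg";
        System.out.println(countG(new CharSequenceSymbolIterator(raw)));
        System.out.println(new CharSequenceSymbolList(raw).get(2));
    }
}
```

In both wrappers the only per-sequence storage is the original character data, so a consumer can choose the cheapest interface that meets its needs.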

cheers,
Richard


- Florian

On Mon, Jul 27, 2009 at 8:16 PM, Florian Mittag<[email protected]> wrote:
Hi Mark!

On Saturday, 25. July 2009 04:20, Mark Schreiber wrote:
I don't think anyone has done much or anything to optimize these
parsers. The process you outline sounds extremely inefficient. It is
also likely to lead to memory leaks due to the number of copy
operations.

I wouldn't necessarily say that it leads to memory leaks, but it definitely leads to high memory consumption (2GB are not enough for a 200MB file). Also, my outline of the process is based on only 2 hours of viewing the code, so I actually expected to be corrected on this. Unfortunately, it seems I did get the right idea and it IS extremely inefficient.

I mean, I understand that this is a high level of abstraction that might come in handy in many situations, but it certainly is more of an obstacle
in my specific case.

As always with Java, don't try to optimize without a profiler, which will tell you which methods are taking a long time and which objects take the most memory.
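
Short of a real profiler, a crude first check is to ask the JVM how much heap is in use before and after a suspect step. The sketch below uses only java.lang.Runtime; the HeapProbe class name and the test allocation are made up for illustration, and the numbers it prints are approximate because the garbage collector can run at any time.

```java
final class HeapProbe {
    /** Approximate number of heap bytes currently in use. */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println("max heap (MB): " + Runtime.getRuntime().maxMemory() / mb);

        long before = usedBytes();
        // Illustrative allocation: 10 million chars, i.e. about 20 MB.
        char[] block = new char[10 * 1000 * 1000];
        long after = usedBytes();

        System.out.println("approx. growth (MB): " + (after - before) / mb);
        System.out.println("chars allocated: " + block.length); // keeps 'block' reachable
    }
}
```

This says nothing about which objects hold the memory; for that a heap profiler or heap dump is still needed.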

I think we should continue this discussion on the biojava-dev list or in
a private conversation, as it will probably get very detailed and
technical.


My question to this list again:
Is there a way to achieve my goal of parsing a 200MB Genbank file with
the current biojava version without code changes?


- Florian

On 25 Jul 2009, 1:33 AM, "Florian Mittag"
<[email protected]> wrote:

Hi!

I think this is a problem worthy of its own thread, so I'll start one:

I want to store all human chromosomes in a BioSQL database after loading the information from .gbk files. I get the files from NCBI using URIs of the following form, where the id ranges from nc_000001 to nc_000024, plus nc_001804:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=nc_000023&rettype=gbwithparts&retmode=text

I then try to parse the files as described in
http://biojava.org/wiki/BioJava:BioJavaXDocs#Tools_for_reading.2Fwriting_files
but it won't work. While there are no problems parsing 1804 and 24, chromosome 23 leads to an OutOfMemoryError although I gave it 2GB of heap space.

Here is a stack trace (the line numbers might differ, because I already tried to improve the memory efficiency of GenbankFormat.java):

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
       at org.biojava.bio.seq.io.ChunkedSymbolListFactory.addSymbols(ChunkedSymbolListFactory.java:222)
       at org.biojavax.bio.seq.io.SimpleRichSequenceBuilder.addSymbols(SimpleRichSequenceBuilder.java:256)
       at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:535)
       at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
       at org.prodge.sequence_viewer.db.UpdateDB_Main.updateChromosome(UpdateDB_Main.java:537)
       at org.prodge.sequence_viewer.db.UpdateDB_Main.newGenome(UpdateDB_Main.java:468)
       at org.prodge.sequence_viewer.db.UpdateDB_Main.main(UpdateDB_Main.java:164)

The line in GenbankFormat.java is:

rlistener.addSymbols(
       symParser.getAlphabet(),
       (Symbol[])(sl.toList().toArray(new Symbol[0])),
       0, sl.length());

Sometimes it fails at the sl.toList().toArray() part, sometimes it fails later inside the addSymbols method, but it always fails.
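
One general way to avoid materialising the whole sequence as a single array is to hand it to the listener in fixed-size chunks that reuse one small buffer. The sketch below shows the idea only. SymbolSink is a hypothetical interface loosely modelled on the shape of the addSymbols(...) call above, working on chars instead of Symbol objects; it is not the BioJava listener API, and it assumes the sink consumes each chunk before returning.

```java
/** Hypothetical receiver; NOT the BioJava SeqIOListener interface. */
interface SymbolSink {
    /** Must consume buffer[offset .. offset+length) before returning. */
    void addSymbols(char[] buffer, int offset, int length);
}

final class ChunkedFeeder {
    /**
     * Strips whitespace and forwards the remaining characters in chunks of at
     * most chunkSize, reusing a single buffer for every call.
     */
    static void feed(CharSequence seq, int chunkSize, SymbolSink sink) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        char[] buf = new char[chunkSize];
        int filled = 0;
        for (int i = 0; i < seq.length(); i++) {
            char c = seq.charAt(i);
            if (Character.isWhitespace(c)) continue; // skip layout characters
            buf[filled++] = c;
            if (filled == chunkSize) {               // buffer full: flush it
                sink.addSymbols(buf, 0, filled);
                filled = 0;
            }
        }
        if (filled > 0) {                            // flush the final partial chunk
            sink.addSymbols(buf, 0, filled);
        }
    }

    public static void main(String[] args) {
        final StringBuilder received = new StringBuilder();
        feed("gat tac\nag", 3, new SymbolSink() {
            @Override public void addSymbols(char[] buffer, int offset, int length) {
                received.append(buffer, offset, length);
            }
        });
        System.out.println(received);
    }
}
```

With this shape the peak extra memory is one buffer of chunkSize chars, independent of the sequence length.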

How can this be? I mean, the file is only 190MB in size, so 2GB of
memory should be more than enough. Browsing through the source code, I discovered what I think of as very inefficient handling of sequences:

1) the sequence string is read from file into a StringBuffer
2) it is converted to a string (with whitespace removed)
3) a SimpleSymbolList is created out of the string
4) the SymbolList is converted to a List of Symbols
5) the List is converted to an array of Symbols
6) the array is passed to addSymbols
7) there it is added to a ChunkedSymbolListFactory
8) if at some point the sequence is requested, a SymbolList is created
and then converted to a string.

You see, there is a lot of copying and converting, but in the end I have the same string I started with. Well, I would have the string if the process ever reached the end, but it crashes before completing.
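
A back-of-the-envelope estimate shows how such a chain can exceed the heap. The sketch below rests on assumptions that are not taken from the BioJava source: roughly 150 million symbols in the sequence, two character-based copies alive at once (the StringBuffer and the String, at 2 bytes per Java char), three copies that each hold one object reference per symbol (symbol list, List, array), 4-byte references, and nothing garbage collected in between. Under those assumptions the total is about 2.4 billion bytes, which is already more than a 2GB heap.

```java
final class CopyChainEstimate {
    /**
     * Estimated bytes held if every stage keeps its own copy.
     *
     * @param symbols    number of symbols in the sequence (assumed)
     * @param charCopies copies stored as Java chars, 2 bytes each
     * @param refCopies  copies stored as one object reference per symbol (assumed)
     * @param refWidth   bytes per reference: 4 on 32-bit JVMs, 8 on plain 64-bit
     */
    static long estimateBytes(long symbols, int charCopies, int refCopies, int refWidth) {
        long charBytes = 2L * symbols * charCopies;
        long refBytes = (long) refWidth * symbols * refCopies;
        return charBytes + refBytes;
    }

    public static void main(String[] args) {
        long symbols = 150L * 1000 * 1000; // illustrative figure, not measured
        long mb = 1024L * 1024L;
        System.out.println("4-byte refs (MB): " + estimateBytes(symbols, 2, 3, 4) / mb);
        System.out.println("8-byte refs (MB): " + estimateBytes(symbols, 2, 3, 8) / mb);
    }
}
```

The real figure depends on which of the steps actually copy and which only create views, which is exactly what a profiler would confirm.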


Am I doing something wrong, or is there great potential for improving the parsing of Genbank files?


Regards,
  Florian
_______________________________________________
Biojava-l mailing list  -  [email protected]
http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Dipl. Inf. Florian Mittag
Universität Tuebingen
WSI-RA, Sand 1
72076 Tuebingen, Germany
Phone: +49 7071 / 29 78985  Fax: +49 7071 / 29 5091
