Hi Mark!

On Saturday, 25. July 2009 04:20, Mark Schreiber wrote:
> I don't think anyone has done much or anything to optimize these parsers.
> The process you outline sounds extremely inefficient. It is also likely to
> lead to memory leaks due to the number of copy operations.
I wouldn't necessarily say that it leads to memory leaks, but it definitely
leads to high memory consumption (2GB is not enough for a 200MB file).
Also, my outline of the process is based on only two hours of reading the
code, so I actually expected to be corrected on this. Unfortunately, it seems
I did get the right idea and it IS extremely inefficient. I understand that
this high level of abstraction might come in handy in many situations, but it
certainly is more of an obstacle in my specific case.

> As always with java, don't try and optimize without a profiler which will
> tell you which methods are taking a long time and which objects take the
> most memory.

I think we should continue this discussion on the biojava-dev list or in a
private conversation, as it will probably get very detailed and technical.

My question to this list again:
Is there a way to achieve my goal of parsing a 200MB Genbank file with the
current biojava version without code changes?

- Florian

> On 25 Jul 2009, 1:33 AM, "Florian Mittag" <[email protected]>
> wrote:
>
> Hi!
>
> I think this is a problem worthy of its own thread, so I'll start one:
>
> I want to store all human chromosomes in a BioSQL database after loading
> the information from .gbk files. I get the files from NCBI using the
> following URIs, where the id ranges from nc_000001 to nc_000024 plus
> nc_001804:
>
> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=nc_000023&rettype=gbwithparts&retmode=text
>
> I then try to parse the files as described in
> http://biojava.org/wiki/BioJava:BioJavaXDocs#Tools_for_reading.2Fwriting_files
> but it won't work. While there are no problems parsing 1804 and 24,
> chromosome 23 leads to an OutOfMemory exception although I gave it 2GB of
> heap space.
>
> Here is a stack trace (the line numbers might differ, because I already
> tried to improve GenbankFormat.java in memory efficiency):
>
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at org.biojava.bio.seq.io.ChunkedSymbolListFactory.addSymbols(ChunkedSymbolListFactory.java:222)
>     at org.biojavax.bio.seq.io.SimpleRichSequenceBuilder.addSymbols(SimpleRichSequenceBuilder.java:256)
>     at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:535)
>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
>     at org.prodge.sequence_viewer.db.UpdateDB_Main.updateChromosome(UpdateDB_Main.java:537)
>     at org.prodge.sequence_viewer.db.UpdateDB_Main.newGenome(UpdateDB_Main.java:468)
>     at org.prodge.sequence_viewer.db.UpdateDB_Main.main(UpdateDB_Main.java:164)
>
> The line in GenbankFormat.java is:
>
> rlistener.addSymbols(
>         symParser.getAlphabet(),
>         (Symbol[])(sl.toList().toArray(new Symbol[0])),
>         0, sl.length());
>
> Sometimes it fails at the sl.toList().toArray()-part, sometimes it fails
> later inside the addSymbols method, but it always fails.
>
> How can this be? I mean, the file is only 190MB in size, so 2GB of memory
> should be more than enough. Browsing through the source code, I discovered
> what I think of as very inefficient handling of sequences:
>
> 1) the sequence string is read from file into a StringBuffer
> 2) it is converted to a string (with whitespaces removed)
> 3) a SimpleSymbolList is created out of the string
> 4) the SymbolList is converted to a List of Symbols
> 5) the List is converted to an array of Symbols
> 6) the array is passed to addSymbols
> 7) there it is added to a ChunkedSymbolListFactory
> 8) if at some point the sequence is requested, a SymbolList is created and
>    then converted to a string.
>
> You see, there is a lot of copying and converting, but in the end I have
> the same string I started with.
> Well, I would have the string if it ever reached the end, but it crashes
> before completing this process.
>
> Am I doing something wrong, or is there great potential for improving the
> parsing of Genbank files?
>
> Regards,
> Florian
> _______________________________________________
> Biojava-l mailing list - [email protected]
> http://lists.open-bio.org/mailman/listinfo/biojava-l

-- 
Dipl. Inf. Florian Mittag
Universität Tuebingen
WSI-RA, Sand 1
72076 Tuebingen, Germany
Phone: +49 7071 / 29 78985  Fax: +49 7071 / 29 5091
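
As a rough sanity check on the copy chain in steps 1) to 8) of the quoted
message, the sketch below adds up what the intermediate copies could cost if
they were all alive at the same time. Everything numeric here is an
assumption, not a measurement: the symbol count of 150 million is a
hypothetical round figure for a file of about 190MB, the reference size of 4
or 8 bytes depends on the JVM, and the class and method names are made up for
illustration. It also assumes that each step holds one full-length copy, which
has not been checked against the BioJava source.

```java
/**
 * Back-of-the-envelope estimate for the copy chain described above.
 * Hypothetical helper, not part of BioJava.
 */
public class CopyChainEstimate {

    // A Java char is 16 bits, so character data costs 2 bytes per symbol.
    static final long BYTES_PER_CHAR = 2;

    /**
     * Worst case: every intermediate representation is live at once.
     * ASSUMPTION: each step holds one full-length copy of the sequence.
     *
     * @param symbols           number of sequence symbols (assumed value)
     * @param bytesPerReference size of one object reference (4 or 8, JVM dependent)
     */
    static long worstCaseBytes(long symbols, long bytesPerReference) {
        long stringBuffer = symbols * BYTES_PER_CHAR;    // step 1
        long string       = symbols * BYTES_PER_CHAR;    // step 2
        long symbolList   = symbols * bytesPerReference; // step 3
        long list         = symbols * bytesPerReference; // step 4
        long array        = symbols * bytesPerReference; // step 5
        return stringBuffer + string + symbolList + list + array;
    }

    public static void main(String[] args) {
        long symbols = 150_000_000L; // hypothetical sequence length
        for (long refSize : new long[] {4, 8}) {
            long bytes = worstCaseBytes(symbols, refSize);
            System.out.println(refSize + "-byte references: "
                    + bytes / (1024 * 1024) + " MB");
        }
    }
}
```

With these assumed inputs both totals come out above a 2GB heap, which would
be consistent with the OutOfMemoryError, but only a profiler run can show
which objects actually dominate.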
