I think the problem is that you are populating the results List with /all/ of the blast data. This means that all the data from the complete report must be in memory in this List. A better approach is to write an object to replace builder in adapter.setSearchContentHandler(builder), which does all the processing as the data streams in from the parser. This will keep memory consumption down to the bare minimum.
There is some code that does this sort of thing in demos/ssbind, and it may be worth scanning the code for BlastLikeSearchBuilder for ideas.
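A minimal sketch of that idea, written in plain Java for a current JDK so that it runs without BioJava on the classpath. The HitEvents interface below is a hypothetical stand-in for the parser callbacks and is not the real SearchContentHandler interface, whose methods differ; the property keys are likewise made up for illustration. The only point is that each hit is written out the moment it ends, so nothing accumulates in a List.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.HashMap;
import java.util.Map;

class StreamingDemo {
    // Hypothetical stand-in for the parser callbacks; NOT BioJava's interface.
    interface HitEvents {
        void startHit();
        void addHitProperty(String key, String value);
        void endHit();
    }

    // Writes one tab-separated line per hit as soon as the hit is complete.
    // Only the properties of the hit currently being parsed are held in memory.
    static class StreamingHitWriter implements HitEvents {
        private final Writer out;
        private final String[] columns;
        private final Map<String, String> current = new HashMap<>();
        private int hitsWritten = 0;

        StreamingHitWriter(Writer out, String... columns) {
            this.out = out;
            this.columns = columns;
        }

        public void startHit() {
            current.clear();
        }

        public void addHitProperty(String key, String value) {
            current.put(key, value);
        }

        public void endHit() {
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < columns.length; i++) {
                if (i > 0) line.append('\t');
                // A property the parser never reported becomes an empty column.
                line.append(current.getOrDefault(columns[i], ""));
            }
            try {
                out.write(line.append('\n').toString());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            current.clear(); // drop the hit's data immediately
            hitsWritten++;
        }

        int getHitsWritten() {
            return hitsWritten;
        }
    }

    public static void main(String[] args) {
        StringWriter sink = new StringWriter();
        StreamingHitWriter handler = new StreamingHitWriter(sink, "subjectId", "score");
        // Simulate the events a parser would fire for two hits.
        handler.startHit();
        handler.addHitProperty("subjectId", "seqA");
        handler.addHitProperty("score", "512");
        handler.endHit();
        handler.startHit();
        handler.addHitProperty("subjectId", "seqB");
        handler.endHit();
        System.out.print(sink);
    }
}
```

In a real program the Writer would be a BufferedWriter over the output file, and the class would implement whatever methods the actual handler interface requires.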
Best,
Matthew
VERHOEF Frans wrote:
Hi Keith,
Thanks for your response. I have pasted the method that does the parsing below. I also just ran this method on a blast output file of approximately 350 MB. The output generated is this:
Before parsing: 402280 After parsing: 1043162496
The numbers are the memory used by Java, in bytes. That means that during the parsing (all biojava) the usage explodes from a mere 402 KB to about 1 GB. After that the size doesn't change much anymore.
For your information, I am using the following:
- NCBI Blast 2.2.4
- Java 1.4.2_01
- Linux
- Biojava from cvs, last updated on the 21st of October
Hopefully you will now tell me I am doing something stupid ;-)
private void parseBlastOutput(File file) throws Exception {
    Runtime r = Runtime.getRuntime();
    System.out.println("Before parsing: " +
        (r.totalMemory() - r.freeMemory()));
    InputStream is = new FileInputStream(file);
    BlastLikeSAXParser parser = new BlastLikeSAXParser();
    parser.setModeLazy();
    SeqSimilarityAdapter adapter = new SeqSimilarityAdapter();
    parser.setContentHandler(adapter);
    List results = new ArrayList();
    SearchContentHandler builder = new BlastLikeSearchBuilder(results,
        new DummySequenceDB("queries"), new DummySequenceDBInstallation());
    adapter.setSearchContentHandler(builder);
    parser.parse(new InputSource(is));
    for (Iterator i = results.iterator(); i.hasNext(); ) {
        System.out.println("Iterating: " +
            (r.totalMemory() - r.freeMemory()));
        SeqSimilaritySearchResult result =
            (SeqSimilaritySearchResult) i.next();
        org.biojava.bio.Annotation anno = result.getAnnotation();
        String queryID = (String) anno.getProperty("queryId");
        String database =
            this.parseNameFromDBPath((String) anno.getProperty("databaseId"));
        String lib = this.parseIDForLibrary(queryID);
        BlastSetting bsetting = null;
        if (lib != null && database != null)
            bsetting = adaptor.fetchSetting(lib, database);
        if (lib == null || database == null || bsetting == null) {
            // means no blast setting can be found for this library and database
            System.out.println("HELP!!!!!");
            throw new Exception("Cannot find Blast Setting in database for library "
                + lib + " and blastdatabase " + database);
        }
        File outFile = new File(destDir, queryID + ".out");
        BufferedWriter out = new BufferedWriter(new FileWriter(outFile));
        out.write("queryID\tqueryStart\tqueryEnd\tdatabase\tsubjectID\t"
            + "subjectStart\tsubjectEnd\tscore\teValue\tDescription\n");
        List hits = result.getHits();
        //System.out.println("Start writing with " + hits.size() + " hits.");
        for (int j = 0; j < hits.size(); j++) {
            SeqSimilaritySearchHit hit =
                (SeqSimilaritySearchHit) hits.get(j);
            if (hit.getEValue() > bsetting.getMaxEValue()) {
                break;
            }
            //System.out.println("HIT!!!");
            org.biojava.bio.Annotation hitAnno = hit.getAnnotation();
            String description =
                hitAnno.containsProperty("subjectDescription") ?
                (String) hitAnno.getProperty("subjectDescription") : "No Description";
            out.write(queryID + "\t");
            out.write(hit.getQueryStart() + "\t");
            out.write(hit.getQueryEnd() + "\t");
            out.write(database + "\t");
            out.write(hit.getSubjectID() + "\t");
            out.write(hit.getSubjectStart() + "\t");
            out.write(hit.getSubjectEnd() + "\t");
            out.write(hit.getScore() + "\t");
            out.write(hit.getEValue() + "\t");
            out.write(description + "\n");
            out.flush();
            hitAnno = null; description = null; hit = null;
            System.gc();
        }
        out.close();
        hits = null; out = null; outFile = null; bsetting = null; lib = null;
        database = null; queryID = null; anno = null; result = null;
        System.gc();
    }
    file.delete();
}
-----Original Message-----
From: Keith James [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 12, 2003 12:25 AM
To: VERHOEF Frans
Cc: [EMAIL PROTECTED]
Subject: Re: [Biojava-l] BLAST parsing explodes in size
"FV" == VERHOEF Frans <[EMAIL PROTECTED]> writes:
FV> Hi, I am having a problem parsing huge blast
FV> results. Basically I am parsing the blast results pretty much
FV> the same way as in "Biojava in Anger", the only difference being
FV> that I use the setModeLazy() of the BlastLikeSAXParser, since
FV> I am using NCBI Blast version 2.2.4 and that version is not
FV> recognised by the parser yet.
Using blast 2.2.4 or 2.2.6 is safe in lazy mode - diffs show only minor whitespace changes in the format.
FV> Besides that the only difference lies in the things I do with
FV> the data.
This is likely to be the cause of the problem. See below.
FV> The problem is that when I parse a blast result that is a few
FV> hundred MB, for example 300MB, the java application is
FV> ballooning up to around 1.6GB of memory. Sometimes the
FV> application even crashes because I only have got 2GB to play
FV> with.
The parser uses an event driven framework which is designed to handle very big data - it will handle multi-GB reports. However, if you create many fine-grained objects for every element of every report you will quickly run out of resources.
FV> Does anyone know what's causing this? Is it because I set the
FV> lazy mode? Is there any way to work around it?
Either you need to think about which elements of the report you are interested in and build a filter which captures those events, discarding the rest. See the demos/ssbind package for an example by Matthew. Or if you really need all those objects then you should look at allowing them to be garbage-collected as soon as possible.
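As a rough illustration of the first option, here is a sketch of such a filter in plain Java for a current JDK. The ReportEvents interface and the property keys are hypothetical and stand in for the real parser events, which are not reproduced here; the filter passes structural events through and silently drops every property that was not asked for, so the consumer downstream never allocates anything for the rest of the report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FilterDemo {
    // Hypothetical callback interface standing in for the parser's events.
    interface ReportEvents {
        void startHit();
        void property(String key, String value);
        void endHit();
    }

    // Forwards hit boundaries unchanged, but only forwards properties
    // whose key is in the wanted set; everything else is discarded.
    static class KeyFilter implements ReportEvents {
        private final ReportEvents downstream;
        private final Set<String> wanted;
        private int dropped = 0;

        KeyFilter(ReportEvents downstream, String... wantedKeys) {
            this.downstream = downstream;
            this.wanted = new HashSet<>(Arrays.asList(wantedKeys));
        }

        public void startHit() {
            downstream.startHit();
        }

        public void property(String key, String value) {
            if (wanted.contains(key)) {
                downstream.property(key, value);
            } else {
                dropped++; // never reaches the consumer
            }
        }

        public void endHit() {
            downstream.endHit();
        }

        int getDropped() {
            return dropped;
        }
    }

    // Demo consumer that records what it receives.
    static class Recorder implements ReportEvents {
        final List<String> seen = new ArrayList<>();

        public void startHit() { seen.add("<hit>"); }
        public void property(String key, String value) { seen.add(key + "=" + value); }
        public void endHit() { seen.add("</hit>"); }
    }

    public static void main(String[] args) {
        Recorder recorder = new Recorder();
        KeyFilter filter = new KeyFilter(recorder, "subjectId", "expectValue");
        // Simulated events for a single hit; "alignment" is not wanted.
        filter.startHit();
        filter.property("subjectId", "seqA");
        filter.property("alignment", "ACGTACGT");
        filter.property("expectValue", "1e-30");
        filter.endHit();
        System.out.println(recorder.seen + " dropped=" + filter.getDropped());
    }
}
```

Because the filter and the consumer share one interface, filters of this kind can be chained in front of whatever finally writes the output.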
It is possible that there is a bug somewhere, but without seeing any code it isn't possible to say much more. If you need more help, post a short (working) piece of code illustrating the problem and we will do our best.
hth
Keith
--
- Keith James <[EMAIL PROTECTED]> Microarray Facility, Team 65 -
- The Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK -
_______________________________________________ Biojava-l mailing list - [EMAIL PROTECTED] http://biojava.org/mailman/listinfo/biojava-l
