This is a big problem, and I've come across it before. SeqIOTools.fileToBiojava reads the whole file in at once and holds every record in memory as Sequence objects in a virtual sequence database. For a file the size of nr, that is simply impossible on most machines, and it causes out-of-memory errors.
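To make the contrast concrete, here is a minimal sketch in plain Java of an iterator that parses one FASTA record per call and holds nothing else in memory. It is not BioJava code and does not use the BioJava API: `FastaRecord`, `LazyFastaIterator` and `LazyFastaDemo` are hypothetical names made up for illustration, and the parsing is deliberately naive (no alphabet checking, no annotation).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical value class: one FASTA entry (header without '>' plus residues). */
final class FastaRecord {
    final String header;
    final String residues;

    FastaRecord(String header, String residues) {
        this.header = header;
        this.residues = residues;
    }
}

/** Hypothetical iterator that parses exactly one record per call to next(). */
final class LazyFastaIterator implements Iterator<FastaRecord> {
    private final BufferedReader in;
    private String pendingHeader; // header line of the next record, or null at end of input

    LazyFastaIterator(BufferedReader in) {
        this.in = in;
        this.pendingHeader = readUntilHeader();
    }

    @Override
    public boolean hasNext() {
        return pendingHeader != null;
    }

    @Override
    public FastaRecord next() {
        if (pendingHeader == null) {
            throw new NoSuchElementException();
        }
        String header = pendingHeader.substring(1).trim();
        StringBuilder residues = new StringBuilder();
        pendingHeader = null;
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    pendingHeader = line; // start of the following record; stop here
                    break;
                }
                residues.append(line.trim());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new FastaRecord(header, residues.toString());
    }

    /** Skips anything before the first '>' line and returns that line, or null if none. */
    private String readUntilHeader() {
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    return line;
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class LazyFastaDemo {
    public static void main(String[] args) {
        // Small in-memory input; a real run would wrap a FileReader instead.
        String fasta = ">seq1 first\nMKV\nLAA\n>seq2 second\nGGH\n";
        Iterator<FastaRecord> it =
                new LazyFastaIterator(new BufferedReader(new StringReader(fasta)));
        int count = 0;
        while (it.hasNext()) {
            FastaRecord r = it.next(); // the previous record is now unreachable
            count++;
            System.out.println(r.header + " " + r.residues.length());
        }
        System.out.println("number of sequences " + count);
    }
}
```

Memory use here is bounded by the largest single record rather than by the file size, provided the caller does not keep references to earlier records. The cost is that access is strictly sequential: there is no way to look a record up by name without scanning from the start.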
What is required for files this size is a SeqIOTools parser that reads sequence objects _on demand_, as the iterator requests them, rather than reading the whole lot at once. It can then drop each sequence object once the iterator has moved past it, freeing up memory for subsequent ones (assuming the client app keeps no references to them either). I don't know how this fits in with BioJava's "everything is a sequence database" philosophy, as it essentially breaks it by defining a file to be a sequential-access sequence database rather than a random-access one.

Can someone clarify whether a lazy-loading parser/database implementation already exists for situations like this, or does one need to be written?

cheers,
Richard

Richard Holland
Bioinformatics Specialist
GIS extension 8199

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Gem Yang
> Sent: Friday, July 01, 2005 2:30 AM
> To: biojava-l@biojava.org
> Subject: [Biojava-l] memory leak while reading nr.fasta
>
>
> Hi,
>
> I am new to Biojava.
> I have the following program, which is copied from ReadFaster2 in the
> cookbook.
>
> public static void main(String[] args) {
>     try {
>         // args[0] is nr.fasta
>         BufferedReader br = new BufferedReader(new FileReader(args[0]));
>
>         String format = "FASTA";
>         String alphabet = "PROTEIN";
>
>         SequenceIterator iter =
>             (SequenceIterator)SeqIOTools.fileToBiojava(format, alphabet, br);
>
>         int count = 0;
>         long start = System.currentTimeMillis();
>         while (iter.hasNext())
>         {
>             Sequence s = iter.nextSequence();
>             String name = s.getName();
>
>             //System.out.println(name);
>             s.getAnnotation();
>             //System.out.println(s.seqString());
>             count++;
>             System.out.println(count);
>
>         }
>         long end = System.currentTimeMillis();
>         System.out.println("number of sequence " + count);
>         System.out.println("time used" + (end-start)/1000 + "seconds");
>         System.out.println((end-start)/1000/60 + "minutes");
>     }
>     catch (FileNotFoundException ex) {
>         //can't find file specified by args[0]
>         ex.printStackTrace();
>     } catch (BioException ex) {
>         //error parsing requested format
>         ex.printStackTrace();
>     }
> }
>
> When running this code, I got out of memory error in about
> half an hour and
> 1.5GB memory allocated. My workstation is a Windows XP with
> 2 GB of memory.
> My biojava version is 1.3. My JRE is one came with Websphere
> application
> developer.
>
> Thanks.
> Gem
> _______________________________________________
> Biojava-l mailing list - Biojava-l@biojava.org
> http://biojava.org/mailman/listinfo/biojava-l
>