This is why we all love the internet and the community. What are the chances of this happening? You are talking about World Peace, and Kofi Annan butts in. :)

I found that strange also (that there are larger headers preceding the troublesome one). Maybe (and this is a long shot) there is some buffer which gets filled at that particular record or point in the file? (Does the error move to a different record if we delete a couple of the initial FASTA entries?)

Mind you, this is NOT the only FlyBase FASTA file I get errors with (the same happens with the dpse one, v2.3, and I am sure there are others).

I am interested in the solution, and so are a ton of other people who use BioJava and particularly verbose FASTA files.

I love FlyBase and BioJava.

JP

On Wed, Apr 29, 2009 at 4:08 AM, Josh Goodman <[email protected]> wrote:

>
> Hi Richard and JP,
>
> I think I can be of some help as I'm the FlyBase developer responsible for
> generating these troublesome FASTA files :-). The cause of this problem
> appears to be the description line length for the record FBpp0145470.
>
> The trouble lies in org.biojavax.bio.seq.io.FastaFormat in the while loop
> at line 196. Biojava correctly reads in FBpp0145468 but throws an error
> when trying to parse FBpp0145469. There is nothing wrong in FBpp0145469
> but when biojava reaches the end of the sequence it reads in the header
> for the next record (FBpp0145470). It then tries to reset the
> BufferedReader to the start of FBpp0145470 but that is where the exception
> is thrown because line 197 sets the read ahead limit to 500 characters and
> the reader.readLine() command exceeds that limit.
>
> What isn't obvious to me is why other large definition lines that precede
> that line don't throw the same error (e.g. FBpp0157909). I guess the
> javadoc on BufferedReader.mark() does say "may fail" but I assumed it
> would be more predictable than that.
>
> The file in question can be downloaded from
>
> ftp://ftp.flybase.net/genomes/Drosophila_grimshawi/dgri_r1.3_FB2008_07/fasta/dgri-all-translation-r1.3.fasta.gz
> .
>
> If there is interest in a solution that doesn't involve simply upping the
> read ahead limit I can put a patch file together in the next day or so.
>
> Cheers,
> Josh
>
> On Tue, 28 Apr 2009, Richard Holland wrote:
>
> > You're right, doesn't look like newlines.
> >
> > The "Mark invalid" happens when the parser looks too far ahead in the
> > file attempting to seek out the next valid sequence to parse. I'm not
> > sure why this is happening.
> >
> > I don't have the time to test right now but if you could post the link
> > to where someone could download the same FASTA as you're using, then it
> > would make it possible for someone else to investigate in more detail.
> >
> > thanks,
> > Richard
> >
> > JP wrote:
> > > Thanks Richard for your prompt reply.
> > >
> > > I will not attach the fasta file I am parsing (12MB) its
> > > dgri-all-translation-r1.3.fasta from the flybase project.
> > >
> > > If the file had any extra new lines I would see them when I loaded it in
> > > a text editor - no ?
> > >
> > > I implemented the whole thing without using Biojava (for this part)
> > >
> > > fr = new FileReader(fastaProteinFileName);
> > > br = new BufferedReader(fr);
> > > String fastaLine;
> > > String startAccession = '>' + accessionId.trim();
> > > String fastaEntry = "";
> > > boolean record = false;
> > > while ((fastaLine = br.readLine()) != null) {
> > >     fastaLine = fastaLine.trim() + '\n';
> > >     if (fastaLine.startsWith(startAccession)) {
> > >         record = true;
> > >     } else if (record && fastaLine.startsWith(">")) {
> > >         record = false;
> > >         break;
> > >     }
> > >     if (record) {
> > >         fastaEntry += fastaLine;
> > >     }
> > > }
> > >
> > >
> > > Notice - I do not use regex - since I'd need to read the whole file and
> > > then regex upon it (if the record is the first one - I just read that one).
> > >
> > > Cheers
> > > JP
> > >
> > >
> > > On Tue, Apr 28, 2009 at 3:27 PM, Richard Holland
> > > <[email protected]> wrote:
> > >
> > > The "Mark invalid" exception is indicating that the parser has gone too
> > > far ahead in the file looking for a valid header. I'm not sure why but
> > > looking at your original query, there may be extra newlines embedded
> > > into your FASTA header line? That would definitely confuse it.
> > >
> > > The parser is not able to currently pull out just one sequence - in
> > > effect this is a search facility, which it doesn't have. :(
> > >
> > > thanks,
> > > Richard
> > >
> > > JP wrote:
> > > > Hi all at BioJava,
> > > >
> > > > I am trying to parse several FASTA files using the following code:
> > > >
> > > > fr = new FileReader(fastaProteinFileName);
> > > >> br = new BufferedReader(fr);
> > > >>
> > > >> RichSequenceIterator protIter = IOTools.readFastaProtein(br, null);
> > > >> while (protIter.hasNext()) {
> > > >>     BioEntry bioEntry = protIter.nextBioEntry();
> > > >>     System.out.println (fastaProteinFileName + " == " + accessionId + " = " + bioEntry.getAccession());
> > > >> }
> > > >
> > > >
> > > > At particular points in my fasta file - I get the following exception:
> > > >
> > > > 14:53:42,546 ERROR FastaFileProcessing - File parsing exception (from biojava library)
> > > >> org.biojava.bio.BioException: Could not read sequence
> > > >>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
> > > >>     at org.biojavax.bio.seq.io.RichStreamReader.nextBioEntry(RichStreamReader.java:99)
> > > >>     at edu.imperial.msc.orthologue.fasta.FastaFileProcessing.getProteinSequenceFromFASTAFile(FastaFileProcessing.java:60)
> > > >>     at edu.imperial.msc.orthologue.core.OrthologueFinder.getFASTAEntries(OrthologueFinder.java:64)
> > > >>     at edu.imperial.msc.orthologue.core.OrthologueFinder.<init>(OrthologueFinder.java:51)
> > > >>     at edu.imperial.msc.orthologue.launcher.OrthologueFinderLauncher.main(OrthologueFinderLauncher.java:60)
> > > >> Caused by: java.io.IOException: Mark invalid
> > > >>     at java.io.BufferedReader.reset(Unknown Source)
> > > >>     at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
> > > >>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
> > > >>     ... 5 more
> > > >
> > > >
> > > > Interestingly if I delete the header portion of the header line (from
> > > > type=protein... till the end of the line ...Dgri;)
> > > >
> > > >> FBpp0145468 type=protein;
> > > >> loc=scaffold_15252:join(13219687..13219727,13219972..13220279,13220507..13220798,13220861..13221180,13221286..13221467,13222258..13222629,13226331..13226463,13226531..13226658);
> > > >> ID=FBpp0145468; name=Dgri\GH11562-PA; parent=FBgn0119042,FBtr0146976;
> > > >> dbxref=FlyBase:FBpp0145468,FlyBase_Annotation_IDs:GH11562-PA;
> > > >> MD5=c8dc38c7197a0d3c93c78b08059e2604; length=591; release=r1.3;
> > > >> species=Dgri;
> > > >>
> > > >
> > > > It works - but I have a number of these exceptions (and I do not want to
> > > > edit the original data). Mind you I have longer headers in my file which
> > > > are parsed OK (strange!).
> > > >
> > > > Any ideas anyone ? Alternatively - is there a better way how to get ONE
> > > > SINGLE sequence from the whole fasta file give that I have the accession id
> > > > (FBpp0145468) ?
> > > >
> > > > Many Thanks
> > > > JP
> > > > _______________________________________________
> > > > Biojava-l mailing list - [email protected]
> > > > http://lists.open-bio.org/mailman/listinfo/biojava-l
> > > >
> > >
> > > --
> > > Richard Holland, BSc MBCS
> > > Finance Director, Eagle Genomics Ltd
> > > T: +44 (0)1223 654481 ext 3 | E: [email protected]
> > > http://www.eaglegenomics.com/
> > >
> >
> > --
> > Richard Holland, BSc MBCS
> > Finance Director, Eagle Genomics Ltd
> > T: +44 (0)1223 654481 ext 3 | E: [email protected]
> > http://www.eaglegenomics.com/
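
Josh's account of the "Mark invalid" error, and his puzzle about why other long headers get through, can be reproduced in a few lines. What follows is only a sketch and is not BioJava code: the class MarkResetDemo, the method peekAndReset, the sample text and the tiny buffer sizes are all made up for illustration. It assumes the OpenJDK implementation of java.io.BufferedReader, where (as far as I can tell) a mark is only dropped when the internal buffer has to be refilled after readAheadLimit or more characters have been read past the mark.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical demo class; none of these names come from BioJava.
final class MarkResetDemo {
    // The second line (16 chars) is longer than the read-ahead limit used below.
    static final String SAMPLE = "12345678\nABCDEFGHIJKLMNOP\ntail\n";

    /**
     * Reads one line, marks the stream, reads the next line, then tries to rewind.
     * Returns "reset ok" on success or the IOException message on failure.
     */
    static String peekAndReset(String text, int bufferSize, int readAheadLimit) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text), bufferSize)) {
            reader.readLine();            // consume the first line
            reader.mark(readAheadLimit);  // remember the start of the second line
            reader.readLine();            // read well past the limit
            reader.reset();               // rewind; throws if the mark was invalidated
            return "reset ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Whole text fits in one buffer: nothing is refilled after mark(),
        // so reset() works even though the limit was exceeded.
        System.out.println(peekAndReset(SAMPLE, 8192, 4));
        // With a 16-char buffer the second line straddles a refill,
        // and the refill is where the over-long mark gets dropped.
        System.out.println(peekAndReset(SAMPLE, 16, 4));
    }
}
```

On an OpenJDK-style BufferedReader I would expect the first call to report "reset ok" and the second "Mark invalid", with the same text and the same limit in both cases. If that carries over to FastaFormat, it would suggest that only an over-long header that happens to straddle a buffer refill triggers the exception, which would fit with longer headers earlier in the file being parsed without trouble.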

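On the two open questions in the thread (getting a single record by accession, and a fix that does not just raise the read-ahead limit), one possible approach is to avoid mark()/reset() altogether and hold on to the header line that was read while finishing the previous record. This is a rough sketch under stated assumptions, not Josh's patch and not BioJava's FastaFormat: SimpleFastaReader, next() and findSequence() are hypothetical names, and it assumes the accession is the first whitespace-delimited token of the header, as in the FlyBase header quoted above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical helper, not part of BioJava: reads FASTA records without mark()/reset().
final class SimpleFastaReader {
    private final BufferedReader in;
    // Header of the next record, already consumed while reading the previous one.
    private String pendingHeader;

    SimpleFastaReader(BufferedReader in) {
        this.in = in;
    }

    /** Returns {description, sequence} for the next record, or null at end of input. */
    String[] next() throws IOException {
        String header = pendingHeader;
        pendingHeader = null;
        // Skip anything that precedes the first '>' line.
        while (header == null) {
            String line = in.readLine();
            if (line == null) {
                return null;
            }
            if (line.startsWith(">")) {
                header = line;
            }
        }
        StringBuilder sequence = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith(">")) {
                pendingHeader = line; // keep it for the next call instead of rewinding
                break;
            }
            sequence.append(line.trim());
        }
        return new String[] { header.substring(1).trim(), sequence.toString() };
    }

    /** Returns the sequence of the record with this accession, or null if none matches. */
    static String findSequence(BufferedReader in, String accession) throws IOException {
        SimpleFastaReader reader = new SimpleFastaReader(in);
        String[] record;
        while ((record = reader.next()) != null) {
            // Assumption: the accession is the first token of the description line.
            String id = record[0].split("\\s+", 2)[0];
            if (id.equals(accession)) {
                return record[1]; // stop as soon as the record is found
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        String fasta = ">A1 type=protein;\nMKV\nLLT\n>B2 type=protein;\nGGS\n";
        System.out.println(findSequence(new BufferedReader(new StringReader(fasta)), "B2"));
    }
}
```

Because the header is compared token by token rather than with startsWith, an accession such as FBpp0145468 should not match a longer ID that merely begins with the same characters, and header length no longer matters since nothing is ever rewound.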