I'm trying to parse a FASTA file obtained from IPI using the code at the bottom of this message, but I get the following error:

org.biojava.bio.BioException: Could not read sequence
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
        at test.readFasta(test.java:39)
        at test.main(test.java:18)
Caused by: java.io.IOException: Mark invalid
        at java.io.BufferedReader.reset(BufferedReader.java:485)
        at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
        ... 2 more

Looking at the FASTA file itself and running some tests, the parser seems to fail consistently one or two entries /preceding/ an entry with a very long description line, e.g.:
>IPI:IPI00021421.4|SWISS-PROT:Q9UMR5-1|TREMBL:B0S868|ENSEMBL:ENSP00000382748;ENSP00000382749;ENSP00000382750;ENSP00000387679;ENSP00000388341;ENSP00000388618;ENSP00000389930;ENSP00000392885;ENSP00000393009;ENSP00000395242;ENSP00000395562;ENSP00000397025;ENSP00000399879;ENSP00000403820;ENSP00000406496;ENSP00000406566;ENSP00000408703;ENSP00000411007;ENSP00000411625;ENSP00000412827|REFSEQ:NP_005146|VEGA:OTTHUMP00000014775;OTTHUMP00000014776;OTTHUMP00000014778;OTTHUMP00000175028;OTTHUMP00000175029;OTTHUMP00000175030;OTTHUMP00000193135;OTTHUMP00000193136;OTTHUMP00000193138;OTTHUMP00000193964;OTTHUMP00000193965;OTTHUMP00000193967;OTTHUMP00000194391;OTTHUMP00000194392;OTTHUMP00000194394 Tax_Id=9606 Gene_Symbol=PPT2 Isoform 1 of Lysosomal thioesterase PPT2
MLGLWGQRLPAAWVLLLLPFLPLLLLAAPAPHRASYKPVIVVHGLFDSSYSFRHLLEYIN
ETHPGTVVTVLDLFDGRESLRPLWEQVQGFREAVVPIMAKAPQGVHLICYSQGGLVCRAL
LSVMDDHNVDSFISLSSPQMGQYGDTDYLKWLFPTSMRSNLYRICYSPWGQEFSICNYWH
DPHHDDLYLNASSFLALINGERDHPNATVWRKNFLRVGHLVLIGGPDDGVITPWQSSFFG
FYDANETVLEMEEQLVYLRDSFGLKTLLARGAIVRCPMAGISHTAWHSNRTLYETCIEPW
LS

Deleting the large entries allows the code to continue until it reaches another long description line.
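
In case it helps anyone reproduce this, here is a rough sketch for listing the entries with long header lines, so they don't have to be found by trial and error. The class name LongHeaderFinder and the 500-character threshold are made up for illustration; I don't know what length actually triggers the failure.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class LongHeaderFinder {

    // Returns the 1-based record numbers whose '>' header line
    // is longer than maxLength characters.
    static List<Integer> longHeaders(Reader source, int maxLength) {
        List<Integer> hits = new ArrayList<Integer>();
        try {
            BufferedReader br = new BufferedReader(source);
            int record = 0;
            String line;
            while ((line = br.readLine()) != null) {
                if (line.startsWith(">")) {
                    record++;
                    if (line.length() > maxLength) {
                        hits.add(record);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java LongHeaderFinder <fasta file>");
            return;
        }
        Reader in = new FileReader(args[0]);
        try {
            // 500 is an arbitrary threshold, not a BioJava constant
            System.out.println(longHeaders(in, 500));
        } finally {
            in.close();
        }
    }
}
```

This uses only readLine(), with no mark()/reset(), so it should not hit the same exception.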

The problem also seems to be specific to large FASTA files: reading the above sequence alone, or as part of a small file, works fine.
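
The "Mark invalid" message itself comes from java.io.BufferedReader and can be reproduced without BioJava. The sketch below assumes the failing pattern is mark(limit), read past the limit, then reset(). The stack trace only shows the reset() call, so the mark() part and all the sizes here are my assumption and are not taken from the FastaFormat source. MarkInvalidDemo and markReadReset are made-up names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

final class MarkInvalidDemo {

    // Marks the start of the text, reads charsToRead characters, then tries reset().
    // Returns "ok" if reset() succeeds, otherwise the IOException message.
    static String markReadReset(String text, int bufferSize,
                                int readAheadLimit, int charsToRead) {
        try {
            BufferedReader br = new BufferedReader(new StringReader(text), bufferSize);
            br.mark(readAheadLimit);
            for (int i = 0; i < charsToRead; i++) {
                br.read();
            }
            br.reset();
            return "ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.append('A');
        }
        String text = sb.toString();

        // 16-char buffer and 8-char limit: sizes chosen only to keep the demo small.
        // Past the limit, but still inside the first buffer fill:
        System.out.println(markReadReset(text, 16, 8, 12));
        // Past the limit and past a buffer refill:
        System.out.println(markReadReset(text, 16, 8, 20));
        // Limit large enough to cover everything that was read:
        System.out.println(markReadReset(text, 16, 64, 20));
    }
}
```

With the Sun/OpenJDK BufferedReader I would expect this to print "ok", "Mark invalid", "ok": reset() only fails once the buffer has been refilled after the limit was passed. If FastaFormat does something similar, that might explain why the failure depends on where an entry falls in a large file rather than on the entry alone, but that is a guess.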

Is this a known problem, or am I doing something wrong? BTW, I'm using BioJava 1.7 and Java 1.6.0_17.
Any help would be most appreciated.
Cheers.

code:
import java.io.*;

import org.biojava.bio.*;
import org.biojavax.*;
import org.biojavax.bio.seq.*;

public class test {
   private static PrintStream o = System.out;

   public static void main(String[] args) {
      // TODO Auto-generated method stub
      readFasta(args[0]);
   }
        
   public static void readFasta(String filename) {
      try {
         o.println("Reading file: " + filename);
         //prepare a BufferedReader for file io
         BufferedReader br = new BufferedReader(new FileReader(filename));

         // read Fasta file as BioJava RichSequence object
         Namespace ns = RichObjectFactory.getDefaultNamespace();
         RichSequenceIterator iter = RichSequence.IOTools.readFastaProtein(br,ns);

         int numProteins = 0;
         while(iter.hasNext()) {
            ++numProteins;

            // Retrieve sequence and description data
            RichSequence seq = iter.nextRichSequence();
            String ipi = seq.getName().substring(4,15);
            o.println(ipi);
                        
         }
         o.println("Found " + numProteins + " in Fasta file");
      } catch (FileNotFoundException ex) {
         //can't find file specified by args[0]
         ex.printStackTrace();
      } catch (BioException ex) {
         //error parsing requested format
         ex.printStackTrace();
      }
   }

}