Re: [Biojava-l] Error parsing ipi.HUMAN.fasta file

Chris Cole Mon, 11 Jan 2010 06:59:38 -0800

[ posting back to biojava-l as omitted the address previously ]

Ah, right. It wasn't clear on the wiki whether those were included or not with the 'all' package.

It compiles fine now (with one warning) and, through trial and error, a buffer value of 2000 works with ipi.HUMAN.fasta as well as mouse and chicken.


Thanks very much for your help.

Chris

On 11/01/10 12:31, Richard Holland wrote:

Hello. You need to make sure the support libraries are also on your classpath:

http://www.biojava.org/wiki/BioJava:Download#Support_libraries

cheers,
Richard

On 11 Jan 2010, at 12:16, Chris Cole wrote:

Thanks for the reply, Richard.

Just getting back to this problem. I've upped the buffer to 1000 bytes, but I 
can't get it to compile with ant. I get a whole slew of compile errors, there 
seems to be something missing, but I don't know how to solve it. Output from 
ant build follows:

caterpillar: ~/Downloads/biojava-1.7/src>  ant -f ../build.xml
Buildfile: ../build.xml

init:
     [echo] Building biojava-1.7
     [echo] Java Home:                       /usr/java/jdk1.6.0_17/jre
     [echo] JUnit present:                   ${junit.present}
     [echo] JUnit supported by Ant:          ${junit.support}
     [echo] HSQLDB driver present:           ${sqlDriver.hsqldb}
     [echo] XSLT support:                    true

prepare:

prepare-biojava:

compile-biojava:
    [javac] Compiling 1462 source files to /opt/Downloads/biojava-1.7/ant-build/classes/biojava
    [javac] /opt/Downloads/biojava-1.7/src/org/biojava/bio/dp/twohead/DPCompiler.java:55: package org.biojava.utils.bytecode does not exist
    [javac] import org.biojava.utils.bytecode.ByteCode;
    [javac]                                  ^
    [javac] /opt/Downloads/biojava-1.7/src/org/biojava/bio/dp/twohead/DPCompiler.java:56: package org.biojava.utils.bytecode does not exist
    [javac] import org.biojava.utils.bytecode.CodeClass;
    [javac]                                  ^
    [javac] /opt/Downloads/biojava-1.7/src/org/biojava/bio/dp/twohead/DPCompiler.java:57: package org.biojava.utils.bytecode does not exist
    [javac] import org.biojava.utils.bytecode.CodeException;
    [javac]                                  ^
...etc.

I originally downloaded the biojava-1.7-all.jar, and I can't see what else I need.

I'm also trying to do this from within Eclipse, so any Eclipse-specific 
pointers would be much appreciated.
Cheers,

Chris

On 18/12/09 16:58, Richard Holland wrote:

The FASTA parser has a buffer which it uses to read ahead to the next
complete line then back up before it actually parses it on the second
pass (in order to allow it to do things like hasNext()). The
exception shows that the size of that buffer is being exceeded,
causing it to fail to back up again afterwards.
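
The failure mode described above can be reproduced with java.io alone. The following is a minimal sketch, not BioJava's actual code: the class name MarkResetDemo, the helper canBackUp and the sizes used are all hypothetical, and the behaviour shown is that of the OpenJDK BufferedReader (the Javadoc only says a reset past the limit "may fail"). It marks the stream, reads one line, and tries to back up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical demo class; it does not use or mirror BioJava's FastaFormat.
class MarkResetDemo {

    /**
     * Marks the stream, reads one line, then tries to back up to the mark.
     * Returns true if reset() succeeded, false if an IOException
     * (such as "Mark invalid") was thrown.
     */
    static boolean canBackUp(String text, int bufferSize, int readAheadLimit) {
        try (BufferedReader in = new BufferedReader(new StringReader(text), bufferSize)) {
            in.mark(readAheadLimit); // promise to read at most this many chars
            in.readLine();           // read ahead to the end of the line
            in.reset();              // back up; fails if the mark was dropped
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A 101-character header line followed by a short sequence line.
        StringBuilder sb = new StringBuilder(">");
        for (int i = 0; i < 100; i++) {
            sb.append('X');
        }
        sb.append("\nMLGLWGQRLP\n");
        String text = sb.toString();

        // Limit smaller than the line, and the reader has to refill mid-line.
        System.out.println(canBackUp(text, 16, 10));
        // Limit larger than the line.
        System.out.println(canBackUp(text, 16, 200));
        // Limit smaller than the line, but all input fits in one buffer fill.
        System.out.println(canBackUp(text, 8192, 10));
    }
}
```

On OpenJDK this should print false, true, true. The third case is a plausible explanation for why the long entry parses fine alone or in a small file: the mark is only dropped when the reader has to refill its internal buffer after going past the limit, which happens in a large file but not when the whole input fits in a single fill.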

There are two cures - one is to rewrite the FASTA parser to buffer
things in a different way. The other is to open up
org/biojavax/bio/seq/io/FastaFormat.java in a text editor, search for
the line where it sets the buffer (somewhere around line 202
according to the exception, in the readRichSequence() method - the
command to look for is 'mark'), and increase the buffer size to
something suitably large enough (it's currently set at 500 bytes).
Then recompile BioJava and it should work.
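
Rather than finding a large enough value by trial and error, the longest line in the file can be measured first. The sketch below assumes that the parser only ever reads ahead by one line, as described above; the class LongestLine and its method are hypothetical helpers, not part of BioJava. Note that the limit passed to BufferedReader.mark() is counted in characters, and the chosen value should leave room for the line terminator.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of BioJava.
class LongestLine {

    /** Returns the length in characters of the longest line, excluding line terminators. */
    static int longestLine(Reader source) {
        try (BufferedReader in = new BufferedReader(source)) {
            int longest = 0;
            String line;
            while ((line = in.readLine()) != null) {
                longest = Math.max(longest, line.length());
            }
            return longest;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass a FASTA file name as the first argument; a tiny built-in sample is used otherwise.
        Reader source = args.length > 0
                ? new FileReader(args[0])
                : new StringReader(">short header\nMLGLWGQRLPAAWVLLLLPF\n");
        System.out.println("Longest line: " + longestLine(source) + " characters");
    }
}
```

Running it against ipi.HUMAN.fasta gives a lower bound for the buffer size; anything comfortably above the reported length should do under the assumption stated above.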

cheers, Richard

On 18 Dec 2009, at 15:53, Chris Cole wrote:

I'm wanting to parse a fasta file obtained from IPI using the code
at the bottom of this message, but I get the following error:

org.biojava.bio.BioException: Could not read sequence
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
    at test.readFasta(test.java:39)
    at test.main(test.java:18)
Caused by: java.io.IOException: Mark invalid
    at java.io.BufferedReader.reset(BufferedReader.java:485)
    at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
    ... 2 more

Looking at the Fasta file itself and doing some tests, it seems to
fail consistently at one or two entries /preceding/ an entry with a
very long description line e.g.:

IPI:IPI00021421.4|SWISS-PROT:Q9UMR5-1|TREMBL:B0S868|ENSEMBL:ENSP00000382748;ENSP00000382749;ENSP00000382750;ENSP00000387679;ENSP00000388341;ENSP00000388618;ENSP00000389930;ENSP00000392885;ENSP00000393009;ENSP00000395242;ENSP00000395562;ENSP00000397025;ENSP00000399879;ENSP00000403820;ENSP00000406496;ENSP00000406566;ENSP00000408703;ENSP00000411007;ENSP00000411625;ENSP00000412827|REFSEQ:NP_005146|VEGA:OTTHUMP00000014775;OTTHUMP00000014776;OTTHUMP00000014778;OTTHUMP00000175028;OTTHUMP00000175029;OTTHUMP00000175030;OTTHUMP00000193135;OTTHUMP00000193136;OTTHUMP00000193138;OTTHUMP00000193964;OTTHUMP00000193965;OTTHUMP00000193967;OTTHUMP00000194391;OTTHUMP00000194392;OTTHUMP00000194394
Tax_Id=9606 Gene_Symbol=PPT2 Isoform 1 of Lysosomal thioesterase
PPT2

MLGLWGQRLPAAWVLLLLPFLPLLLLAAPAPHRASYKPVIVVHGLFDSSYSFRHLLEYIN
ETHPGTVVTVLDLFDGRESLRPLWEQVQGFREAVVPIMAKAPQGVHLICYSQGGLVCRAL
LSVMDDHNVDSFISLSSPQMGQYGDTDYLKWLFPTSMRSNLYRICYSPWGQEFSICNYWH
DPHHDDLYLNASSFLALINGERDHPNATVWRKNFLRVGHLVLIGGPDDGVITPWQSSFFG
FYDANETVLEMEEQLVYLRDSFGLKTLLARGAIVRCPMAGISHTAWHSNRTLYETCIEPW LS

Deleting the large entries allows the code to continue until it
reaches another long description line.

It also seems to be a feature of large Fasta files as reading the
above sequence alone or as part of a small file is fine.

Is this a known problem or am I doing something wrong? BTW I'm
using biojava 1.7 and Java 1.6.0_17. Any help would be most
appreciated. Cheers.

code:

import java.io.*;

import org.biojava.bio.*;
import org.biojavax.*;
import org.biojavax.bio.seq.*;

public class test {
    private static PrintStream o = System.out;

    public static void main(String[] args) {
        // TODO Auto-generated method stub
        readFasta(args[0]);
    }

    public static void readFasta(String filename) {
        try {
            o.println("Reading file: " + filename);
            //prepare a BufferedReader for file io
            BufferedReader br = new BufferedReader(new FileReader(filename));

            // read Fasta file as BioJava RichSequence object
            Namespace ns = RichObjectFactory.getDefaultNamespace();
            RichSequenceIterator iter = RichSequence.IOTools.readFastaProtein(br,ns);

            int numProteins = 0;
            while(iter.hasNext()) {
                ++numProteins;

                // Retrieve sequence and description data
                RichSequence seq = iter.nextRichSequence();
                String ipi = seq.getName().substring(4,15);
                o.println(ipi);
            }
            o.println("Found " + numProteins + " in Fasta file");
        } catch (FileNotFoundException ex) {
            //can't find file specified by args[0]
            ex.printStackTrace();
        } catch (BioException ex) {
            //error parsing requested format
            ex.printStackTrace();
        }
    }

}


--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: [email protected]
http://www.eaglegenomics.com/



--
Dr Chris Cole
Senior Bioinformatics Research Officer
School of Life Sciences Research
University of Dundee
Dow Street
Dundee
DD1 5EH
Scotland, UK

url: http://network.nature.com/profile/drchriscole
e-mail: [email protected]
Tel: +44 (0)1382 388 721

The University of Dundee is a registered Scottish charity, No: SC015096
_______________________________________________
Biojava-l mailing list  -  [email protected]
http://lists.open-bio.org/mailman/listinfo/biojava-l
