Hi all,
I apologize for missing the "next day or so" window, but here is my patch.
I'm attaching a patch file for FastaFormat.java (1.7 tagged branch), the full
source file, and a test class. It seems to work, and performance is on par
with the previous approach in my measurements.
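
For anyone who wants the gist without reading the diff: the patch peeks at a
single character to decide whether the next line starts a new record, rather
than reading a whole header line and then resetting. Below is a minimal
stand-alone sketch of that idea, not the patch itself. The names
FastaLookAheadSketch, readSequences and parse are made up for illustration,
and the sketch uses the int-returning Reader.read() instead of a CharBuffer.
(If a single CharBuffer is reused, it needs a clear() before each read, since
Reader.read(CharBuffer) only fills the buffer's remaining space.)

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone sketch; not part of BioJava or the attached patch. */
final class FastaLookAheadSketch {

    /** Returns the concatenated sequence lines of each record, in file order. */
    static List<String> readSequences(BufferedReader reader) throws IOException {
        List<String> sequences = new ArrayList<String>();
        String header = reader.readLine();
        while (header != null && header.startsWith(">")) {
            StringBuilder seq = new StringBuilder();
            while (true) {
                reader.mark(1);           // at most one char is read before reset()
                int c = reader.read();    // -1 signals end of stream
                if (c == -1) {
                    break;
                }
                reader.reset();           // push the peeked character back
                if (c == '>') {
                    break;                // next record starts here; leave it unread
                }
                seq.append(reader.readLine().trim());
            }
            sequences.add(seq.toString());
            header = reader.readLine();   // next header, or null at end of stream
        }
        return sequences;
    }

    /** Convenience wrapper for in-memory text; the tiny buffer forces frequent refills. */
    static List<String> parse(String fasta) {
        try {
            return readSequences(new BufferedReader(new StringReader(fasta), 64));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because only one character is ever read between mark() and reset(), the
length of the following header line should no longer matter.
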
One problem that I couldn't quite figure out a way around is with the
guessSymbolTokenization(BufferedInputStream stream) method. If the header of
the first sequence in the stream plus its first line of sequence data is
longer than 2000 characters, the stream will fail to reset properly.
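
To show the kind of failure I mean, here is a rough sketch (not part of the
patch; MarkLimitDemo and canResetAfterReading are hypothetical names) that
marks a BufferedInputStream, reads past the mark limit, and then tries to
reset. As far as I can tell from the JDK implementation, the mark is only
dropped once the internal buffer fills up and is not allowed to grow past the
limit, so whether reset() fails depends on the buffer size as well as on the
limit. Similar behaviour in BufferedReader might be why some long headers
parse fine while others do not, but that part is a guess.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

/** Hypothetical demo class; not part of BioJava or the attached patch. */
final class MarkLimitDemo {

    /**
     * Marks a BufferedInputStream, reads bytesToRead bytes, then tries to reset.
     * Returns true if reset() succeeded, false if the mark had been invalidated.
     */
    static boolean canResetAfterReading(int bufferSize, int markLimit, int bytesToRead) {
        byte[] data = new byte[4096];   // dummy content; the values are irrelevant
        BufferedInputStream in =
                new BufferedInputStream(new ByteArrayInputStream(data), bufferSize);
        in.mark(markLimit);
        try {
            byte[] sink = new byte[bytesToRead];
            int total = 0;
            while (total < bytesToRead) {
                int n = in.read(sink, total, bytesToRead - total);
                if (n < 0) {
                    break;              // ran out of data
                }
                total += n;
            }
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for in-memory data
        }
        try {
            in.reset();
            return true;
        } catch (IOException e) {
            return false;               // the mark was invalidated
        }
    }
}
```

With a 16-byte buffer and a limit of 5, reading 100 bytes invalidates the
mark. Raising the limit to 200, or keeping the limit at 5 but using an
8192-byte buffer, lets the same 100-byte read reset cleanly.
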
Cheers,
Josh
Richard Holland wrote:
> I'd love to see a proper solution to this that doesn't involve upping
> the read-ahead limit. I was aware that it might be the issue, but had no
> idea why it was not failing for other similar long sequences. I look
> forward to seeing your suggested fix!
>
> thanks,
> Richard
>
> Josh Goodman wrote:
>> Hi Richard and JP,
>>
>> I think I can be of some help as I'm the FlyBase developer responsible for
>> generating these troublesome FASTA files :-). The cause of this problem
>> appears to be the description line length for the record FBpp0145470.
>>
>> The trouble lies in org.biojavax.bio.seq.io.FastaFormat in the while loop
>> at line 196. Biojava correctly reads in FBpp0145468 but throws an error
>> when trying to parse FBpp0145469. There is nothing wrong in FBpp0145469
>> but when biojava reaches the end of the sequence it reads in the header
>> for the next record (FBpp0145470). It then tries to reset the
>> BufferedReader to the start of FBpp0145470 but that is where the exception
>> is thrown because line 197 sets the read ahead limit to 500 characters and
>> the reader.readLine() command exceeds that limit.
>>
>> What isn't obvious to me is why other large definition lines that precede
>> that line don't throw the same error (e.g. FBpp0157909). I guess the
>> javadoc on BufferedReader.mark() does say "may fail" but I assumed it
>> would be more predictable than that.
>>
>> The file in question can be downloaded from
>> ftp://ftp.flybase.net/genomes/Drosophila_grimshawi/dgri_r1.3_FB2008_07/fasta/dgri-all-translation-r1.3.fasta.gz.
>>
>> If there is interest in a solution that doesn't involve simply upping the
>> read ahead limit I can put a patch file together in the next day or so.
>>
>> Cheers,
>> Josh
>>
>> On Tue, 28 Apr 2009, Richard Holland wrote:
>>
>>> You're right, doesn't look like newlines.
>>>
>>> The "Mark invalid" happens when the parser looks too far ahead in the
>>> file attempting to seek out the next valid sequence to parse. I'm not
>>> sure why this is happening.
>>>
>>> I don't have the time to test right now but if you could post the link
>>> to where someone could download the same FASTA as you're using, then it
>>> would make it possible for someone else to investigate in more detail.
>>>
>>> thanks,
>>> Richard
>>>
>>> JP wrote:
>>>> Thanks Richard for your prompt reply.
>>>>
>>>> I will not attach the fasta file I am parsing (12MB); it's
>>>> dgri-all-translation-r1.3.fasta from the FlyBase project.
>>>>
>>>> If the file had any extra new lines I would see them when I loaded it in
>>>> a text editor - no ?
>>>>
>>>> I implemented the whole thing without using Biojava (for this part)
>>>>
>>>> fr = new FileReader(fastaProteinFileName);
>>>> br = new BufferedReader(fr);
>>>> String fastaLine;
>>>> String startAccession = '>' + accessionId.trim();
>>>> String fastaEntry = "";
>>>> boolean record = false;
>>>> while ((fastaLine = br.readLine()) != null) {
>>>>     fastaLine = fastaLine.trim() + '\n';
>>>>     if (fastaLine.startsWith(startAccession)) {
>>>>         record = true;
>>>>     } else if (record && fastaLine.startsWith(">")) {
>>>>         record = false;
>>>>         break;
>>>>     }
>>>>     if (record) {
>>>>         fastaEntry += fastaLine;
>>>>     }
>>>> }
>>>>
>>>>
>>>> Notice - I do not use regex - since I'd need to read the whole file and
>>>> then regex upon it (if the record is the first one - I just read that one).
>>>>
>>>> Cheers
>>>> JP
>>>>
>>>>
>>>> On Tue, Apr 28, 2009 at 3:27 PM, Richard Holland
>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>
>>>> The "Mark invalid" exception is indicating that the parser has gone too
>>>> far ahead in the file looking for a valid header. I'm not sure why but
>>>> looking at your original query, there may be extra newlines embedded
>>>> into your FASTA header line? That would definitely confuse it.
>>>>
>>>> The parser is not able to currently pull out just one sequence - in
>>>> effect this is a search facility, which it doesn't have. :(
>>>>
>>>> thanks,
>>>> Richard
>>>>
>>>> JP wrote:
>>>> > Hi all at BioJava,
>>>> >
>>>> > I am trying to parse several FASTA files using the following code:
>>>> >
>>>> >> fr = new FileReader(fastaProteinFileName);
>>>> >> br = new BufferedReader(fr);
>>>> >>
>>>> >> RichSequenceIterator protIter = IOTools.readFastaProtein(br, null);
>>>> >> while (protIter.hasNext()) {
>>>> >>     BioEntry bioEntry = protIter.nextBioEntry();
>>>> >>     System.out.println(fastaProteinFileName + " == " + accessionId + " = " + bioEntry.getAccession());
>>>> >> }
>>>> >
>>>> >
>>>> > At particular points in my fasta file - I get the following
>>>> exception:
>>>> >
>>>> >> 14:53:42,546 ERROR FastaFileProcessing - File parsing exception (from biojava library)
>>>> >> org.biojava.bio.BioException: Could not read sequence
>>>> >>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
>>>> >>     at org.biojavax.bio.seq.io.RichStreamReader.nextBioEntry(RichStreamReader.java:99)
>>>> >>     at edu.imperial.msc.orthologue.fasta.FastaFileProcessing.getProteinSequenceFromFASTAFile(FastaFileProcessing.java:60)
>>>> >>     at edu.imperial.msc.orthologue.core.OrthologueFinder.getFASTAEntries(OrthologueFinder.java:64)
>>>> >>     at edu.imperial.msc.orthologue.core.OrthologueFinder.<init>(OrthologueFinder.java:51)
>>>> >>     at edu.imperial.msc.orthologue.launcher.OrthologueFinderLauncher.main(OrthologueFinderLauncher.java:60)
>>>> >> Caused by: java.io.IOException: Mark invalid
>>>> >>     at java.io.BufferedReader.reset(Unknown Source)
>>>> >>     at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
>>>> >>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
>>>> >>     ... 5 more
>>>> >
>>>> >
>>>> > Interestingly if I delete the header portion of the header line (from
>>>> > type=protein... till the end of the line ...Dgri;)
>>>> >
>>>> >> FBpp0145468 type=protein;
>>>> >>
>>>>
>>>> loc=scaffold_15252:join(13219687..13219727,13219972..13220279,13220507..13220798,13220861..13221180,13221286..13221467,13222258..13222629,13226331..13226463,13226531..13226658);
>>>> >> ID=FBpp0145468; name=Dgri\GH11562-PA;
>>>> parent=FBgn0119042,FBtr0146976;
>>>> >> dbxref=FlyBase:FBpp0145468,FlyBase_Annotation_IDs:GH11562-PA;
>>>> >> MD5=c8dc38c7197a0d3c93c78b08059e2604; length=591; release=r1.3;
>>>> >> species=Dgri;
>>>> >>
>>>> >
>>>> > It works - but I have a number of these exceptions (and I do not want to
>>>> > edit the original data). Mind you, I have longer headers in my file which
>>>> > are parsed OK (strange!).
>>>> >
>>>> > Any ideas anyone? Alternatively - is there a better way to get ONE SINGLE
>>>> > sequence from the whole fasta file, given that I have the accession id
>>>> > (FBpp0145468)?
>>>> >
>>>> > Many Thanks
>>>> > JP
>>>> > _______________________________________________
>>>> > Biojava-l mailing list - [email protected]
>>>> <mailto:[email protected]>
>>>> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>> >
>>>>
>>>> --
>>>> Richard Holland, BSc MBCS
>>>> Finance Director, Eagle Genomics Ltd
>>>> T: +44 (0)1223 654481 ext 3 | E: [email protected]
>>>> <mailto:[email protected]>
>>>> http://www.eaglegenomics.com/
>>>>
>>>>
>
Index: src/org/biojavax/bio/seq/io/FastaFormat.java
===================================================================
--- src/org/biojavax/bio/seq/io/FastaFormat.java (revision 6947)
+++ src/org/biojavax/bio/seq/io/FastaFormat.java Tue May 12 15:46:40 EDT 2009
@@ -21,17 +21,6 @@
package org.biojavax.bio.seq.io;
-import java.io.BufferedInputStream;
-import java.io.BufferedReader;
-import java.io.File;
-import java.io.FileReader;
-import java.io.IOException;
-import java.io.InputStreamReader;
-import java.io.PrintStream;
-import java.util.Map;
-import java.util.regex.Matcher;
-import java.util.regex.Pattern;
-
import org.biojava.bio.seq.Sequence;
import org.biojava.bio.seq.io.ParseException;
import org.biojava.bio.seq.io.SeqIOListener;
@@ -46,7 +35,12 @@
import org.biojavax.SimpleNamespace;
import org.biojavax.bio.seq.RichSequence;
+import java.io.*;
+import java.nio.CharBuffer;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
/**
* Format object representing FASTA files. These files are almost pure
* sequence data.
@@ -115,10 +109,11 @@
* A stream is in FASTA format if the stream starts with ">".
*/
public boolean canRead(BufferedInputStream stream) throws IOException {
- stream.mark(2000); // some streams may not support this
+ stream.mark(5); // some streams may not support this
BufferedReader br = new BufferedReader(new InputStreamReader(stream));
- String firstLine = br.readLine();
- boolean readable = firstLine!=null && firstLine.startsWith(">");
+ CharBuffer cb = CharBuffer.allocate(1);
+ int numChars = br.read(cb);
+ boolean readable = numChars > 0 && cb.get(0) == '>';
// don't close the reader as it'll close the stream too.
// br.close();
stream.reset();
@@ -191,20 +186,36 @@
processHeader(line,rsiol,ns);
- StringBuffer seq = new StringBuffer();
- boolean hasMoreSeq = true;
+ StringBuffer seq = new StringBuffer(); //Buffer to hold sequence data.
+ CharBuffer cb = CharBuffer.allocate(1); //CharBuffer to hold a single character look ahead.
+ boolean hasMoreSeq = true; //Boolean to control iterating over sequence lines.
+ int numChars = 0; //Initializing int to track read status.
+
while (hasMoreSeq) {
- reader.mark(500);
+ //Mark the current buffer, read in a single character,
+ //and then reset the buffer back to the previous position.
+ reader.mark(5);
+ cb.clear(); //read(CharBuffer) only fills remaining space, so make room for one character.
+ numChars = reader.read(cb);
+ reader.reset();
+
+ //Do this if we successfully read in a character.
+ if (numChars > 0) {
+ //Exit the while loop if we reach the start of the next FASTA record.
+ if (cb.get(0) == '>') {
+ hasMoreSeq = false;
+ }
+ //Otherwise read the entire line and store the sequence data.
+ else {
- line = reader.readLine();
+ line = reader.readLine();
- if (line!=null) {
+ if (line != null) {
- line = line.trim();
+ line = line.trim();
- if (line.length() > 0 && line.charAt(0)=='>') {
- reader.reset();
- hasMoreSeq = false;
- } else {
- seq.append(line);
- }
+ seq.append(line);
+ }
- } else {
+ }
+ }
+ //Exit the while loop if we have reached the end of the file.
+ else {
hasMoreSeq = false;
}
}
@@ -225,7 +236,7 @@
rsiol.endSequence();
- return line!=null;
+ return numChars > 0;
}
/** Parse the Header information from the Fasta Description line
/*
 * BioJava development code
 *
 * This code may be freely distributed and modified under the
 * terms of the GNU Lesser General Public Licence. This should
 * be distributed with the code. If you do not have a copy,
 * see:
 *
 * http://www.gnu.org/copyleft/lesser.html
 *
 * Copyright for this code is held jointly by the individual
 * authors. These should be listed in @author doc comments.
 *
 * For more information on the BioJava project and its aims,
 * or to join the biojava-l mailing list, visit the home page
 * at:
 *
 * http://www.biojava.org/
 *
 */

package org.biojavax.bio.seq.io;

import org.biojava.bio.seq.Sequence;
import org.biojava.bio.seq.io.ParseException;
import org.biojava.bio.seq.io.SeqIOListener;
import org.biojava.bio.seq.io.SymbolTokenization;
import org.biojava.bio.symbol.IllegalSymbolException;
import org.biojava.bio.symbol.SimpleSymbolList;
import org.biojava.bio.symbol.Symbol;
import org.biojava.bio.symbol.SymbolList;
import org.biojava.utils.ChangeVetoException;
import org.biojavax.Namespace;
import org.biojavax.RichObjectFactory;
import org.biojavax.SimpleNamespace;
import org.biojavax.bio.seq.RichSequence;

import java.io.*;
import java.nio.CharBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Format object representing FASTA files. These files are almost pure
 * sequence data.
 * @author Thomas Down
 * @author Matthew Pocock
 * @author Greg Cox
 * @author Lukas Kall
 * @author Richard Holland
 * @author Mark Schreiber
 * @since 1.5
 */
public class FastaFormat extends RichSequenceFormat.HeaderlessFormat {

    // Register this format with the format auto-guesser.
    static {
        RichSequence.IOTools.registerFormat(FastaFormat.class);
    }

    /**
     * The name of this format
     */
    public static final String FASTA_FORMAT = "FASTA";

    // header line
    protected static final Pattern hp = Pattern.compile(">\\s*(\\S+)(\\s+(.*))?");
    // description chunk
    protected static final Pattern dp = Pattern.compile( "^(gi\\|(\\d+)\\|)?(\\w+)\\|(\\w+?)(\\.(\\d+))?\\|(\\w+)?$");

    protected static final Pattern readableFiles = Pattern.compile(".*(fa|fas)$");
    protected static final Pattern aminoAcids = Pattern.compile(".*[FLIPQE].*");

    private FastaHeader header = new FastaHeader();

    /**
     * {@inheritDoc}
     * A file is in FASTA format if the name ends with fa or fas, or the file starts with ">".
     */
    @Override
    public boolean canRead(File file) throws IOException {
        if (readableFiles.matcher(file.getName()).matches()) return true;
        BufferedReader br = new BufferedReader(new FileReader(file));
        String firstLine = br.readLine();
        boolean readable = firstLine!=null && firstLine.startsWith(">");
        br.close();
        return readable;
    }

    /**
     * {@inheritDoc}
     * Returns a protein parser if the first line of sequence contains any of F/L/I/P/Q/E,
     * otherwise returns a DNA tokenizer.
     */
    @Override
    public SymbolTokenization guessSymbolTokenization(File file) throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(file));
        br.readLine(); // discard first line
        boolean aa = aminoAcids.matcher(br.readLine()).matches();
        br.close();
        if (aa) return RichSequence.IOTools.getProteinParser();
        else return RichSequence.IOTools.getDNAParser();
    }

    /**
     * {@inheritDoc}
     * A stream is in FASTA format if the stream starts with ">".
     */
    public boolean canRead(BufferedInputStream stream) throws IOException {
        stream.mark(5); // some streams may not support this
        BufferedReader br = new BufferedReader(new InputStreamReader(stream));
        CharBuffer cb = CharBuffer.allocate(1);
        int numChars = br.read(cb);
        boolean readable = numChars > 0 && cb.get(0) == '>';
        // don't close the reader as it'll close the stream too.
        // br.close();
        stream.reset();
        return readable;
    }

    /**
     * {@inheritDoc}
     * Returns a protein parser if the first line of sequence contains any of F/L/I/P/Q/E,
     * otherwise returns a DNA tokenizer.
     */
    public SymbolTokenization guessSymbolTokenization(BufferedInputStream stream) throws IOException {
        stream.mark(2000); // some streams may not support this
        BufferedReader br = new BufferedReader(new InputStreamReader(stream));
        br.readLine(); // discard first line
        boolean aa = aminoAcids.matcher(br.readLine()).matches();
        // don't close the reader as it'll close the stream too.
        // br.close();
        stream.reset();
        if (aa) return RichSequence.IOTools.getProteinParser();
        else return RichSequence.IOTools.getDNAParser();
    }

    /**
     * {@inheritDoc}
     */
    public boolean readSequence(
            BufferedReader reader,
            SymbolTokenization symParser,
            SeqIOListener listener
            ) throws
            IllegalSymbolException,
            IOException,
            ParseException {
        if (!(listener instanceof RichSeqIOListener)) throw new IllegalArgumentException("Only accepting RichSeqIOListeners today");
        return this.readRichSequence(reader,symParser,(RichSeqIOListener)listener,null);
    }

    /**
     * {@inheritDoc}
     * If namespace is null, then the namespace of the sequence in the fasta is used.
     * If the namespace is null and so is the namespace of the sequence in the fasta,
     * then the default namespace is used.
     */
    public boolean readRichSequence(
            BufferedReader reader,
            SymbolTokenization symParser,
            RichSeqIOListener rsiol,
            Namespace ns
            ) throws
            IllegalSymbolException,
            IOException,
            ParseException {

        String line = reader.readLine();
        if (line == null) {
            throw new IOException("Premature stream end");
        }
        while(line.length() == 0) {
            line = reader.readLine();
            if (line == null) {
                throw new IOException("Premature stream end");
            }
        }
        if (!line.startsWith(">")) {
            throw new IOException("Stream does not appear to contain FASTA formatted data: " + line);
        }

        rsiol.startSequence();

        processHeader(line,rsiol,ns);

        StringBuffer seq = new StringBuffer(); //Buffer to hold sequence data.
        CharBuffer cb = CharBuffer.allocate(1); //CharBuffer to hold a single character look ahead.
        boolean hasMoreSeq = true; //Boolean to control iterating over sequence lines.
        int numChars = 0; //Initializing int to track read status.

        while (hasMoreSeq) {
            //Mark the current buffer, read in a single character,
            //and then reset the buffer back to the previous position.
            reader.mark(5);
            cb.clear(); //read(CharBuffer) only fills remaining space, so make room for one character.
            numChars = reader.read(cb);
            reader.reset();

            //Do this if we successfully read in a character.
            if (numChars > 0) {
                //Exit the while loop if we reach the start of the next FASTA record.
                if (cb.get(0) == '>') {
                    hasMoreSeq = false;
                }
                //Otherwise read the entire line and store the sequence data.
                else {
                    line = reader.readLine();
                    if (line != null) {
                        line = line.trim();
                        seq.append(line);
                    }
                }
            }
            //Exit the while loop if we have reached the end of the file.
            else {
                hasMoreSeq = false;
            }
        }

        if (!this.getElideSymbols()) {
            try {
                SymbolList sl = new SimpleSymbolList(symParser,
                        seq.toString().replaceAll("\\s+","").replaceAll("[\\.|~]","-"));
                rsiol.addSymbols(symParser.getAlphabet(),
                        (Symbol[])(sl.toList().toArray(new Symbol[0])),
                        0, sl.length());
            } catch (Exception e) {
                // do not know name and gi any longer, replace them with empty string.
                // why does the rsiol only have setter methods, but not getter???
                String message = ParseException.newMessage(this.getClass(), "", "", "problem parsing symbols", seq.toString());
                throw new ParseException(e, message);
            }
        }

        rsiol.endSequence();

        return numChars > 0;
    }

    /** Parse the Header information from the Fasta Description line
     *
     * @param line
     * @param rsiol
     * @param ns
     * @throws IOException
     * @throws ParseException
     */
    public void processHeader(String line,RichSeqIOListener rsiol,Namespace ns)
            throws IOException, ParseException {
        Matcher m = hp.matcher(line);
        if (!m.matches()) {
            throw new IOException("Stream does not appear to contain FASTA formatted data: " + line);
        }

        String name = m.group(1);
        String desc = m.group(3);
        String gi = null;

        m = dp.matcher(name);
        if (m.matches()) {
            gi = m.group(2);
            String namespace = m.group(3);
            String accession = m.group(4);
            String verString = m.group(6);
            int version = verString==null?0:Integer.parseInt(verString);
            name = m.group(7);
            if (name==null) name=accession;

            rsiol.setAccession(accession);
            rsiol.setVersion(version);
            if (gi!=null) rsiol.setIdentifier(gi);
            if (ns==null) rsiol.setNamespace((Namespace)RichObjectFactory.getObject(SimpleNamespace.class,new Object[]{namespace}));
            else rsiol.setNamespace(ns);
        } else {
            rsiol.setAccession(name);
            rsiol.setNamespace((ns==null?RichObjectFactory.getDefaultNamespace():ns));
        }
        rsiol.setName(name);
        if (!this.getElideComments()) rsiol.setDescription(desc);
    }

    /**
     * {@inheritDoc}
     */
    public void writeSequence(Sequence seq, PrintStream os) throws IOException {
        if (this.getPrintStream()==null) this.setPrintStream(os);
        this.writeSequence(seq, RichObjectFactory.getDefaultNamespace());
    }

    /**
     * {@inheritDoc}
     */
    public void writeSequence(Sequence seq, String format, PrintStream os) throws IOException {
        if (this.getPrintStream()==null) this.setPrintStream(os);
        if (!format.equals(this.getDefaultFormat())) throw new IllegalArgumentException("Unknown format: "+format);
        this.writeSequence(seq, RichObjectFactory.getDefaultNamespace());
    }

    /**
     * {@inheritDoc}
     * If namespace is null, then the sequence's own namespace is used.
     */
    public void writeSequence(Sequence seq, Namespace ns) throws IOException {
        RichSequence rs;
        try {
            if (seq instanceof RichSequence) rs = (RichSequence)seq;
            else rs = RichSequence.Tools.enrich(seq);
        } catch (ChangeVetoException e) {
            IOException e2 = new IOException("Unable to enrich sequence");
            e2.initCause(e);
            throw e2;
        }

        StringBuilder sb = new StringBuilder();
        sb.append(">");

        String identifier = rs.getIdentifier();
        if (header.isShowIdentifier() && identifier!=null && !"".equals(identifier)) {
            sb.append("gi|");
            sb.append(identifier);
            sb.append("|");
        }
        if(header.isShowNamespace()){
            sb.append((ns==null?rs.getNamespace().getName():ns.getName()));
            sb.append("|");
        }
        if(header.isShowAccession()){
            sb.append(rs.getAccession());
            if(header.isShowVersion()){
                sb.append(".");
            }
        }
        if(header.isShowVersion()){
            sb.append(rs.getVersion());
            sb.append("|");
        }
        if(header.isShowName()){
            sb.append(rs.getName());
            sb.append(" ");
        }else{
            sb.append(" "); //in case the show the description there needs to be space
        }
        if(header.isShowDescription()){
            String desc = rs.getDescription();
            if (desc!=null && !"".equals(desc)) sb.append(desc.replaceAll("\\n"," "));
        }
        if(sb.charAt(sb.length() -1) == '|'){
            sb.deleteCharAt(sb.length() -1);
        }
        this.getPrintStream().print(sb.toString());
        this.getPrintStream().println();

        int length = rs.length();

        for (int pos = 1; pos <= length; pos += this.getLineWidth()) {
            int end = Math.min(pos + this.getLineWidth() - 1, length);
            this.getPrintStream().println(rs.subStr(pos, end));
        }
    }

    /**
     * {@inheritDoc}
     */
    public String getDefaultFormat() {
        return FASTA_FORMAT;
    }

    public FastaHeader getHeader() {
        return header;
    }

    public void setHeader(FastaHeader header) {
        this.header = header;
    }
}
package org.biojavax.bio.seq.io;

import junit.framework.TestCase;
import org.biojava.bio.symbol.Alphabet;
import org.biojava.bio.symbol.AlphabetManager;
import org.biojava.bio.seq.io.SymbolTokenization;
import org.biojavax.bio.seq.RichSequence;
import org.biojavax.Namespace;
import org.biojavax.RichObjectFactory;
import org.biojavax.Note;
import org.biojavax.ontology.ComparableTerm;

import java.io.*;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.Set;
import java.util.Iterator;

/**
 * Tests for FastaFormat.
 *
 * @author Josh Goodman
 */
public class FastaFormatTest extends TestCase {

    private FastaFormat fastaFormat;

    /**
     * @see junit.framework.TestCase#setUp()
     */
    protected void setUp() {
        this.fastaFormat = new FastaFormat();
    }

    private RichSequence readFile(String filename) {
        InputStream inStream = this.getClass().getResourceAsStream(filename);
        BufferedReader br = new BufferedReader(new InputStreamReader(inStream));
        SymbolTokenization tokenization = RichSequence.IOTools.getProteinParser();
        Namespace namespace = RichObjectFactory.getDefaultNamespace();
        SimpleRichSequenceBuilder builder = new SimpleRichSequenceBuilder();
        RichSequence sequence = null;
        try {
            fastaFormat.readRichSequence(br, tokenization, builder, namespace);
            sequence = builder.makeRichSequence();
        } catch (Exception e) {
            e.printStackTrace();
            fail("Unexpected exception: "+e);
        }
        return sequence;
    }

    public void testCanReadFile() {
        try {
            URL file = this.getClass().getResource("/files/AAL039263.fa");
            assertTrue(fastaFormat.canRead(new File(file.toURI())));
        } catch (URISyntaxException e) {
            e.printStackTrace();
            fail("URI Syntax exception: " + e);
        } catch (IOException e) {
            e.printStackTrace();
            fail("IO exception: " + e);
        }
    }

    public void testCanReadStream() {
        try {
            InputStream stream = this.getClass().getResourceAsStream("/files/AAL039263.fa");
            assertTrue(fastaFormat.canRead(new BufferedInputStream(stream)));
            stream.close();
        } catch (IOException e) {
            e.printStackTrace();
            fail("IO exception: " + e);
        }
    }

    public void testGuessSymbolTokenizationFile() {
        try {
            URL file = this.getClass().getResource("/files/AAL039263.fa");
            Alphabet prot1 = AlphabetManager.alphabetForName("PROTEIN-TERM");
            Alphabet prot2 = fastaFormat.guessSymbolTokenization(new File(file.toURI())).getAlphabet();
            assertTrue(prot1.equals(prot2));
        } catch (IOException e) {
            e.printStackTrace();
            fail("IO exception: " + e);
        } catch (URISyntaxException e) {
            e.printStackTrace();
            fail("URI Syntax exception: " + e);
        }
    }

    public void testGuessSymbolTokenizationStream() {
        try {
            InputStream stream = this.getClass().getResourceAsStream("/files/AAL039263.fa");
            Alphabet prot1 = AlphabetManager.alphabetForName("PROTEIN-TERM");
            Alphabet prot2 = fastaFormat.guessSymbolTokenization(new BufferedInputStream(stream)).getAlphabet();
            assertTrue(prot1.equals(prot2));
        } catch (IOException e) {
            e.printStackTrace();
            fail("IO exception: " + e);
        }
    }

    public void testReadFastaFormat() {
        RichSequence sequence = readFile("/files/AAL039263.fa");
        assertNotNull(sequence);
        // JUnit convention: expected value first, actual value second.
        assertEquals("AAL39263", sequence.getName());
        assertEquals("AAL39263", sequence.getAccession());
        assertEquals(70, sequence.getInternalSymbolList().length());
    }
}