Re: [Biojava-l] FASTA parsing bug ?

Josh Goodman Fri, 02 Oct 2009 08:00:24 -0700

Hi all,

I apologize for missing the "next day or so" window, but here is my patch.  I'm
attaching a patch file for FastaFormat.java (1.7 tagged branch), the full source
file, and a test class.  It seems to work, and performance is on par with the
previous approach in my measurements.  One problem that I couldn't figure out a
way around is with the guessSymbolTokenization(BufferedInputStream stream)
method.  If, for the first sequence of the stream, the header length plus the
length of the first line of sequence is longer than 2000 characters, the stream
will fail to reset properly.


Cheers,
Josh

Richard Holland wrote:
> I'd love to see a proper solution to this that doesn't involve upping
> the read-ahead limit. I was aware that it might be the issue, but had no
> idea why it was not failing for other similar long sequences. I look
> forward to seeing your suggested fix!
> 
> thanks,
> Richard
> 
> Josh Goodman wrote:
>> Hi Richard and JP,
>>
>> I think I can be of some help as I'm the FlyBase developer responsible for 
>> generating these troublesome FASTA files :-).  The cause of this problem 
>> appears to be the description line length for the record FBpp0145470.
>>
>> The trouble lies in org.biojavax.bio.seq.io.FastaFormat in the while loop 
>> at line 196.  Biojava correctly reads in FBpp0145468 but throws an error 
>> when trying to parse FBpp0145469.  There is nothing wrong in FBpp0145469 
>> but when biojava reaches the end of the sequence it reads in the header 
>> for the next record (FBpp0145470).  It then tries to reset the 
>> BufferedReader to the start of FBpp0145470 but that is where the exception 
>> is thrown because line 197 sets the read ahead limit to 500 characters and 
>> the reader.readLine() command exceeds that limit.
>>
>> What isn't obvious to me is why other large definition lines that precede 
>> that line don't throw the same error (e.g. FBpp0157909).  I guess the 
>> javadoc on BufferedReader.mark() does say "may fail" but I assumed it 
>> would be more predictable than that.
>>
>> The file in question can be downloaded from 
>> ftp://ftp.flybase.net/genomes/Drosophila_grimshawi/dgri_r1.3_FB2008_07/fasta/dgri-all-translation-r1.3.fasta.gz.
>>
>> If there is interest in a solution that doesn't involve simply upping the 
>> read ahead limit I can put a patch file together in the next day or so.
>>
>> Cheers,
>> Josh
>>
>> On Tue, 28 Apr 2009, Richard Holland wrote:
>>
>>> You're right, doesn't look like newlines.
>>>
>>> The "Mark invalid" happens when the parser looks too far ahead in the
>>> file attempting to seek out the next valid sequence to parse. I'm not
>>> sure why this is happening.
>>>
>>> I don't have the time to test right now but if you could post the link
>>> to where someone could download the same FASTA as you're using, then it
>>> would make it possible for someone else to investigate in more detail.
>>>
>>> thanks,
>>> Richard
>>>
>>> JP wrote:
>>>> Thanks Richard for your prompt reply.
>>>>
>>>> I will not attach the fasta file I am parsing (12MB); it is
>>>> dgri-all-translation-r1.3.fasta from the flybase project.
>>>>
>>>> If the file had any extra new lines I would see them when I loaded it in
>>>> a text editor - no ?
>>>>
>>>> I implemented the whole thing without using Biojava (for this part)
>>>>
>>>>     fr = new FileReader(fastaProteinFileName);
>>>>     br = new BufferedReader(fr);
>>>>     String fastaLine;
>>>>     String startAccession = '>' + accessionId.trim();
>>>>     String fastaEntry = "";
>>>>     boolean record = false;
>>>>     while ((fastaLine = br.readLine()) != null) {
>>>>         fastaLine = fastaLine.trim() + '\n';
>>>>         if (fastaLine.startsWith(startAccession)) {
>>>>             record = true;
>>>>         } else if (record && fastaLine.startsWith(">")) {
>>>>             record = false;
>>>>             break;
>>>>         }
>>>>         if (record) {
>>>>             fastaEntry += fastaLine;
>>>>         }
>>>>     }
>>>>
>>>>
>>>> Notice - I do not use regex - since I'd need to read the whole file and
>>>> then regex upon it (if the record is the first one - I just read that one).
>>>>
>>>> Cheers
>>>> JP
>>>>
>>>>
>>>> On Tue, Apr 28, 2009 at 3:27 PM, Richard Holland
>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>
>>>>     The "Mark invalid" exception is indicating that the parser has gone too
>>>>     far ahead in the file looking for a valid header. I'm not sure why but
>>>>     looking at your original query, there may be extra newlines embedded
>>>>     into your FASTA header line? That would definitely confuse it.
>>>>
>>>>     The parser is not able to currently pull out just one sequence - in
>>>>     effect this is a search facility, which it doesn't have. :(
>>>>
>>>>     thanks,
>>>>     Richard
>>>>
>>>>     JP wrote:
>>>>     > Hi all at BioJava,
>>>>     >
>>>>     > I am trying to parse several FASTA files using the following code:
>>>>     >
>>>>     >> fr = new FileReader(fastaProteinFileName);
>>>>     >> br = new BufferedReader(fr);
>>>>     >>
>>>>     >> RichSequenceIterator protIter = IOTools.readFastaProtein(br, null);
>>>>     >> while (protIter.hasNext()) {
>>>>     >>      BioEntry bioEntry = protIter.nextBioEntry();
>>>>     >>      System.out.println (fastaProteinFileName + " == " + accessionId + " = " + bioEntry.getAccession());
>>>>     >> }
>>>>     >
>>>>     >
>>>>     > At particular points in my fasta file - I get the following exception:
>>>>     >
>>>>     >> 14:53:42,546 ERROR FastaFileProcessing  - File parsing exception (from biojava library)
>>>>     >> org.biojava.bio.BioException: Could not read sequence
>>>>     >>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
>>>>     >>     at org.biojavax.bio.seq.io.RichStreamReader.nextBioEntry(RichStreamReader.java:99)
>>>>     >>     at edu.imperial.msc.orthologue.fasta.FastaFileProcessing.getProteinSequenceFromFASTAFile(FastaFileProcessing.java:60)
>>>>     >>     at edu.imperial.msc.orthologue.core.OrthologueFinder.getFASTAEntries(OrthologueFinder.java:64)
>>>>     >>     at edu.imperial.msc.orthologue.core.OrthologueFinder.<init>(OrthologueFinder.java:51)
>>>>     >>     at edu.imperial.msc.orthologue.launcher.OrthologueFinderLauncher.main(OrthologueFinderLauncher.java:60)
>>>>     >> Caused by: java.io.IOException: Mark invalid
>>>>     >>     at java.io.BufferedReader.reset(Unknown Source)
>>>>     >>     at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
>>>>     >>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
>>>>     >>     ... 5 more
>>>>     >
>>>>     >
>>>>     > Interestingly if I delete the header portion of the header line (from
>>>>     > type=protein... till the end of the line ...Dgri;)
>>>>     >
>>>>     >> FBpp0145468 type=protein;
>>>>     >> loc=scaffold_15252:join(13219687..13219727,13219972..13220279,13220507..13220798,13220861..13221180,13221286..13221467,13222258..13222629,13226331..13226463,13226531..13226658);
>>>>     >> ID=FBpp0145468; name=Dgri\GH11562-PA; parent=FBgn0119042,FBtr0146976;
>>>>     >> dbxref=FlyBase:FBpp0145468,FlyBase_Annotation_IDs:GH11562-PA;
>>>>     >> MD5=c8dc38c7197a0d3c93c78b08059e2604; length=591; release=r1.3;
>>>>     >> species=Dgri;
>>>>     >>
>>>>     >
>>>>     > It works - but I have a number of these exceptions (and I do not want to
>>>>     > edit the original data).  Mind you I have longer headers in my file which
>>>>     > are parsed OK (strange!).
>>>>     >
>>>>     > Any ideas anyone ?  Alternatively - is there a better way to get ONE
>>>>     > SINGLE sequence from the whole fasta file, given that I have the
>>>>     > accession id (FBpp0145468) ?
>>>>     >
>>>>     > Many Thanks
>>>>     > JP
>>>>     > _______________________________________________
>>>>     > Biojava-l mailing list  -  [email protected]
>>>>     <mailto:[email protected]>
>>>>     > http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>     >
>>>>
>>>>     --
>>>>     Richard Holland, BSc MBCS
>>>>     Finance Director, Eagle Genomics Ltd
>>>>     T: +44 (0)1223 654481 ext 3 | E: [email protected]
>>>>     <mailto:[email protected]>
>>>>     http://www.eaglegenomics.com/
>>>>
>>>>
>>> -- 
>>> Richard Holland, BSc MBCS
>>> Finance Director, Eagle Genomics Ltd
>>> T: +44 (0)1223 654481 ext 3 | E: [email protected]
>>> http://www.eaglegenomics.com/
>>> _______________________________________________
>>> Biojava-l mailing list  -  [email protected]
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>

Index: src/org/biojavax/bio/seq/io/FastaFormat.java
===================================================================
--- src/org/biojavax/bio/seq/io/FastaFormat.java	(revision 6947)
+++ src/org/biojavax/bio/seq/io/FastaFormat.java	Tue May 12 15:46:40 EDT 2009
@@ -21,17 +21,6 @@
 
 package org.biojavax.bio.seq.io;
 
-import java.io.BufferedInputStream;
-import java.io.BufferedReader;
-import java.io.File;
-import java.io.FileReader;
-import java.io.IOException;
-import java.io.InputStreamReader;
-import java.io.PrintStream;
-import java.util.Map;
-import java.util.regex.Matcher;
-import java.util.regex.Pattern;
-
 import org.biojava.bio.seq.Sequence;
 import org.biojava.bio.seq.io.ParseException;
 import org.biojava.bio.seq.io.SeqIOListener;
@@ -46,7 +35,12 @@
 import org.biojavax.SimpleNamespace;
 import org.biojavax.bio.seq.RichSequence;
 
+import java.io.*;
+import java.nio.CharBuffer;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
 
+
 /**
  * Format object representing FASTA files. These files are almost pure
  * sequence data.
@@ -115,10 +109,11 @@
 	 * A stream is in FASTA format if the stream starts with ">".
 	 */
 	public boolean canRead(BufferedInputStream stream) throws IOException {
-		stream.mark(2000); // some streams may not support this
+		stream.mark(5); // some streams may not support this
 		BufferedReader br = new BufferedReader(new InputStreamReader(stream));
-		String firstLine = br.readLine();
-		boolean readable = firstLine!=null && firstLine.startsWith(">");
+		CharBuffer cb = CharBuffer.allocate(1);
+		int numChars = br.read(cb);
+		boolean readable = numChars > 0 && cb.get(0) == '>';
 		// don't close the reader as it'll close the stream too.
 		// br.close();
 		stream.reset();
@@ -191,20 +186,35 @@
 
 		processHeader(line,rsiol,ns);
 
-		StringBuffer seq = new StringBuffer();
-		boolean hasMoreSeq = true;
+		StringBuffer seq = new StringBuffer();			//Buffer to hold sequence data.
+		CharBuffer cb = CharBuffer.allocate(1);			//CharBuffer to hold a single character look ahead.
+		boolean hasMoreSeq = true;						//Boolean to control iterating over sequence lines.
+		int numChars = 0;								//Initializing int to track read status.
+
 		while (hasMoreSeq) {
-			reader.mark(500);
+			//Clear the look-ahead buffer, mark, read one character, then reset.
+			cb.clear();		//Without this, a full CharBuffer makes read(cb) return 0.
+			reader.mark(5);
+			numChars = reader.read(cb);
+			reader.reset();
+
+			//Do this if we successfully read in a character.
+			if (numChars > 0) {
+				//Exit the while loop if we reach the start of the next FASTA record.
+				if (cb.get(0) == '>') {
+					hasMoreSeq = false;
+				}
+				//Otherwise read the entire line and store the sequence data.
+				else {
-			line = reader.readLine();
+					line = reader.readLine();
-			if (line!=null) {
+					if (line != null) {
-				line = line.trim();
+						line = line.trim();
-				if (line.length() > 0 && line.charAt(0)=='>') {
-					reader.reset();
-					hasMoreSeq = false;
-				} else {
-					seq.append(line);
-				}
+						seq.append(line);
+					}
-			} else {
+				}
+			}
+			//Exit the while loop if we have reached the end of the file.
+			else {
 				hasMoreSeq = false;
 			}
 		}
@@ -225,7 +235,7 @@
 
 		rsiol.endSequence();
 
-		return line!=null;
+		return numChars > 0;
 	}
 
 	/** Parse the Header information from the Fasta Description line

/*
 *                    BioJava development code
 *
 * This code may be freely distributed and modified under the
 * terms of the GNU Lesser General Public Licence.  This should
 * be distributed with the code.  If you do not have a copy,
 * see:
 *
 *      http://www.gnu.org/copyleft/lesser.html
 *
 * Copyright for this code is held jointly by the individual
 * authors.  These should be listed in @author doc comments.
 *
 * For more information on the BioJava project and its aims,
 * or to join the biojava-l mailing list, visit the home page
 * at:
 *
 *      http://www.biojava.org/
 *
 */

package org.biojavax.bio.seq.io;

import org.biojava.bio.seq.Sequence;
import org.biojava.bio.seq.io.ParseException;
import org.biojava.bio.seq.io.SeqIOListener;
import org.biojava.bio.seq.io.SymbolTokenization;
import org.biojava.bio.symbol.IllegalSymbolException;
import org.biojava.bio.symbol.SimpleSymbolList;
import org.biojava.bio.symbol.Symbol;
import org.biojava.bio.symbol.SymbolList;
import org.biojava.utils.ChangeVetoException;
import org.biojavax.Namespace;
import org.biojavax.RichObjectFactory;
import org.biojavax.SimpleNamespace;
import org.biojavax.bio.seq.RichSequence;

import java.io.*;
import java.nio.CharBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;


/**
 * Format object representing FASTA files. These files are almost pure
 * sequence data.
 * @author Thomas Down
 * @author Matthew Pocock
 * @author Greg Cox
 * @author Lukas Kall
 * @author Richard Holland
 * @author Mark Schreiber
 * @since 1.5
 */

public class FastaFormat extends RichSequenceFormat.HeaderlessFormat {

	// Register this format with the format auto-guesser.
	static {
		RichSequence.IOTools.registerFormat(FastaFormat.class);
	}

	/**
	 * The name of this format
	 */
	public static final String FASTA_FORMAT = "FASTA";

	// header line
	protected static final Pattern hp = Pattern.compile(">\\s*(\\S+)(\\s+(.*))?");
	// description chunk
	protected static final Pattern dp = Pattern.compile( "^(gi\\|(\\d+)\\|)?(\\w+)\\|(\\w+?)(\\.(\\d+))?\\|(\\w+)?$");

	protected static final Pattern readableFiles = Pattern.compile(".*(fa|fas)$");
	protected static final Pattern aminoAcids = Pattern.compile(".*[FLIPQE].*");

	private FastaHeader header = new FastaHeader();

	/**
	 * {@inheritDoc}
	 * A file is in FASTA format if the name ends with fa or fas, or the file starts with ">".
	 */
	@Override
	public boolean canRead(File file) throws IOException {
		if (readableFiles.matcher(file.getName()).matches()) return true;
		BufferedReader br = new BufferedReader(new FileReader(file));
		String firstLine = br.readLine();
		boolean readable = firstLine!=null && firstLine.startsWith(">");
		br.close();
		return readable;
	}

	/**
	 * {@inheritDoc}
	 * Returns a protein parser if the first line of sequence contains any of F/L/I/P/Q/E, 
	 * otherwise returns a DNA tokenizer.
	 */
	@Override
	public SymbolTokenization guessSymbolTokenization(File file) throws IOException {
		BufferedReader br = new BufferedReader(new FileReader(file));
		br.readLine(); // discard first line
		boolean aa = aminoAcids.matcher(br.readLine()).matches();
		br.close();
		if (aa) return RichSequence.IOTools.getProteinParser();
		else return RichSequence.IOTools.getDNAParser();
	}

	/**
	 * {@inheritDoc}
	 * A stream is in FASTA format if the stream starts with ">".
	 */
	public boolean canRead(BufferedInputStream stream) throws IOException {
		stream.mark(5); // some streams may not support this
		BufferedReader br = new BufferedReader(new InputStreamReader(stream));
		CharBuffer cb = CharBuffer.allocate(1);
		int numChars = br.read(cb);
		boolean readable = numChars > 0 && cb.get(0) == '>';
		// don't close the reader as it'll close the stream too.
		// br.close();
		stream.reset();
		return readable;
	}

	/**
	 * {@inheritDoc}
	 * Returns a protein parser if the first line of sequence contains any of F/L/I/P/Q/E, 
	 * otherwise returns a DNA tokenizer.
	 */
	public SymbolTokenization guessSymbolTokenization(BufferedInputStream stream) throws IOException {
		stream.mark(2000); // some streams may not support this
		BufferedReader br = new BufferedReader(new InputStreamReader(stream));
		br.readLine(); // discard first line
		boolean aa = aminoAcids.matcher(br.readLine()).matches();
		// don't close the reader as it'll close the stream too.
		// br.close();
		stream.reset();
		if (aa) return RichSequence.IOTools.getProteinParser();
		else return RichSequence.IOTools.getDNAParser();
	}

	/**
	 * {@inheritDoc}
	 */
	public boolean readSequence(
			BufferedReader reader,
			SymbolTokenization symParser,
			SeqIOListener listener
	)	throws
	IllegalSymbolException,
	IOException,
	ParseException {
		if (!(listener instanceof RichSeqIOListener)) throw new IllegalArgumentException("Only accepting RichSeqIOListeners today");
		return this.readRichSequence(reader,symParser,(RichSeqIOListener)listener,null);
	}

	/**
	 * {@inheritDoc}
	 * If namespace is null, then the namespace of the sequence in the fasta is used.
	 * If the namespace is null and so is the namespace of the sequence in the fasta,
	 * then the default namespace is used.
	 */
	public boolean readRichSequence(
			BufferedReader reader,
			SymbolTokenization symParser,
			RichSeqIOListener rsiol,
			Namespace ns
	)	throws
	IllegalSymbolException,
	IOException,
	ParseException {

		String line = reader.readLine();
		if (line == null) {
			throw new IOException("Premature stream end");
		}
		while(line.length() == 0) {
			line = reader.readLine();
			if (line == null) {
				throw new IOException("Premature stream end");
			}
		}
		if (!line.startsWith(">")) {
			throw new IOException("Stream does not appear to contain FASTA formatted data: " + line);
		}

		rsiol.startSequence();

		processHeader(line,rsiol,ns);

		StringBuffer seq = new StringBuffer();			//Buffer to hold sequence data.
		CharBuffer cb = CharBuffer.allocate(1);			//CharBuffer to hold a single character look ahead.
		boolean hasMoreSeq = true;						//Boolean to control iterating over sequence lines.
		int numChars = 0;								//Initializing int to track read status.

		while (hasMoreSeq) {
			//Clear the look-ahead buffer, mark, read one character, then reset.
			cb.clear();		//Without this, a full CharBuffer makes read(cb) return 0.
			reader.mark(5);
			numChars = reader.read(cb);
			reader.reset();

			//Do this if we successfully read in a character.
			if (numChars > 0) {
				//Exit the while loop if we reach the start of the next FASTA record.
				if (cb.get(0) == '>') {
					hasMoreSeq = false;
				}
				//Otherwise read the entire line and store the sequence data.
				else {
					line = reader.readLine();
					if (line != null) {
						line = line.trim();
						seq.append(line);
					}
				}
			}
			//Exit the while loop if we have reached the end of the file.
			else {
				hasMoreSeq = false;
			}
		}
		if (!this.getElideSymbols()) {
			try {
				SymbolList sl = new SimpleSymbolList(symParser,
						seq.toString().replaceAll("\\s+","").replaceAll("[\\.|~]","-"));
				rsiol.addSymbols(symParser.getAlphabet(),
						(Symbol[])(sl.toList().toArray(new Symbol[0])),
						0, sl.length());
			} catch (Exception e) {
				// do not know name and gi any longer, replace them with empty string.
				// why does the rsiol only have setter methods, but not getter???
				String message = ParseException.newMessage(this.getClass(), "", "", "problem parsing symbols", seq.toString());
				throw new ParseException(e, message);
			}
		}

		rsiol.endSequence();

		return numChars > 0;
	}

	/** Parse the Header information from the Fasta Description line
	 * 
	 * @param line
	 * @param rsiol
	 * @param ns
	 * @throws IOException
	 * @throws ParseException
	 */
	public void processHeader(String line,RichSeqIOListener rsiol,Namespace ns) 
	throws IOException, ParseException {
		Matcher m = hp.matcher(line);
		if (!m.matches()) {
			throw new IOException("Stream does not appear to contain FASTA formatted data: " + line);
		}

		String name = m.group(1);
		String desc = m.group(3);
		String gi = null;

		m = dp.matcher(name);
		if (m.matches()) {
			gi = m.group(2);
			String namespace = m.group(3);
			String accession = m.group(4);
			String verString = m.group(6);
			int version = verString==null?0:Integer.parseInt(verString);
			name = m.group(7);
			if (name==null) name=accession;

			rsiol.setAccession(accession);
			rsiol.setVersion(version);
			if (gi!=null) rsiol.setIdentifier(gi);
			if (ns==null) rsiol.setNamespace((Namespace)RichObjectFactory.getObject(SimpleNamespace.class,new Object[]{namespace}));
			else rsiol.setNamespace(ns);
		} else {
			rsiol.setAccession(name);
			rsiol.setNamespace((ns==null?RichObjectFactory.getDefaultNamespace():ns));
		}
		rsiol.setName(name);
		if (!this.getElideComments()) rsiol.setDescription(desc);

	}

	/**
	 * {@inheritDoc}
	 */
	public void	writeSequence(Sequence seq, PrintStream os) throws IOException {
		if (this.getPrintStream()==null) this.setPrintStream(os);
		this.writeSequence(seq, RichObjectFactory.getDefaultNamespace());
	}

	/**
	 * {@inheritDoc}
	 */
	public void writeSequence(Sequence seq, String format, PrintStream os) throws IOException {
		if (this.getPrintStream()==null) this.setPrintStream(os);
		if (!format.equals(this.getDefaultFormat())) throw new IllegalArgumentException("Unknown format: "+format);
		this.writeSequence(seq, RichObjectFactory.getDefaultNamespace());
	}


	/**
	 * {@inheritDoc}
	 * If namespace is null, then the sequence's own namespace is used.
	 */
	public void writeSequence(Sequence seq, Namespace ns) throws IOException {
		RichSequence rs;
		try {
			if (seq instanceof RichSequence) rs = (RichSequence)seq;
			else rs = RichSequence.Tools.enrich(seq);
		} catch (ChangeVetoException e) {
			IOException e2 = new IOException("Unable to enrich sequence");
			e2.initCause(e);
			throw e2;
		}

		StringBuilder sb = new StringBuilder();
		sb.append(">");

		String identifier = rs.getIdentifier();
		if (header.isShowIdentifier() && identifier!=null && !"".equals(identifier)) {
			sb.append("gi|");
			sb.append(identifier);
			sb.append("|");
		}
		if(header.isShowNamespace()){
			sb.append((ns==null?rs.getNamespace().getName():ns.getName()));
			sb.append("|");
		}
		if(header.isShowAccession()){
			sb.append(rs.getAccession());
			if(header.isShowVersion()){
				sb.append(".");
			}
		}
		if(header.isShowVersion()){
			sb.append(rs.getVersion());
			sb.append("|");
		}
		if(header.isShowName()){
			sb.append(rs.getName());
			sb.append(" ");
		}else{
			sb.append(" "); //in case the show the description there needs to be space
		}
		if(header.isShowDescription()){
			String desc = rs.getDescription();
			if (desc!=null && !"".equals(desc)) sb.append(desc.replaceAll("\\n"," "));
		}
		if(sb.charAt(sb.length() -1) == '|'){
			sb.deleteCharAt(sb.length() -1);
		}
		this.getPrintStream().print(sb.toString());
		this.getPrintStream().println();

		int length = rs.length();

		for (int pos = 1; pos <= length; pos += this.getLineWidth()) {
			int end = Math.min(pos + this.getLineWidth() - 1, length);
			this.getPrintStream().println(rs.subStr(pos, end));
		}
	}

	/**
	 * {@inheritDoc}
	 */
	public String getDefaultFormat() {
		return FASTA_FORMAT;
	}

	public FastaHeader getHeader() {
		return header;
	}

	public void setHeader(FastaHeader header) {
		this.header = header;
	}
}

package org.biojavax.bio.seq.io;

import junit.framework.TestCase;
import org.biojava.bio.symbol.Alphabet;
import org.biojava.bio.symbol.AlphabetManager;
import org.biojava.bio.seq.io.SymbolTokenization;
import org.biojavax.bio.seq.RichSequence;
import org.biojavax.Namespace;
import org.biojavax.RichObjectFactory;
import org.biojavax.Note;
import org.biojavax.ontology.ComparableTerm;

import java.io.*;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.Set;
import java.util.Iterator;

/**
 * Tests for FastaFormat.
 *
 * @author Josh Goodman
 */
public class FastaFormatTest extends TestCase {
    private FastaFormat fastaFormat;

    /**
     * @see junit.framework.TestCase#setUp()
     */
    protected void setUp() {
        this.fastaFormat = new FastaFormat();
    }

	private RichSequence readFile(String filename) {
        InputStream inStream = this.getClass().getResourceAsStream(filename);
        BufferedReader br = new BufferedReader(new InputStreamReader(inStream));
        SymbolTokenization tokenization = RichSequence.IOTools.getProteinParser();
        Namespace namespace = RichObjectFactory.getDefaultNamespace();
        SimpleRichSequenceBuilder builder = new SimpleRichSequenceBuilder();
        RichSequence sequence = null;
        try {
            fastaFormat.readRichSequence(br, tokenization, builder, namespace);
            sequence = builder.makeRichSequence();
        } catch (Exception e) {
            e.printStackTrace();
            fail("Unexpected exception: "+e);
        }
		return sequence;
	}
	
	public void testCanReadFile() {
		try {
			URL file = this.getClass().getResource("/files/AAL039263.fa");
			assertTrue(fastaFormat.canRead(new File(file.toURI())));
		} catch (URISyntaxException e) {
			e.printStackTrace();
			fail("URI Syntax exception: " + e);
		} catch (IOException e) {
			e.printStackTrace();
			fail("IO exception: " + e);
		}
	}

	public void testCanReadStream() {
		try {
			InputStream stream = this.getClass().getResourceAsStream("/files/AAL039263.fa");
			assertTrue(fastaFormat.canRead(new BufferedInputStream(stream)));
			stream.close();
		} catch (IOException e) {
			e.printStackTrace();
			fail("IO exception: " + e);
		}
	}

	public void testGuessSymbolTokenizationFile() {
		try {
			URL file = this.getClass().getResource("/files/AAL039263.fa");
			Alphabet prot1 = AlphabetManager.alphabetForName("PROTEIN-TERM");
			Alphabet prot2 = fastaFormat.guessSymbolTokenization(new File(file.toURI())).getAlphabet();
			assertTrue(prot1.equals(prot2));
		} catch (IOException e) {
			e.printStackTrace();
			fail("IO exception: " + e);
		} catch (URISyntaxException e) {
			e.printStackTrace();
			fail("URI Syntax exception: " + e);
		}
	}

	public void testGuessSymbolTokenizationStream() {
		try {
			InputStream stream = this.getClass().getResourceAsStream("/files/AAL039263.fa");
			Alphabet prot1 = AlphabetManager.alphabetForName("PROTEIN-TERM");
			Alphabet prot2 = fastaFormat.guessSymbolTokenization(new BufferedInputStream(stream)).getAlphabet();
			assertTrue(prot1.equals(prot2));
		} catch (IOException e) {
			e.printStackTrace();
			fail("IO exception: " + e);
		}
	}

    public void testReadFastaFormat() {
        RichSequence sequence = readFile("/files/AAL039263.fa");
        assertNotNull(sequence);
        assertEquals(sequence.getName(), "AAL39263");
        assertEquals(sequence.getAccession(), "AAL39263");
        assertEquals(sequence.getInternalSymbolList().length(), 70);
    }


}
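
The attached test reads a single record. As a complement, here is a minimal stdlib-only sketch of the same single-character look-ahead idea applied to a multi-record input with multi-line sequences and a header far longer than the old 500-character limit. It is not the patch itself: it peeks with Reader.read() returning an int instead of a CharBuffer, and the names FastaPeekSketch, peek, and readAll are hypothetical. Java 8 or later is assumed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class FastaPeekSketch {
    // Small read-ahead limit, mirroring the value used in the patch.
    private static final int LOOKAHEAD = 5;

    /** Returns the next character without consuming it, or -1 at end of stream. */
    static int peek(BufferedReader reader) throws IOException {
        reader.mark(LOOKAHEAD);
        int c = reader.read();
        reader.reset();
        return c;
    }

    /** Parses FASTA text into "header<TAB>sequence" strings, one per record. */
    static List<String> readAll(String fasta) {
        List<String> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(fasta))) {
            String header;
            while ((header = reader.readLine()) != null) {
                if (header.isEmpty()) {
                    continue; // skip blank lines between records
                }
                if (!header.startsWith(">")) {
                    throw new IllegalArgumentException("Not a FASTA header: " + header);
                }
                StringBuilder seq = new StringBuilder();
                int next;
                // Only one character is ever read ahead, so header length is irrelevant.
                while ((next = peek(reader)) != -1 && next != '>') {
                    seq.append(reader.readLine().trim());
                }
                records.add(header.substring(1) + "\t" + seq);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        String fasta = ">r1 short\nMKT\nLLV\n>r2 another\nACDE\n";
        for (String record : readAll(fasta)) {
            System.out.println(record);
        }
    }
}
```

Because the next header line is never read before the mark is reset, the reader is always left positioned at the '>' of the following record, whatever the length of that record's description.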

_______________________________________________
Biojava-l mailing list  -  [email protected]
http://lists.open-bio.org/mailman/listinfo/biojava-l
