Re: [Biojava-l] opening unknown fasta file

mark . schreiber Thu, 11 Nov 2004 18:04:06 -0800

Hi Koen -

There was a method in SeqIOTools that can (mostly) guess the alphabet of a 
file, but it is deprecated because there is no standard convention for file 
naming.  ClustalW guesses by pre-reading the file and looking for symbols 
that are found in protein but don't occur in DNA. They claim its accuracy 
at guessing is in the high 90s, but I'm not sure how they calculate that 
number.
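
A rough sketch of that kind of pre-reading heuristic, in plain Java with no 
BioJava dependency. This is not ClustalW's actual code; the class name, 
method names, the returned labels and the choice of "protein-only" letters 
(E, F, I, L, P, Q, which are amino acid codes but not IUPAC nucleotide 
codes) are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper: guesses "DNA", "RNA" or "PROTEIN" from FASTA text.
class FastaAlphabetGuesser {

    // One-letter amino acid codes that are not IUPAC nucleotide codes.
    private static final String PROTEIN_ONLY = "EFILPQ";

    // Scans the sequence lines and returns a best guess at the alphabet.
    static String guess(Reader in) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        boolean sawT = false;
        boolean sawU = false;
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith(">")) {
                continue; // skip FASTA description lines
            }
            for (char c : line.toUpperCase().toCharArray()) {
                if (PROTEIN_ONLY.indexOf(c) >= 0) {
                    return "PROTEIN"; // symbol cannot be a nucleotide
                }
                if (c == 'T') sawT = true;
                if (c == 'U') sawU = true;
            }
        }
        // No protein-only symbol seen: U without T suggests RNA, else DNA.
        return (sawU && !sawT) ? "RNA" : "DNA";
    }

    // Convenience wrapper for text that is already in memory.
    static String guessText(String fasta) {
        try {
            return guess(new StringReader(fasta));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, FastaAlphabetGuesser.guessText(">p\nMKVLAAGIVG\n") returns 
"PROTEIN" because of the L, while a peptide made up only of A, C, G and T 
is reported as "DNA", so the result is a guess and nothing more.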


Basically there is absolutely no failsafe way to know if a fasta file is 
DNA or protein (or RNA). It's perfectly reasonable to have a short peptide 
which contains only A, C, G and T, although it becomes very unlikely with 
longer sequences. If you have control over the files you could adopt some 
naming specification (I use .fna for fasta DNA or .faa for fasta amino 
acid). An alternative is to allow the specification of format and alphabet 
in the arguments to the program.
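
A minimal sketch of combining the two options: an alphabet given on the 
command line wins, otherwise the file extension decides. The class and 
method names are hypothetical, and it assumes the names "DNA" and 
"PROTEIN" are what you would then pass to AlphabetManager.alphabetForName 
as in the code you quoted.

```java
import java.util.Locale;

// Hypothetical helper: picks an alphabet name for a FASTA file.
class AlphabetChooser {

    // explicitAlphabet may be null or empty when no argument was supplied.
    static String alphabetName(String fileName, String explicitAlphabet) {
        if (explicitAlphabet != null && !explicitAlphabet.isEmpty()) {
            // An explicit argument always overrides the naming convention.
            return explicitAlphabet.toUpperCase(Locale.ROOT);
        }
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".fna")) {
            return "DNA";      // .fna = fasta nucleic acid
        }
        if (lower.endsWith(".faa")) {
            return "PROTEIN";  // .faa = fasta amino acid
        }
        // Refuse to guess when neither source of information is available.
        throw new IllegalArgumentException(
            "Cannot infer alphabet from file name: " + fileName
            + "; pass the alphabet explicitly");
    }
}
```

Failing loudly for an unrecognised extension keeps a wrong alphabet from 
being silently applied to the file.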

- Mark





Koen van der Drift <[EMAIL PROTECTED]>
Sent by: [EMAIL PROTECTED]
11/12/2004 06:21 AM

 
        To:     biojava-list <[EMAIL PROTECTED]>
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        [Biojava-l] opening unknown fasta file


Hi,

The BioJava tutorial (in anger) suggests the following code to open a 
fasta file:

[snip]

    // get the appropriate Alphabet
    Alphabet alpha = AlphabetManager.alphabetForName(args[1]);

    // get a SequenceDB of all sequences in the file
    SequenceDB db = SeqIOTools.readFasta(is, alpha);


But what should I do when I don't know if the fasta file contains a 
protein or dna sequence?


thanks,

- Koen.

_______________________________________________
Biojava-l mailing list  -  [EMAIL PROTECTED]
http://biojava.org/mailman/listinfo/biojava-l


