Hi Koen - There was a method in SeqIOTools that can (mostly) guess the alphabet of a file, but it is deprecated because there is no standard convention for file naming. ClustalW guesses by pre-reading the file and looking for symbols that are found in protein but don't occur in DNA. They claim its accuracy at guessing is in the high 90s, but I'm not sure how they calculate that number.
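For illustration, the pre-read idea described above could be sketched roughly as follows. This is not ClustalW's actual code and not a BioJava API: the class AlphabetGuesser, the Guess enum and the choice of "protein-only" letters (E, F, I, L, P, Q, i.e. amino-acid codes that are not IUPAC nucleotide codes) are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration only; not part of BioJava or ClustalW.
class AlphabetGuesser {

    enum Guess { DNA, RNA, PROTEIN, UNKNOWN }

    // Assumption: amino-acid letters that are neither bases (A, C, G, T, U)
    // nor IUPAC nucleotide ambiguity codes (R, Y, S, W, K, M, B, D, H, V, N).
    private static final String PROTEIN_ONLY = "EFILPQ";

    // Pre-reads at most maxResidues letters of FASTA text and guesses the alphabet.
    static Guess guess(Reader in, int maxResidues) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        int residues = 0;
        boolean sawT = false;
        boolean sawU = false;
        String line;
        while (residues < maxResidues && (line = reader.readLine()) != null) {
            if (line.startsWith(">")) {
                continue; // FASTA header line, not sequence data
            }
            for (int i = 0; i < line.length() && residues < maxResidues; i++) {
                char c = Character.toUpperCase(line.charAt(i));
                if (c < 'A' || c > 'Z') {
                    continue; // skip gaps, whitespace, '*', digits
                }
                residues++;
                if (PROTEIN_ONLY.indexOf(c) >= 0) {
                    return Guess.PROTEIN; // one protein-only symbol settles it
                }
                if (c == 'T') sawT = true;
                if (c == 'U') sawU = true;
            }
        }
        if (residues == 0) {
            return Guess.UNKNOWN; // nothing to base a guess on
        }
        // No protein-only symbol seen: assume nucleotides. This is only a guess,
        // since a short peptide can consist entirely of nucleotide letters.
        return (sawU && !sawT) ? Guess.RNA : Guess.DNA;
    }

    // Convenience overload for text that is already in memory.
    static Guess guess(String fasta) {
        try {
            return guess(new StringReader(fasta), 1000);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that the pre-read consumes the Reader, so the file would have to be reopened (or the stream marked and reset) before the real parse with the guessed alphabet.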
Basically there is no failsafe way to know whether a FASTA file is DNA or protein (or RNA). It's perfectly possible to have a short peptide that contains only A, C, G and T, although this becomes very unlikely with longer sequences. If you have control over the files you could adopt a naming convention (I use .fna for FASTA DNA and .faa for FASTA amino acid). An alternative is to let the user specify the format and alphabet as arguments to the program.

- Mark

Koen van der Drift <[EMAIL PROTECTED]>
Sent by: [EMAIL PROTECTED]
11/12/2004 06:21 AM
To: biojava-list <[EMAIL PROTECTED]>
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: [Biojava-l] opening unknown fasta file

Hi,

The BioJava tutorial (in anger) suggests the following code to open a fasta file:

[snip]
// get the appropriate Alphabet
Alphabet alpha = AlphabetManager.alphabetForName(args[1]);
// get a SequenceDB of all sequences in the file
SequenceDB db = SeqIOTools.readFasta(is, alpha);

But what should I do when I don't know if the fasta file contains a protein or dna sequence?

thanks,

- Koen.

_______________________________________________
Biojava-l mailing list - [EMAIL PROTECTED]
http://biojava.org/mailman/listinfo/biojava-l