Obviously, take my input for what it's worth: I'm a programmer by trade with an interest in genetics, so I lean towards (and understand better) the computer science aspects of these discussions. I hope my humble suggestions are at least somewhat helpful. Based on my understanding of what is being discussed in this thread, however, you should be able to solve this particular scenario programmatically (not algorithmically). I could look at it further (an API/design-based or pattern-based solution) when I get a chance, if anyone thinks it worthwhile.
just my thoughts,
jess vermont chicago
Universes of virtually unlimited complexity can be created in the form of computer programs. (Joseph Weizenbaum)
From: Thomas Down <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
CC: biojava-list <[EMAIL PROTECTED]>
Subject: Re: [Biojava-l] opening unknown fasta file
Date: Fri, 12 Nov 2004 16:26:05 +0000
On Fri, Nov 12, 2004 at 10:01:13AM +0800, [EMAIL PROTECTED] wrote:
>
> Basically there is absolutely no failsafe way to know if a fasta file is
> DNA or Protein (or RNA). It's perfectly reasonable to have a short peptide
> which contains only A, C, G and T, although it becomes very unlikely with
> longer sequences.
The real problem isn't A, C, G, or T, but the other 11 ambiguity symbols that appear in DNA sequences. Ns are everywhere, but many of the other ambiguities appear from time to time, too.
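To make the overlap concrete, here is a minimal sketch that intersects the 15 IUPAC nucleotide codes with the one-letter codes of the 20 standard amino acids. The class and method names are hypothetical helpers for illustration only, not part of BioJava.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not a BioJava class.
class SymbolOverlap {
    // The 15 IUPAC nucleotide codes: A, C, G, T plus the 11 ambiguity symbols.
    static final String IUPAC_DNA = "ACGTRYSWKMBDHVN";
    // One-letter codes of the 20 standard amino acids.
    static final String AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY";

    // Collect the characters of a string into a sorted set.
    static Set<Character> toSet(String symbols) {
        Set<Character> set = new TreeSet<>();
        for (char c : symbols.toCharArray()) {
            set.add(c);
        }
        return set;
    }

    // DNA symbols that are also valid standard amino-acid letters.
    static Set<Character> shared() {
        Set<Character> shared = toSet(IUPAC_DNA);
        shared.retainAll(toSet(AMINO_ACIDS));
        return shared;
    }

    // True if every letter of the sequence is an IUPAC nucleotide code.
    // This says the sequence *could* be DNA, not that it is DNA.
    static boolean couldBeDna(String seq) {
        for (char c : seq.toCharArray()) {
            if (IUPAC_DNA.indexOf(Character.toUpperCase(c)) < 0) {
                return false;
            }
        }
        return true;
    }
}
```

Every nucleotide code except B is also a standard amino-acid letter, so a "contains only DNA symbols" test such as couldBeDna cannot rule out a peptide on its own.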
If we were *really* serious about alphabet-guessing (which scares me, to be
honest), one option would be to calculate histograms of character frequencies
in EMBL and Swissprot, and look for the closest match. I believe that
Internet Explorer takes this approach when it hits a web page without an
explicitly-specified character encoding -- it apparently works pretty well...
Does anyone feel this serious?
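For anyone who does, a minimal sketch of the histogram idea might look like the following. It assumes reference histograms are built from training text that the caller supplies (for example, sequence dumps from EMBL and Swissprot) and that "closest match" means smallest squared Euclidean distance between letter-frequency vectors; both choices are assumptions for illustration. AlphabetGuesser is a hypothetical class, not an existing BioJava API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class for illustration; not part of BioJava.
class AlphabetGuesser {
    // Label (e.g. "DNA") mapped to a reference letter-frequency histogram.
    private final Map<String, double[]> references = new LinkedHashMap<>();

    // Relative frequency of each letter A-Z; all other characters are ignored.
    static double[] histogram(String seq) {
        double[] freq = new double[26];
        int total = 0;
        for (char raw : seq.toCharArray()) {
            char c = Character.toUpperCase(raw);
            if (c >= 'A' && c <= 'Z') {
                freq[c - 'A']++;
                total++;
            }
        }
        if (total > 0) {
            for (int i = 0; i < freq.length; i++) {
                freq[i] /= total;
            }
        }
        return freq;
    }

    // Register a reference histogram computed from caller-supplied training text.
    void addReference(String label, String trainingText) {
        references.put(label, histogram(trainingText));
    }

    // Return the label whose reference histogram is closest to the sequence's,
    // or null if no references have been registered.
    String guess(String seq) {
        double[] observed = histogram(seq);
        String best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, double[]> entry : references.entrySet()) {
            double[] reference = entry.getValue();
            double distance = 0.0;
            for (int i = 0; i < observed.length; i++) {
                double diff = observed[i] - reference[i];
                distance += diff * diff;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

In use, one would call addReference once per alphabet with a large training sample and then call guess on each unknown sequence. The result is still only a guess: very short sequences give noisy histograms, which is the same caveat raised earlier in the thread.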
Thomas.
_______________________________________________
Biojava-l mailing list - [EMAIL PROTECTED]
http://biojava.org/mailman/listinfo/biojava-l