Re: [EMBOSS] FASTA format appears to get misrecognised as GCG

Peter Rice Thu, 12 Mar 2015 01:50:54 -0700

Hi Jan,

On 11/03/2015 19:15, Jan Kim wrote:

Dear All,


I've just had "water" in EMBOSS 6.5.7.0 fail, and traced this back
to the regular expression "CHECK: [0-9].*\.\." matching the header
line of a FASTA file. The command

     water -asequence b.fasta  [...]  -auto

terminates with

     Warning: Sequence 'gcg::b.fasta:broken' has zero length, ignored
     Error: Unable to read sequence 'b.fasta'

As a minimal demo, any sequence with the header

     >broken CHECK: 0 ..

causes the problem, and expressly stating the format (via "fasta::b.fasta"
rather than just "b.fasta") fixes it.

My speculation at this point is that somehow matching the regexp mentioned
above causes the autodetection to identify the format as GCG rather than
as FASTA.

That is what I would expect. We test for GCG format first, which
requires scanning through a possibly long header looking for a checksum
line, and then testing whether we can read the file as GCG.
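
As a rough illustration of the match described above (a sketch using Python's `re` module and the pattern exactly as quoted in the report, not EMBOSS's own matching code), the 'broken' FASTA header does contain something shaped like a GCG checksum marker. The helper name `looks_like_gcg_check_line` is made up for this example.

```python
import re

# Pattern as quoted in the report; EMBOSS's internal regex handling may differ.
GCG_CHECK = re.compile(r"CHECK: [0-9].*\.\.")


def looks_like_gcg_check_line(line: str) -> bool:
    """Return True if the line contains text shaped like a GCG checksum marker."""
    return GCG_CHECK.search(line) is not None


if __name__ == "__main__":
    # The minimal header from the report, and an ordinary FASTA header.
    for header in (">broken CHECK: 0 ..", ">fine some description"):
        print(header, "->", looks_like_gcg_check_line(header))
```

Only the first header matches, which is consistent with the file being tried as GCG before FASTA.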

It is supposed to continue trying other formats if GCG format fails. We
can try variations on your 'broken' FASTA format and make sure EMBOSS
can read it as FASTA in future.

The problem arises from legacy interpretation of GCG format. GCG had a
program called 'reformat' that would correct the check: line after
editing, so EMBOSS tried to replicate this by reading even if no length
was found. I do not believe anyone is depending on this feature now, so
we can safely also check for a length: value and use that.
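
The stricter test described above could look roughly like the following sketch. It is not the EMBOSS implementation: the function name `gcg_declared_length`, both regular expressions, and the sample GCG-style line in the usage note are assumptions made for illustration. The idea is only that a line counts as a GCG header line when it carries a length: value as well as a check: value followed by "..".

```python
import re

# Hypothetical patterns for illustration; EMBOSS's real checks may differ.
CHECK_RE = re.compile(r"Check: *[0-9]+.*\.\.", re.IGNORECASE)
LENGTH_RE = re.compile(r"Length: *([0-9]+)", re.IGNORECASE)


def gcg_declared_length(line: str):
    """Return the declared length if the line has both check: and length:, else None."""
    if CHECK_RE.search(line) is None:
        return None  # no checksum marker, so not a GCG header line
    match = LENGTH_RE.search(line)
    if match is None:
        return None  # checksum marker without a length: do not treat as GCG
    return int(match.group(1))
```

With this sketch, `gcg_declared_length(">broken CHECK: 0 ..")` returns `None`, so format detection would move on to the next candidate, while an invented GCG-style line such as `"b.seq  Length: 120  Type: N  Check: 4321  .."` yields `120`.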

This doesn't exactly match my expectations based on the USA specs [1],
according to which EMBOSS expects FASTA by default and will try other
formats only if that doesn't work. (I have some inkling that this
default can be configured somewhere, but I haven't found anything
suspicious in /usr/local/share/EMBOSS and a quick scan didn't turn up
any stray .embossrc files either.)

Ah, perhaps we could rephrase that. If no format is specified, EMBOSS
tries all possible formats.

Even FASTA format is complicated - especially how EMBOSS reads the ID.
There are various versions of FASTA format where EMBOSS can read an
NCBI/Blast style ID ('ncbi' format) or use whatever is there without
trying to parse or clean it up ('pearson' format), which you have to
specify explicitly.

The default format can be configured by setting the environment variable
EMBOSS_FORMAT, but using fasta:: in the USA, or following it with
-sformat fasta (or -sf fasta), is the usual way.
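
For a script that calls EMBOSS, the explicit-format approach might be wrapped as in this minimal sketch. The helper names `with_format` and `water_args` are hypothetical, the remaining `water` arguments elided as "[...]" in the report are passed through untouched via `extra_args`, and the code only builds the argument list without running anything.

```python
def with_format(usa: str, fmt: str = "fasta") -> str:
    """Prefix a USA with an explicit format unless it already names one."""
    # A format prefix is written as 'format::', e.g. 'fasta::b.fasta'.
    return usa if "::" in usa else f"{fmt}::{usa}"


def water_args(asequence: str, extra_args: list) -> list:
    """Build an argument list for water with the format of -asequence stated."""
    return ["water", "-asequence", with_format(asequence), *extra_args, "-auto"]


if __name__ == "__main__":
    # extra_args stands in for the options elided in the original command.
    print(water_args("b.fasta", []))
```

Stating the format in the USA keeps the choice next to the file name, so autodetection is never consulted for that input.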

As a bit of background, this happened in an "embedded script", and the
regexp was right in the sense that stuff from a GCG (or similar) formatted
file had found its way into the FASTA header. I hope I fixed my script
now by expressly stating the format; this posting is to solicit comments
regarding whether I've done something wrong / stupid (and possibly to
leave some hints regarding this matter in the mailing list archives...).
[1] http://emboss.sourceforge.net/docs/themes/UniformSequenceAddress.html

Many thanks for pointing this out. It will be cleaned up in a future
version (I just tested the change) and we will revise the description on
the website.


regards,

Peter Rice
EMBOSS Team
_______________________________________________
EMBOSS mailing list
[email protected]
http://mailman.open-bio.org/mailman/listinfo/emboss
