Hi Jan,

On 11/03/2015 19:15, Jan Kim wrote:
Dear All,

I've just had "water" in EMBOSS 6.5.7.0 fail, and traced this back
to the regular expression "CHECK: [0-9].*\.\." matching the header
line of a FASTA file. The command

     water -asequence b.fasta  [...]  -auto

terminates with

     Warning: Sequence 'gcg::b.fasta:broken' has zero length, ignored
     Error: Unable to read sequence 'b.fasta'

As a minimal demo, any sequence with the header

     >broken CHECK: 0 ..

causes the problem, and expressly stating the format (via "fasta::b.fasta"
rather than just "b.fasta") fixes it.

My speculation at this point is that somehow matching the regexp mentioned
above causes the autodetection to identify the format as GCG rather than
as FASTA.

That is what I would expect. We test for GCG format first, which requires scanning through a possibly long header looking for a checksum line, and then testing whether we can read as GCG.

It is supposed to continue trying other formats if GCG format fails. We can try variations on your 'broken' FASTA format and make sure EMBOSS can read it as FASTA in future.

The problem arises from legacy interpretation of GCG format. GCG had a program called 'reformat' that would correct the check: line after editing, so EMBOSS tried to replicate this by reading the file even if no length was found. I do not believe anyone is depending on this feature now, so we can safely also check for a length: value and use that.
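
To make the idea concrete, here is a minimal Python sketch of the behaviour described above. It is not the EMBOSS source: the function names, the case-insensitive patterns and the require_length switch are illustrative assumptions, and the real GCG reader does considerably more. With require_length=False the sketch accepts any line that looks like a checksum line (the legacy behaviour); with require_length=True it also insists on a non-zero length: value on that line, so the 'broken' FASTA header falls through to the FASTA test.

```python
import re

# Assumed patterns, modelled on the "CHECK: [0-9].*\.\." expression quoted
# in the report; matching case-insensitively is an assumption of this sketch.
CHECK_RE = re.compile(r"check: *[0-9]+.*\.\.", re.IGNORECASE)
LENGTH_RE = re.compile(r"length: *([0-9]+)", re.IGNORECASE)


def looks_like_gcg(lines, require_length=True):
    """Return True if the first checksum-like line qualifies as a GCG header."""
    for line in lines:
        if CHECK_RE.search(line):
            if not require_length:
                # Legacy-style test: a checksum line alone is enough.
                return True
            # Stricter test: the same line must carry a non-zero length.
            match = LENGTH_RE.search(line)
            return match is not None and int(match.group(1)) > 0
    return False


def detect_format(text, require_length=True):
    """Try GCG first, then FASTA; return None if neither test passes."""
    lines = text.splitlines()
    if looks_like_gcg(lines, require_length):
        return "gcg"
    if lines and lines[0].startswith(">"):
        return "fasta"
    return None
```

Calling detect_format(">broken CHECK: 0 ..\nACGT\n", require_length=False) reports "gcg", which mirrors the misdetection in the report, while the default call on the same text reports "fasta".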

This doesn't exactly match my expectations based on the USA specs [1],
according to which EMBOSS expects FASTA by default and will try other
formats only if that doesn't work. (I have some inkling that this
default can be configured somewhere, but I haven't found anything
suspicious in /usr/local/share/EMBOSS and a quick scan didn't turn up
any stray .embossrc files either.)

Ah, perhaps we could rephrase that. If no format is specified, EMBOSS tries all possible formats.

Even FASTA format is complicated - especially how EMBOSS reads the ID. There are various versions of FASTA format: EMBOSS can read an NCBI/Blast style ID ('ncbi' format), or use whatever is there without trying to parse or clean it up ('pearson' format), which you have to specify explicitly.

The default format can be configured by setting the environment variable EMBOSS_FORMAT, but using fasta:: in the USA, or following it with -sformat fasta (or -sf fasta), is the usual way.
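
For an embedded script, the three options above could be wrapped in a small helper. The following Python sketch only builds the argument list and environment overrides; it does not run EMBOSS. The helper name water_invocation and its style parameter are hypothetical, and other_args stands for whatever options were elided as [...] in the original command, which are not reconstructed here.

```python
def water_invocation(asequence, other_args, fmt="fasta", style="usa"):
    """Build (argv, extra_env) for 'water' with the input format pinned.

    style="usa":       prefix the format in the USA, e.g. fasta::b.fasta
    style="qualifier": follow the sequence with -sformat <fmt>
    style="env":       leave the command alone and set EMBOSS_FORMAT
    """
    extra_env = {}
    seq_args = ["-asequence", asequence]
    if style == "usa":
        seq_args = ["-asequence", f"{fmt}::{asequence}"]
    elif style == "qualifier":
        # The qualifier is placed directly after the sequence it applies to.
        seq_args += ["-sformat", fmt]
    elif style == "env":
        extra_env["EMBOSS_FORMAT"] = fmt
    else:
        raise ValueError(f"unknown style: {style!r}")
    return ["water", *seq_args, *other_args, "-auto"], extra_env
```

The returned list can be passed to subprocess.run, with extra_env merged into a copy of os.environ when the environment-variable style is used.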

As a bit of background, this happened in an "embedded script", and the
regexp was right in the sense that stuff from a GCG (or similar) formatted
file had found its way into the FASTA header. I hope I fixed my script
now by expressly stating the format; this posting is to solicit comments
regarding whether I've done something wrong / stupid (and possibly to
leave some hints regarding this matter in the mailing list archives...).
[1] http://emboss.sourceforge.net/docs/themes/UniformSequenceAddress.html

Many thanks for pointing this out. It will be cleaned up in a future version (I just tested the change) and we will revise the description on the website.

regards,

Peter Rice
EMBOSS Team