[EMBOSS] seqret messes up on phylip input with duplicate sequence names

Jan Kim Wed, 06 Aug 2014 12:57:11 -0700

Dear All,

I've run into a strange problem while using seqret to convert from
phylip to fasta (the default) format. When a phylip file contains
multiple sequences with the same name, seqret either dumps core or
concatenates all sequences, including their names, into a single
"EMBOSS_001" sequence. Please see below for an example that is
reproducible for me with EMBOSS 6.4.0 (from an Ubuntu package) and
6.5.7 (compiled by myself).


I think the issue is not specific to seqret but rather an issue in the
sequence reading library. Perhaps some function decides that the input
isn't valid phylip when it encounters the duplicate name, and this
triggers falling back to reading the entire file as raw.

As a bit of context, I ran into this because fdnadist mysteriously
produced a 1 x 1 matrix with the row name "EMBOSS_001". It took me quite
a while to figure out that this was triggered by duplicate sequence names,
which I didn't expect to exist in the input. But if I'm allowed this
whinge, an error or warning such as "duplicate sequence name in phylip
input -- giving up" might have directed me to the root of the problem
more quickly. (I can still hope, though, that this email saves someone
else a bit of time hunting down a related issue.)
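
Until the library reports this itself, a pre-check along the following
lines can flag duplicate names before a file reaches an EMBOSS program.
This is only a sketch: phylip_dup_names is a hypothetical helper, not an
EMBOSS tool, and it assumes a strict, single-block phylip file in which
the name occupies the first 10 columns of every sequence line.

```shell
# Hypothetical helper (not part of EMBOSS): print every name that occurs
# more than once in a phylip file.
# Assumption: strict phylip layout, one block, name in columns 1-10.
phylip_dup_names() {
    awk 'NR > 1 && NF > 0 {            # skip the header line and blank lines
             name = substr($0, 1, 10)  # fixed-width name field
             sub(/ +$/, "", name)      # strip the space padding
             print name
         }' "$1" | sort | uniq -d      # uniq -d keeps only repeated names
}
```

Running "phylip_dup_names dnabroken.phy" on the broken file from the
example below should list the repeated name; empty output means no
duplicates were found under the assumptions above.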

Best regards, Jan

----- 8< --- reproducible example -----------------------------------------

$ # dnadist.phy is an example input copied from the fdnadist HTML documentation page
$ cat dnadist.phy
   5   13
Alpha     AACGTGGCCACAT
Beta      AAGGTCGCCACAC
Gamma     CAGTTCGCCACAA
Delta     GAGATTTCCGCCT
Epsilon   GAGATCTCCGCCC
$ seqret dnadist.phy -outseq stdout
Read and write (return) sequences
>Alpha
AACGTGGCCACAT
>Beta
AAGGTCGCCACAC
>Gamma
CAGTTCGCCACAA
>Delta
GAGATTTCCGCCT
>Epsilon
GAGATCTCCGCCC
$ # so if the input is good, the output is good too. Replace "Beta " with
$ # "Alpha" in a copy named dnabroken.phy, though, so that "Alpha" is a
$ # duplicate identifier, and...
$ seqret dnabroken.phy -outseq stdout
Read and write (return) sequences
>Epsilon
GAGATCTCCGCCC
Segmentation fault (core dumped)
$ # the core dump can be "fixed" by adding an empty line to the broken file:
$ echo >> dnabroken.phy 
$ seqret dnabroken.phy -outseq stdout
Read and write (return) sequences
>EMBOSS_001
AlphaAACGTGGCCACATAlphaAAGGTCGCCACACGammaCAGTTCGCCACAADeltaG
AGATTTCCGCCTEpsilonGAGATCTCCGCCC
$ echo $?
0
$ # so seqret has written something and hasn't complained, but the output
$ # is really garbage
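
The transcript does not show how dnabroken.phy was made. A minimal
sketch of the edit it describes, assuming the file layout shown above;
make_dup_phylip is a hypothetical helper name, and sed is just one way
to make the change:

```shell
# Hypothetical helper: copy a phylip file, renaming "Beta" to "Alpha".
# "Beta " (with one trailing space) and "Alpha" are both five characters,
# so the 10-column name field keeps its width.
make_dup_phylip() {
    sed 's/^Beta /Alpha/' "$1" > "$2"
}
```

Usage: make_dup_phylip dnadist.phy dnabroken.phy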

-- 
 +- Jan T. Kim -------------------------------------------------------+
 |             email: [email protected]                                |
 |             WWW:   http://www.jtkim.dreamhosters.com/              |
 *-----=<  hierarchical systems are for files, not for humans  >=-----*
_______________________________________________
EMBOSS mailing list
[email protected]
http://mailman.open-bio.org/mailman/listinfo/emboss
