Dear All,

I've run into a somewhat strange problem while using seqret to convert from phylip to fasta (the default) format. Essentially, when a phylip file contains multiple sequences with the same name, seqret either dumps core or concatenates all sequences, including their names, into a single "EMBOSS_001" sequence. Please see below for an example that is reproducible for me with EMBOSS 6.4.0 (from an Ubuntu package) and 6.5.7 (compiled by myself).

I think the issue is not specific to seqret but rather lies in the sequence reading library. Perhaps some function decides that the input isn't valid phylip when it encounters the duplicate name, and this triggers a fallback to reading the entire file as raw.

As a bit of context, I ran into this because fdnadist mysteriously produced a 1 x 1 matrix with the row name "EMBOSS_001". It took me quite a while to figure out that this was triggered by duplicate sequence names, which I didn't expect to exist in the input. But if I'm allowed this whinge, an error or warning such as "duplicate sequence name in phylip input -- giving up" might have directed me to the root of the problem more quickly. (I can still hope, though, that this email saves someone else a bit of time hunting down a related issue.)

Best regards, Jan

----- 8< --- reproducible example -----------------------------------------

$ # dnadist.phy is an example input copied from the fdnadist HTML documentation page
$ cat dnadist.phy
    5   13
Alpha     AACGTGGCCACAT
Beta      AAGGTCGCCACAC
Gamma     CAGTTCGCCACAA
Delta     GAGATTTCCGCCT
Epsilon   GAGATCTCCGCCC
$ seqret dnadist.phy -outseq stdout
Read and write (return) sequences
>Alpha
AACGTGGCCACAT
>Beta
AAGGTCGCCACAC
>Gamma
CAGTTCGCCACAA
>Delta
GAGATTTCCGCCT
>Epsilon
$ # so if the input is good, the output is good too. Replace "Beta " with
$ # "Alpha", though, so that "Alpha" is a duplicate identifier, and...
$ seqret dnabroken.phy -outseq stdout
Read and write (return) sequences
>Epsilon
GAGATCTCCGCCC
Segmentation fault (core dumped)
$ # the core dump can be "fixed" by adding an empty line to the broken file:
$ echo >> dnabroken.phy
$ seqret dnabroken.phy -outseq stdout
Read and write (return) sequences
>EMBOSS_001
AlphaAACGTGGCCACATAlphaAAGGTCGCCACACGammaCAGTTCGCCACAADeltaG
AGATTTCCGCCTEpsilonGAGATCTCCGCCC
$ echo $?
0
$ # so seqret has written something and hasn't complained, but the output
$ # is really garbage

--
 +- Jan T. Kim -------------------------------------------------------+
 |             email: [email protected]                                |
 |             WWW:   http://www.jtkim.dreamhosters.com/              |
 *-----=< hierarchical systems are for files, not for humans >=-----*
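
P.S. In case it is useful as a stopgap, here is a minimal Python sketch of the kind of pre-check I have in mind, to be run on a phylip file before handing it to seqret or fdnadist. It is not taken from EMBOSS and does not reflect how its reader works. The function name duplicate_phylip_names is made up for illustration, and the sketch assumes that the first non-empty line is the "ntax nchar" header, that each of the next ntax non-empty lines starts one sequence, and that a name is the first whitespace-delimited token on its line (so names containing spaces are not handled).

```python
from collections import Counter


def duplicate_phylip_names(text):
    """Return the sorted sequence names that occur more than once.

    Assumptions (illustrative, not EMBOSS behaviour): the first non-empty
    line is the "ntax nchar" header, the next ntax non-empty lines each
    start one sequence, and a name is the first whitespace-delimited token.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty phylip input")
    ntax = int(lines[0].split()[0])
    # Take the leading token of each of the ntax lines after the header.
    names = [line.split(None, 1)[0] for line in lines[1:1 + ntax]]
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


# The broken example from this email: "Beta " replaced by "Alpha".
broken = """\
    5   13
Alpha     AACGTGGCCACAT
Alpha     AAGGTCGCCACAC
Gamma     CAGTTCGCCACAA
Delta     GAGATTTCCGCCT
Epsilon   GAGATCTCCGCCC
"""

duplicates = duplicate_phylip_names(broken)
if duplicates:
    print("duplicate sequence name in phylip input:", ", ".join(duplicates))
```

For the broken file this reports "Alpha"; for the unmodified dnadist.phy it prints nothing.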
