Andres Pinzon wrote:
The output is correct, but notseq changes the definition in the fasta
headers, so if the fasta header in "xaa.list.fasta" was:
lcl|29855|ORF26673_6
the corresponding fasta header in sequence in 1000-1.fasta is:
29855
Is there a way to tell "notseq" to keep the original fasta headers intact?
Yes.
FASTA format is not simple. We have seen many ways to hide extra
information in the ID (EMBOSS recognizes NCBI ID formats and, in your
case, parses out the ID 29855) and also in the description (we try to
recognize conventions used by GCG and ACEDB).
But you can also specify "pearson" format which reads the ID without
parsing. Just add to the commandline:
notseq -sf pearson
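
To make the difference concrete, here is a rough Python sketch of the two
ways of reading the ID from your header. It is not EMBOSS code: the
function names are made up for illustration, and the "NCBI-style" branch
only handles the single lcl|...|... case from your example, on the
assumption that the second field is taken as the ID.

```python
def pearson_id(header: str) -> str:
    """Take the ID verbatim: the first whitespace-delimited token after '>'."""
    return header.lstrip(">").split()[0]


def ncbi_style_id(header: str) -> str:
    """Hypothetical stand-in for ID parsing; only handles 'lcl|id|...'."""
    token = pearson_id(header)
    fields = token.split("|")
    if len(fields) >= 2 and fields[0] == "lcl":
        # Assumption: the second field is used as the ID.
        return fields[1]
    return token


header = ">lcl|29855|ORF26673_6"
print(ncbi_style_id(header))  # 29855
print(pearson_id(header))     # lcl|29855|ORF26673_6
```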
Now you have another problem: with the full ID preserved, an exclude
string containing '|' will not work for notseq.
The exclude string in notseq is a pattern. In processing the pattern,
some pattern characters are removed:
whitespace
',' and ';'
'|'
So your exclude pattern cannot include any '|' characters.
As a workaround, you can exclude "*ORF26673_6" and the IDs will be
preserved.
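
The effect can be sketched in a few lines of Python. This is only an
illustration of the behaviour described above, not the notseq source: the
helper names are invented, the stripped characters are assumed to be simply
deleted from the pattern, and Python's fnmatch stands in for the simple
wildcard matching.

```python
from fnmatch import fnmatchcase

# Characters described above as being removed from the exclude pattern.
STRIPPED = set(" \t\n,;|")


def strip_pattern(pattern: str) -> str:
    """Drop the characters that are assumed to be removed from the pattern."""
    return "".join(ch for ch in pattern if ch not in STRIPPED)


def is_excluded(seq_id: str, pattern: str) -> bool:
    """Match the unmodified sequence ID against the stripped pattern."""
    return fnmatchcase(seq_id, strip_pattern(pattern))


seq_id = "lcl|29855|ORF26673_6"

# The full ID used as a pattern loses its '|' characters and no longer matches.
print(strip_pattern(seq_id))                # lcl29855ORF26673_6
print(is_excluded(seq_id, seq_id))          # False

# A wildcard pattern without '|' survives stripping and still matches.
print(is_excluded(seq_id, "*ORF26673_6"))   # True
```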
For the next release we will allow '|' characters. When notseq was first
written it was possible to use regular expressions, but now we only use
simple text matching, so the pipe characters are no longer a problem.
Hope that helps
Peter
_______________________________________________
EMBOSS mailing list
http://lists.open-bio.org/mailman/listinfo/emboss