Andres Pinzon wrote:
The output is correct, but notseq changes the definition in the fasta
headers, so if the fasta header in "xaa.list.fasta" was:

lcl|29855|ORF26673_6

the corresponding fasta header in 1000-1.fasta is:

29855

Is there a way to tell "notseq" to keep the original fasta headers intact?

Yes.

FASTA format is not simple. We have seen many ways to hide extra information in the ID (EMBOSS recognizes NCBI ID formats and parses out the ID 29855) and also in the description (we try to recognize the conventions used by GCG and ACEDB).

But you can also specify "pearson" format, which reads the ID without parsing. Just add to the command line:

notseq -sf pearson
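
To illustrate the difference between the two ways of reading the header, here is a rough Python sketch. It is not EMBOSS code: the function read_id and its fmt argument are hypothetical, and the only NCBI-style rule it models is the "lcl|ID|..." case from the example above.

```python
def read_id(header, fmt="fasta"):
    """Hypothetical model of ID extraction from a FASTA header line.

    fmt="pearson": return the first token unchanged.
    fmt="fasta":   assumed NCBI-style handling, modelling only 'lcl|ID|...'.
    """
    parts = header.lstrip(">").split(None, 1)
    token = parts[0] if parts else ""
    if fmt == "pearson":
        # No parsing: the whole first token is the ID.
        return token
    fields = token.split("|")
    if fields[0] == "lcl" and len(fields) >= 2:
        # Assumed rule: the field after 'lcl' is taken as the ID.
        return fields[1]
    return token


print(read_id("lcl|29855|ORF26673_6"))                 # 29855
print(read_id("lcl|29855|ORF26673_6", fmt="pearson"))  # lcl|29855|ORF26673_6
```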

Now you have another problem: on its own, this will not work for notseq, because the preserved ID contains '|' characters.

The exclude string in notseq is a pattern. In processing the pattern, some pattern characters are removed:

        whitespace
        ',' and ';'
        '|'

So your exclude pattern cannot include any '|' characters.

As a workaround, you can exclude "*ORF26673_6" and the IDs will be preserved.
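
The effect can be sketched in Python. This is only a model of the behaviour described above, not the notseq source: the helpers split_pattern and is_excluded are hypothetical, and it assumes that removing the listed characters amounts to splitting the exclude string into sub-patterns, each of which must wildcard-match the whole ID.

```python
import re
from fnmatch import fnmatchcase

# Characters removed from the pattern, per the list above:
# whitespace, ',', ';' and '|'.
_SEPARATORS = re.compile(r"[\s,;|]+")


def split_pattern(pattern):
    """Hypothetical helper: break an exclude string into sub-patterns."""
    return [p for p in _SEPARATORS.split(pattern) if p]


def is_excluded(seq_id, pattern):
    """Return True if any sub-pattern wildcard-matches the whole ID."""
    return any(fnmatchcase(seq_id, p) for p in split_pattern(pattern))


full_id = "lcl|29855|ORF26673_6"  # ID as read with pearson format

print(split_pattern("lcl|29855|ORF26673_6"))         # ['lcl', '29855', 'ORF26673_6']
print(is_excluded(full_id, "lcl|29855|ORF26673_6"))  # False
print(is_excluded(full_id, "*ORF26673_6"))           # True
```

In this model the full ID never matches its own pattern once the pipes are gone, while the wildcard form still matches because it contains none of the removed characters.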

For the next release we will allow '|' characters. When notseq was first written there was a possibility of using regular expressions, but now we only use simple text matching, so the pipe characters are not a problem.

Hope that helps

Peter

_______________________________________________
EMBOSS mailing list
http://lists.open-bio.org/mailman/listinfo/emboss