Re: [EMBOSS] notseq and fasta definition headers

Peter Rice Tue, 17 Jun 2008 13:30:31 -0700

Andres Pinzon wrote:

The output is correct, but notseq changes the definition in the fasta
headers, so if the fasta header in "xaa.list.fasta" was:


lcl|29855|ORF26673_6

the corresponding fasta header in sequence in 1000-1.fasta is:

29855

Is there a way to tell "notseq" to keep the original fasta headers intact?


Yes.

FASTA format is not simple ... we have seen many ways to hide extra information in the ID (EMBOSS recognizes NCBI id formats and parses out the ID 29855) and also in the description (we try to recognize conventions used by GCG and ACEDB).

But you can also specify "pearson" format, which reads the ID without parsing. Just add to the command line:


notseq -sf pearson
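
To make the difference concrete, here is a small Python sketch of the two ways of reading the ID. It is only an illustration of the behaviour described above for this one header, not the EMBOSS parser: the function name header_id and its sformat argument are made up for the example, and the only NCBI-style case it models is the three-field "lcl|...|..." form from this thread.

```python
def header_id(header, sformat="fasta"):
    """Return the sequence ID from a FASTA header line (illustrative only)."""
    # The ID is the first whitespace-delimited token after an optional '>'.
    token = header.lstrip(">").split()[0]

    if sformat == "pearson":
        # "pearson": take the token as-is, with no further parsing.
        return token

    # Assumption for this sketch: a three-field "lcl|id|name" token is
    # reduced to its middle field, as happened to the header in this thread.
    fields = token.split("|")
    if len(fields) == 3 and fields[0] == "lcl":
        return fields[1]
    return token


print(header_id(">lcl|29855|ORF26673_6"))             # 29855
print(header_id(">lcl|29855|ORF26673_6", "pearson"))  # lcl|29855|ORF26673_6
```

Real headers come in many more variants than this, which is why the default parsing exists in the first place.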

Now you have another problem: an exclude pattern containing the full ID will not work for notseq!

The exclude string in notseq is a pattern. In processing the pattern, some pattern characters are removed:


        whitespace
        ',' and ';'
        '|'

So your exclude pattern cannot include any '|' characters.

As a workaround, you can exclude "*ORF26673_6" and the IDs will be preserved.
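
The effect of the stripped characters, and why the wildcard workaround helps, can be sketched in Python as follows. This is a rough model under stated assumptions, not the notseq source: clean_pattern and is_excluded are hypothetical names, and Python's fnmatch stands in for the simple wildcard matching that notseq uses.

```python
from fnmatch import fnmatchcase

# Characters removed from the exclude pattern, as listed above:
# whitespace, ',', ';' and '|'.
STRIPPED = set(" \t\n,;|")


def clean_pattern(pattern):
    """Drop the characters that are removed from the exclude pattern."""
    return "".join(ch for ch in pattern if ch not in STRIPPED)


def is_excluded(seq_id, pattern):
    """Match an unparsed ID against the cleaned pattern ('*' is a wildcard)."""
    return fnmatchcase(seq_id, clean_pattern(pattern))


full_id = "lcl|29855|ORF26673_6"

print(clean_pattern(full_id))               # lcl29855ORF26673_6
print(is_excluded(full_id, full_id))        # False: the '|' characters are gone
print(is_excluded(full_id, "*ORF26673_6"))  # True: the workaround pattern
```

The full ID fails to match itself because the pattern has lost its '|' characters, while the wildcard pattern contains none of the stripped characters and so survives unchanged.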

For the next release we will allow '|' characters. When notseq was first written there was a possibility to use regular expressions, but now we only use simple text matching, so the pipe characters are not a problem.


Hope that helps

Peter

_______________________________________________
EMBOSS mailing list
[email protected]
http://lists.open-bio.org/mailman/listinfo/emboss
