john walshaw (JIC) wrote:
> I'm having trouble getting seqret to return the expected FASTA-header
> style when using the 'ncbi' output sequence format, when applying it to
> either the native UniProt data files or an EMBOSS database made from
> them.
> In the manual for seqret, in the section "Output Format...", this is the
> description of the "ncbi" style of FASTA format:
> ncbi multiple NCBI style FASTA format with the database name, entry
>    name and accession number separated by pipe ("|") characters.

This could be extended to explain that NCBI also has an annoyingly short list of valid database names. Any other name has to appear as "gnl|dbname", as you see for your uniprox database indexed with dbxflat. We use "unk" if we have no known database name, but we treat it as a general name; NCBI's "unk|identifier" is something special to them.

If you use one of the "NCBI list" database names, for example by adding "-sdbname sp" to the command line, you will get a Swiss-Prot NCBI standard identifier, but only because "sp" is one of their special names. You cannot even assume the data is protein if you see "sp" in the identifier (genpept for example uses emb and gb as database names for protein sequences).

> By the way, is there a way of making seqret return the same style header
> as WU-BLAST sp2fasta, i.e. >db|accno|id  ....  (instead of
> db|id|accno), or is this what the ncbi format is intended to do?

Hmmmm .... yet another FASTA format (and see below for another one). Yes, that looks like a good idea. We need an output name for it, perhaps wublast is the best choice.
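
Until such an output format exists, the reordering could be done as a post-processing step on seqret's output. The following is only a rough sketch in Python, not anything in EMBOSS: the function name swap_id_and_accession is made up for illustration, and it assumes the header has exactly three pipe-separated fields in the order db|id|accno.

```python
def swap_id_and_accession(header: str) -> str:
    """Turn '>db|id|accno description' into '>db|accno|id description'.

    Hypothetical helper: assumes exactly three pipe-separated fields.
    Headers with any other number of fields are returned unchanged.
    """
    name, sep, description = header.partition(" ")
    prefix = ">" if name.startswith(">") else ""
    fields = name.lstrip(">").split("|")
    if len(fields) != 3:
        return header
    db, entry_id, accession = fields
    return f"{prefix}{db}|{accession}|{entry_id}{sep}{description}"


if __name__ == "__main__":
    print(swap_id_and_accession(">sp|104K_THEAN|Q4U9M9 104 kDa microneme/rhoptry antigen"))
```

Leaving non-matching headers untouched means a name with a different number of fields passes through as it was rather than being mangled.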

You mentioned UniProt 14 - the latest release also includes extensions to the FASTA format description to tag species and other information. We are considering making this the default version of the FASTA format for EMBOSS so we can preserve more information - does this sound like a good idea?

For example: >sp|Q4U9M9|104K_THEAN 104 kDa microneme/rhoptry antigen OS=Theileria annulata GN=TA08425 PE=3 SV=1
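
To show what information such a header carries, here is a rough Python sketch that splits the example line into its parts. It is an illustration only, not EMBOSS code: the function name parse_uniprot_header is invented, and it recognises only the four tags visible in the example (OS, GN, PE, SV), assuming each appears as " TAG=value".

```python
import re

# Only the tags seen in the example header; treat this list as an assumption.
TAG = re.compile(r"\s(OS|GN|PE|SV)=")


def parse_uniprot_header(line: str) -> dict:
    """Split '>db|accno|id description OS=... GN=... PE=... SV=...' into a dict."""
    name, _, rest = line.lstrip(">").partition(" ")
    db, accession, entry_id = name.split("|")
    # re.split with a capture group yields [description, tag, value, tag, value, ...]
    parts = TAG.split(" " + rest)
    record = {
        "db": db,
        "accession": accession,
        "id": entry_id,
        "description": parts[0].strip(),
    }
    for key, value in zip(parts[1::2], parts[2::2]):
        record[key] = value.strip()
    return record


if __name__ == "__main__":
    header = (">sp|Q4U9M9|104K_THEAN 104 kDa microneme/rhoptry antigen "
              "OS=Theileria annulata GN=TA08425 PE=3 SV=1")
    for key, value in parse_uniprot_header(header).items():
        print(key, "=", value)
```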


Also on the subject of UniProt 14 - the .dat flat files have a new syntax for the DE lines. We had to ignore that as the change appeared just before EMBOSS 6.0.0. Is anyone interested in having the details parsed out, or in having the original friendly description generated?

ID   104K_THEAN              Reviewed;         893 AA.
AC   Q4U9M9;
DT   18-APR-2006, integrated into UniProtKB/Swiss-Prot.
DT   05-JUL-2005, sequence version 1.
DT   22-JUL-2008, entry version 18.
DE   RecName: Full=104 kDa microneme/rhoptry antigen;
DE   AltName: Full=p104;
DE   Flags: Precursor;
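
As a rough idea of what "parsing out the details" of these DE lines could look like, here is a Python sketch. It is not how EMBOSS handles them: the function name parse_de_lines is invented, and it covers only the three line types shown above (RecName and AltName with a Full= value, and Flags), ignoring anything else.

```python
def parse_de_lines(lines):
    """Collect RecName, AltName and Flags values from new-style DE lines.

    Hypothetical helper: handles only 'RecName: Full=...;',
    'AltName: Full=...;' and 'Flags: ...;' lines as in the excerpt.
    """
    result = {"RecName": None, "AltName": [], "Flags": []}
    for line in lines:
        if not line.startswith("DE"):
            continue
        body = line[2:].strip().rstrip(";")
        category, _, value = body.partition(":")
        value = value.strip()
        if category == "Flags":
            result["Flags"].extend(flag.strip() for flag in value.split(";"))
        elif value.startswith("Full="):
            full = value[len("Full="):]
            if category == "RecName":
                result["RecName"] = full
            elif category == "AltName":
                result["AltName"].append(full)
    return result


if __name__ == "__main__":
    entry = [
        "DE   RecName: Full=104 kDa microneme/rhoptry antigen;",
        "DE   AltName: Full=p104;",
        "DE   Flags: Precursor;",
    ]
    print(parse_de_lines(entry))
```

A friendly one-line description could then be assembled from these three fields in whatever style is preferred.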

Hope this helps, even if it adds some new questions!

Peter