Hi,

In the context of this thread I think it is worth pointing out that the CON entries in EMBL are available in expanded form (i.e. with the sequence) on the EBI FTP server at the following locations:

EMBL CONTIGS EXPANDED ftp://ftp.ebi.ac.uk/pub/databases/embl/expanded_con

EMBL ANNOTATED CON
ftp://ftp.ebi.ac.uk/pub/databases/embl/annotated_con

For comments and suggestions regarding these entries please contact:
http://www.ebi.ac.uk/embl/Contact/
http://www.ebi.ac.uk/support/ - Use subject 'EMBL'

R:)



Guy Bottu wrote:
Peter Rice wrote:
When reading a CON entry we need a database to use to read the true sequence and features.

If we are reading from a database we can add the information in the database definition.

How do we define a default to resolve EMBL CON entries?

Can we handle EMBL release and EMBL updates?

There are a number of practical issues :
- an entry with "join" information can come from a databank as well as from a file.
- EMBL and GenBank CON entries refer to segments in the same databank, but RefSeq refers to GenBank.
- a sequence presented to EMBOSS can be of CON or ANN type but already have a re-assembled sequence (depending on where it comes from).
- each site has its own DB entries in emboss.default, so code that explicitly says "search in embl" might not work.

So, IMHO :
- We need code for two cases : embl format (for EMBL,...) and GenBank format (for GenBank, RefSeq,...). The software must check whether the entry contains CO or CONTIG lines, respectively; looking for CON in the ID line is not reliable.
- for databank sequences : the DB entry in emboss.default should have a parameter that indicates in which databank to search for the segments. If a site has RefSeq and EMBL but no GenBank, then RefSeq could still use sequence information from EMBL. If there is no such parameter in the DB entry, EMBOSS could, for embl or genbank format entries, either search by default in the same databank or simply not attempt the assembly (which do you think is best ?).
- for "personal" sequences from files : this is more tricky. Maybe an associated or advanced parameter that says that if the input sequence is of "join" type, a databank or file must be used to retrieve the sequences, e.g. -sjoin=xxx or -join=xxx. If xxx is a databank, the segments can be retrieved using the standard method defined in emboss.default, and if xxx is a file it can be searched sequentially.
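
To illustrate the first point, here is a minimal Python sketch of the detection step. It is not EMBOSS code: the function name has_join_lines and the sample entries (including their accession numbers) are made up for illustration. It only assumes that EMBL-format entries carry the join information on lines whose two-letter line code is CO, and GenBank-format entries on lines starting with the CONTIG keyword.

```python
def has_join_lines(entry_text, fmt):
    """Return True if the entry carries "join" information.

    fmt is "embl" (look for CO lines) or "genbank" (look for CONTIG lines).
    The ID/LOCUS line is deliberately ignored, because an entry marked CON
    may already contain a re-assembled sequence.
    """
    if fmt == "embl":
        keyword = "CO"
    elif fmt == "genbank":
        keyword = "CONTIG"
    else:
        raise ValueError("unsupported format: %r" % fmt)

    for line in entry_text.splitlines():
        if not line.startswith(keyword):
            continue
        rest = line[len(keyword):]
        # The keyword must stand alone: followed by whitespace or end of
        # line, so that e.g. a GenBank COMMENT line is not taken for CONTIG.
        if rest == "" or rest[0].isspace():
            return True
    return False


# Hypothetical sample entries, heavily shortened.
embl_with_co = (
    "ID   XX000001; SV 1; linear; genomic DNA; CON; HUM; 100 BP.\n"
    "CO   join(XX000002.1:1..50,gap(10),XX000003.1:1..40)\n"
    "//\n"
)
embl_con_but_expanded = (
    "ID   XX000001; SV 1; linear; genomic DNA; CON; HUM; 100 BP.\n"
    "SQ   Sequence 100 BP;\n"
    "//\n"
)
genbank_with_contig = (
    "LOCUS       YY_000001     100 bp    DNA     linear   CON\n"
    "COMMENT     example only\n"
    "CONTIG      join(ZZ000001.1:1..100)\n"
    "//\n"
)

print(has_join_lines(embl_with_co, "embl"))
print(has_join_lines(embl_con_but_expanded, "embl"))
print(has_join_lines(genbank_with_contig, "genbank"))
```

The second sample shows why the ID line alone is not enough: it says CON, but there are no CO lines left to resolve.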

There are still some issues :
- the program entret is for retrieving entries as they are rather than for processing sequence information. Should entret also try the assembly or not ?
- feature information is another matter. Some entries have no or very poor feature information, but there are entries that have features that are different from those of the segment entries (this is certainly so for the ANN entries in EMBL and for RefSeq). How should we handle this ?


    Guy Bottu,
    BEN
_______________________________________________
EMBOSS mailing list
[email protected]
http://lists.open-bio.org/mailman/listinfo/emboss