Hi,
In the context of this thread I think it is worth pointing out that the
CON entries in EMBL exist in expanded form (i.e. with the sequence) on
the EBI ftp server in the following forms:
EMBL CONTIGS EXPANDED
ftp://ftp.ebi.ac.uk/pub/databases/embl/expanded_con
EMBL ANNOTATED CON
ftp://ftp.ebi.ac.uk/pub/databases/embl/annotated_con
For comments and suggestions regarding these entries please contact:
http://www.ebi.ac.uk/embl/Contact/
http://www.ebi.ac.uk/support/ - use subject 'EMBL'
R:)
Guy Bottu wrote:
Peter Rice wrote:
When reading a CON entry we need a database to use to read the true
sequence and features.
If we are reading from a database we can add the information in the
database definition.
How do we define a default to resolve EMBL CON entries?
Can we handle EMBL release and EMBL updates?
There are a number of practical issues :
- an entry with "join" information can come from a databank as well as
from a file.
- EMBL and GenBank CON entries refer to segments in the same databank,
but RefSeq refers to GenBank.
- a sequence presented to EMBOSS can be of CON or ANN type but already
have a re-assembled sequence (depending on where it comes from)
- each site has its own DB entries in emboss.default, so code that
explicitly says "search in embl" might not work
So, IMHO :
- We need code for two cases: EMBL format (for EMBL, ...) and GenBank
format (for GenBank, RefSeq, ...). The software must check whether the
entry contains CO lines (EMBL format) or CONTIG lines (GenBank format);
looking for CON in the ID line is not reliable.
- for databank sequences: the DB entry in emboss.default should have a
parameter that indicates in which databank to search for the segments.
If a site has RefSeq and EMBL but no GenBank, then RefSeq could still
use sequence information from EMBL. If there is no such parameter in the
DB entry, EMBOSS could, for embl or genbank format entries, either
search by default in the same databank or simply not attempt the
assembly (which do you think is best?).
- for "personal" sequences from files: this is trickier. Maybe an
associated or advanced parameter that says that, if the input sequence
is of "join" type, a given databank or file must be used to retrieve the
segments, e.g. -sjoin=xxx or -join=xxx. If xxx is a databank, the
segments can be retrieved using the standard method defined in
emboss.default, and if xxx is a file it can be searched sequentially.
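To make the three points above concrete, here is a minimal Python sketch
(not EMBOSS code, which is written in C). It detects a "join" entry from
its CO or CONTIG lines instead of the ID line, extracts the segment
references, and decides where to look the segments up. The function
names, the "segmentdb" key and the join_override argument are all
hypothetical; they only stand in for the proposed emboss.default
parameter and the suggested -sjoin/-join value. Falling back to the same
databank is just one of the two defaults discussed above.

```python
import os
import re

# One segment reference inside join(...), e.g. "AL358912.1:1..39187".
SEGMENT_RE = re.compile(r"([A-Za-z_0-9]+(?:\.\d+)?):(\d+)\.\.(\d+)")


def contig_location(entry_text):
    """Return the join(...) string from CO (EMBL format) or CONTIG
    (GenBank format) lines, or None if the entry has neither."""
    parts = []
    in_contig = False
    for line in entry_text.splitlines():
        if line.startswith("CO   "):
            parts.append(line[5:].strip())
        elif line.startswith("CONTIG"):
            in_contig = True
            parts.append(line[6:].strip())
        elif in_contig and line.startswith(" "):
            # indented continuation line of a GenBank CONTIG block
            parts.append(line.strip())
        else:
            in_contig = False
    return "".join(parts) or None


def segment_refs(location):
    """List (accession, start, end) for each segment; gaps are skipped."""
    return [(acc, int(start), int(end))
            for acc, start, end in SEGMENT_RE.findall(location)]


def segment_source(db_name, db_defs, join_override=None):
    """Decide where to search for segments.

    db_defs stands in for the parsed DB entries of emboss.default;
    "segmentdb" is a hypothetical key, not an existing attribute.
    """
    if join_override is not None:
        kind = "file" if os.path.isfile(join_override) else "db"
        return (kind, join_override)
    if db_name is None:
        return None  # personal file, no override: do not try the assembly
    # Assumed default: search in the same databank the entry came from.
    return ("db", db_defs.get(db_name, {}).get("segmentdb", db_name))
```

With a definition table such as {"refseq": {"segmentdb": "embl"}}, a
RefSeq entry would be assembled from EMBL, while a databank without the
key would be searched for its own segments.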
There are still some issues :
- the program entret is for retrieving entries as they are rather than
for processing sequence information. Should entret also try the assembly
or not?
- feature information is another matter. Some entries have no feature
information, or only very poor feature information, but there are
entries whose features differ from those of the segment entries (this is
certainly so for the ANN entries in EMBL and for RefSeq). How should we
handle this?
Guy Bottu,
BEN
_______________________________________________
EMBOSS mailing list
[email protected]
http://lists.open-bio.org/mailman/listinfo/emboss