Dear all,

I recently downloaded some GenBank-like files from the Ensembl web site (http://www.ensembl.org/index.html) and noticed that the format used there diverges slightly from what one gets from NCBI. In particular, the ACCESSION number is not valid according to the pattern matcher in class org.biojavax.bio.seq.io.GenbankFormat, so the files cannot be parsed using RichSequence.IOTools. This issue has been discussed on this list before, but the suggested solution was to use files from NCBI instead of those from Ensembl.

However, the files from Ensembl are important because they contain additional annotation that NCBI does not provide, for instance the feature "exon". The old parsers from the org.biojava.bio.seq.io package are able to read the files from this site. The resulting Sequence objects can be enriched afterwards and written to another GenBank file. However, this again produces a file that cannot be stored in a BioSQL database using Hibernate, because of the invalid accession number.

The next problem is that even the old parsers do not treat this "rich" information from the Ensembl files properly. The feature "exon" becomes "any" when the sequence is enriched and written to a new GenBank file. Hence the benefit of the Ensembl annotation is lost during parsing and conversion. By the way, Ensembl also offers EMBL-like files and other formats, which have the same problems as mentioned above.

On the other hand, no matter which parser in BioJavaX I look up in the API documentation, I always find a corresponding "Terms" class, whose documentation states that it "Implements some ...-specific terms", where the dots stand for the format in question (UniProt, GenBank, EMBL and so forth). None of these classes provides any setters or add-methods that would allow me to define a new term like "exon".
The structure of the parsers seems very sophisticated to me, and it is not easy to extend the parsers or term classes for one's own purposes.
Therefore, I would like to ask the following questions:
1. Is there a way to read in files downloaded from Ensembl using only the designated BioJavaX classes?
2. How can I extend the terms so that not only "SOME X-specific terms" are included, but some more? And how do I tell the parser to use and apply these terms? Or, more generally, can I somehow read in an ontology (for instance the GO), persist it in BioSQL and make use of the terms contained therein?
3. How can I persist a sequence from Ensembl in a BioSQL database using Hibernate even though Ensembl uses a different accession number format?
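
Regarding question 3, the only workaround I can think of so far is to rewrite the ACCESSION line before handing the file to the parser. The following is just a minimal sketch of that idea, not anything taken from BioJava: the class AccessionSanitizer and its method are hypothetical names of my own, and I am assuming that the offending characters are things like the colons in an Ensembl-style accession such as "chromosome:NCBI36:1:100:200:1". I have not verified that the rewritten value satisfies the pattern in GenbankFormat.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical pre-processing helper (not part of BioJava/BioJavaX).
 * It rewrites the ACCESSION line of a GenBank-like file line by line.
 */
public class AccessionSanitizer {

    // Matches an ACCESSION line: the keyword plus padding, then the value.
    private static final Pattern ACCESSION_LINE =
            Pattern.compile("^(ACCESSION\\s+)(.*)$");

    /** Returns the line unchanged unless it is an ACCESSION line. */
    public static String sanitizeLine(String line) {
        Matcher m = ACCESSION_LINE.matcher(line);
        if (!m.matches()) {
            return line;
        }
        // Assumption: characters other than letters, digits, '_' and '.'
        // are the ones that cause trouble, so they are replaced by '_'.
        String cleaned = m.group(2).trim().replaceAll("[^A-Za-z0-9_.]", "_");
        return m.group(1) + cleaned;
    }

    public static void main(String[] args) {
        // Example input in the style I assume Ensembl uses.
        System.out.println(
                sanitizeLine("ACCESSION   chromosome:NCBI36:1:100:200:1"));
        // Any other line is passed through untouched.
        System.out.println(sanitizeLine("LOCUS       1 101 bp DNA"));
    }
}
```

The obvious drawback is that this changes the identifier, so the stored sequence can no longer be matched directly against the Ensembl record. That is why I would prefer a solution within BioJavaX itself.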
I am grateful for any answers.

Cheers
Andreas
--
Dipl.-Bioinform. Andreas Dräger
Center for Bioinformatics Tübingen (ZBIT), Lehrstuhl Rechnerarchitektur
Sand 1, 72076 Tübingen, Baden-Württemberg, Germany
Tel.: +49-7071-70436, Fax: +49-7071-5091
http://www-ra.informatik.uni-tuebingen.de/mitarb/draeger/
