[Biojava-l] Problem while parsing GenBank-like files and persisting them using Hibernate

Andreas Dräger Thu, 17 Jul 2008 04:51:12 -0700

Dear all,

Recently I downloaded some GenBank-like files from the Ensembl web site (http://www.ensembl.org/index.html) and noticed that the format used on this site diverges slightly from what one gets from NCBI. In particular, the ACCESSION number is not valid according to the pattern matcher in class org.biojavax.bio.seq.io.GenbankFormat, and the files can thus not be parsed using RichSequence.IOTools.

This issue has been discussed on this list before, but the suggested solution was to use files from NCBI instead of those from Ensembl. However, the files from Ensembl are important because they contain additional annotation not provided by NCBI, for instance the feature "exon".

The old parsers from the biojava.seq.io package are able to read the files from this site. The Sequence objects can be enriched afterwards and written to another GenBank file. However, this again results in a file that cannot be stored in a BioSQL database using Hibernate, because of the invalid accession number. The next problem is that even the old parsers do not treat this "rich" information from the Ensembl files properly. The feature "exon" becomes "any" when the sequence is enriched and written to a new GenBank file. Hence the benefit of the Ensembl annotation gets lost during parsing and conversion. By the way, Ensembl also offers EMBL-like files and other formats, which have the same problems as mentioned above.

On the other hand, no matter which parser in BioJavaX I look up in the API documentation, I always find a corresponding "Term" class, which states that this class "Implements some ...-specific terms", where the dots stand for the format in question, such as UniProt, GenBank, EMBL and so forth. None of these Term classes provides any setters or add-methods that would allow one to define a new term like "exon". The structure of the parsers seems very sophisticated to me, and it is not easy to extend the parsers or Term classes for one's own purposes.

Therefore, I would like to ask the following questions:

1. Is there a way to read in files downloaded from Ensembl using only the designated BioJavaX classes?

2. How can I extend the terms so that not only "SOME X-specific terms" are included, but some more? And how do I tell the parser to use and apply these terms? Or more generally, can I somehow read in an ontology (for instance the GO), persist it in BioSQL and make use of the terms contained therein?

3. How can I persist a sequence from Ensembl within a BioSQL database using Hibernate even though they use different accession numbers?
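
To make question 3 more concrete, here is a minimal sketch of a possible pre-processing workaround that rewrites the ACCESSION line before the file is handed to a parser. It uses only the Java standard library and no BioJava calls. The class name AccessionSanitizer, the sample accession string and the choice of allowed characters (letters, digits, underscore) are assumptions for illustration only; whether the rewritten accession actually satisfies the pattern in GenbankFormat has not been checked, and the original accession is lost unless it is stored elsewhere.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical helper, not part of BioJava: cleans up ACCESSION lines. */
public class AccessionSanitizer {

    // Matches a line starting with ACCESSION and captures the spacing,
    // the first accession token, and anything that follows it.
    private static final Pattern ACCESSION_LINE =
            Pattern.compile("^ACCESSION(\\s+)(\\S+)(.*)$");

    /** Replaces every character outside [A-Za-z0-9_] in the accession with '_'. */
    static String sanitizeLine(String line) {
        Matcher m = ACCESSION_LINE.matcher(line);
        if (!m.matches()) {
            return line; // all other lines pass through unchanged
        }
        // Assumption: letters, digits and '_' are acceptable to the parser.
        String cleaned = m.group(2).replaceAll("[^A-Za-z0-9_]", "_");
        return "ACCESSION" + m.group(1) + cleaned + m.group(3);
    }

    /** Applies sanitizeLine to every line of a GenBank-like record. */
    static List<String> sanitize(List<String> lines) {
        return lines.stream()
                .map(AccessionSanitizer::sanitizeLine)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up record fragment; the accession value is illustrative only.
        List<String> record = Arrays.asList(
                "LOCUS       example",
                "ACCESSION   chromosome:NCBI36:1:100:200:1",
                "FEATURES             Location/Qualifiers");
        sanitize(record).forEach(System.out::println);
    }
}
```

The sanitized lines could then be written to a temporary file or wrapped in a BufferedReader and passed on to the parser of choice. This only addresses the accession problem; it does nothing about the "exon" feature being turned into "any".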

I am grateful for any answers.


Cheers
Andreas


_______________________________________________
Biojava-l mailing list  -  [email protected]
http://lists.open-bio.org/mailman/listinfo/biojava-l
