Hello people!
Mark is away so I'm taking the liberty of sneaking this one out... :)
I've cross-posted this to both BioJava and BioSQL as much of what is new in
BioJavaX will probably be of interest to BioSQL users too.
We've been doing a lot of work recently on creating some extensions to BioJava
called BioJavaX. Primarily the purpose of these extensions is to provide better
interaction with BioSQL databases, which has been achieved using Hibernate
(www.hibernate.org). You can now fully interact with every column of every
table in BioSQL, using Hibernate's own HQL language to construct queries that
result in sets of BioJavaX objects. Selects, inserts, updates, primary key
assignment, foreign key relations, and deletes are all handled transparently by
Hibernate, removing the need for any SQL at all to be included in BioJavaX.
As a side effect of constructing a Hibernate-compatible extension to the
BioJava object model, we were required to define objects that hold much more
detailed information about themselves. For instance, a Sequence object cannot
tell you which namespace it lives in within the BioSQL database, but our
extension to it, RichSequence, can. As RichSequence extends Sequence rather than
replacing it, you can use the new objects with your existing code without the
hassle of casting them.
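
To illustrate why no casting is needed, here is a rough sketch. The nested Sequence and RichSequence interfaces are simplified stand-ins invented for this example (as are the rich() helper and the describe() method), not the real BioJava/BioJavaX types, which carry far more methods. It only shows the extends relationship at work.

```java
// Simplified stand-ins only: these are NOT the real BioJava/BioJavaX interfaces.
class RichSequenceSketch {

    /** Stand-in for the old-style sequence type. */
    interface Sequence {
        String getName();
        String seqString();
    }

    /** Stand-in for the extension: adds detail, keeps everything Sequence offers. */
    interface RichSequence extends Sequence {
        String getNamespace(); // simplified: returns a plain String here
    }

    /** "Existing code", written against the old type only. */
    static String describe(Sequence seq) {
        return seq.getName() + " (" + seq.seqString().length() + " residues)";
    }

    /** Hypothetical helper that builds a rich sequence for the demo. */
    static RichSequence rich(final String namespace, final String name, final String residues) {
        return new RichSequence() {
            public String getNamespace() { return namespace; }
            public String getName() { return name; }
            public String seqString() { return residues; }
        };
    }

    public static void main(String[] args) {
        RichSequence rs = rich("myLab", "seq1", "ACGT");
        // No cast needed: a RichSequence is already a Sequence.
        System.out.println(describe(rs));
        // The extra detail is only reachable through the richer type.
        System.out.println(rs.getNamespace());
    }
}
```

Going the other way (treating a plain Sequence as a RichSequence) is the case that would need conversion.
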
To be able to load information from files into these new RichSequence objects
in a meaningful way, we had to create a more detailed SeqIOListener, called
RichSeqIOListener. Then, we had to create new file parsers for the common file
formats which were able to extract more detailed information than before in
order to satisfy the RichSeqIOListener.
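
The relationship between a parser and its listener can be sketched roughly as below. Everything here is a toy made up for illustration: BasicListener, RichListener, Recorder and parse() are hypothetical names, and the ">name description" header layout is an assumption, not the header format the BioJavaX FASTA parser expects. The point is simply that a richer listener has more callbacks, so the parser has to dig more out of the file to feed it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy listener-driven parser; all names are hypothetical, not the BioJavaX API.
class ListenerSketch {

    /** Stand-in for a basic listener: only a name and the residues. */
    interface BasicListener {
        void setName(String name);
        void addSymbols(String residues);
    }

    /** Stand-in for a richer listener: wants the description as well. */
    interface RichListener extends BasicListener {
        void setDescription(String description);
    }

    /** Parses FASTA-like lines, assuming headers look like ">name description". */
    static void parse(List<String> lines, RichListener listener) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            if (line.startsWith(">")) {
                String header = line.substring(1).trim();
                int space = header.indexOf(' ');
                String name = space < 0 ? header : header.substring(0, space);
                String description = space < 0 ? "" : header.substring(space + 1).trim();
                listener.setName(name);
                // The extra callback is what forces the parser to extract more detail.
                listener.setDescription(description);
            } else {
                listener.addSymbols(line);
            }
        }
    }

    /** Listener that just records the events it receives, in order. */
    static class Recorder implements RichListener {
        final List<String> events = new ArrayList<String>();
        public void setName(String name) { events.add("name=" + name); }
        public void setDescription(String description) { events.add("description=" + description); }
        public void addSymbols(String residues) { events.add("symbols=" + residues); }
    }

    public static void main(String[] args) {
        Recorder recorder = new Recorder();
        parse(Arrays.asList(">seq1 a test sequence", "ACGT", "TTGA"), recorder);
        for (String event : recorder.events) {
            System.out.println(event);
        }
    }
}
```
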
It's pretty safe to say that the file parsers in BioJavaX are leagues ahead of
the existing ones in BioJava, even if I do say so myself. :P The downside of
this extra detail though is that the parsers are much more sensitive and will
not play well at all with incomplete or incorrectly formed files. If someone
can edit them to be less sensitive whilst still retaining the level of detail
required, that'd be great.
We've included parsers for FASTA, GenBank, EMBL, UniProt, INSDseq, EMBLxml,
UniProtXML, and an extra one for parsing NCBI Taxonomy data.
Do note that BioJavaX cannot fully convert sequences created using the old
BioJava model into the new BioJavaX model. It'll do its best, but the
RichSequence object you'll end up with will have lots of properties set to null
and a tonne of annotations instead, pretty much the same as the original
Sequence object I suppose. So it's best to try to avoid conversions and deal
with RichSequence objects from the ground up. This is particularly important to
consider when converting a BioSQL database previously used with BioJava into
one for use with BioJavaX. You'll also find that if you pass a converted
old-style Sequence object to one of the new file parsers for writing, it may
fail or produce output with lots of missing fields, as it will not find the
information it is looking for in the places it expects.
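
A rough sketch of why converted objects end up half-empty. OldSequence, RichRecord, enrich() and accessionLine() are hypothetical names invented for this example, the accession value is made up, and none of this is the actual BioJavaX conversion code. It only shows the general problem: free-form annotations get carried over as-is, the typed properties stay null, and a writer that looks at the typed properties finds nothing.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only; these classes are hypothetical, not BioJava/BioJavaX types.
class ConversionSketch {

    /** Stand-in for an old-style sequence: a name plus free-form annotations. */
    static class OldSequence {
        final String name;
        final Map<String, String> annotations = new LinkedHashMap<String, String>();
        OldSequence(String name) { this.name = name; }
    }

    /** Stand-in for a rich object: typed properties plus generic notes. */
    static class RichRecord {
        String name;
        String accession; // typed property a writer would look at
        Integer version;  // typed property a writer would look at
        final Map<String, String> notes = new LinkedHashMap<String, String>();
    }

    /** Best-effort conversion: copies what it can, interprets nothing. */
    static RichRecord enrich(OldSequence old) {
        RichRecord rich = new RichRecord();
        rich.name = old.name;
        // Annotations are carried over verbatim; the typed fields stay null.
        rich.notes.putAll(old.annotations);
        return rich;
    }

    /** Toy writer that reads only the typed property, never the notes. */
    static String accessionLine(RichRecord rich) {
        return "ACCESSION=" + (rich.accession == null ? "" : rich.accession);
    }

    public static void main(String[] args) {
        OldSequence old = new OldSequence("seq1");
        old.annotations.put("ACCESSION", "X00001"); // made-up value
        RichRecord converted = enrich(old);
        System.out.println(accessionLine(converted)); // field comes out empty
        System.out.println(converted.notes);          // the value is still in the notes
    }
}
```
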
The whole lot is specifically designed to mimic and be compatible with BioSQL,
but you don't need to have a BioSQL database to use it. Everything is
standalone and will work just fine without a backing data source. Also there is
no reason why you couldn't create a new set of Hibernate mappings that map the
BioJavaX object model to some other relational database schema of your choice.
The upshot of it all is the org.biojavax package, which you can find in the
biojava-live branch in CVS. Development is pretty much complete, and it now
needs some serious testing.
We need volunteers to:
a) test the BioSQL interaction via Hibernate with the various database
flavours supported (HSQL, Oracle, MySQL, PostgreSQL)
b) test the various file formats, particularly looking for special-case
exceptions which the parsers may not be aware of yet
c) do some load-testing and help us find ways to improve it if it turns
out to be too slow when under pressure
Documentation of the new features can be found in DocBook XML format in
docs/docbook/BioJavaX.xml in the biojava-live branch of CVS. It's as detailed
as I could make it without getting bored to death writing it. I've never been
the world's best documentation writer, so if anyone would like to help improve
it you're more than welcome.
Our plan is to make all this an official part of BioJava come the 1.5 release,
whenever that may be. For now though it is very very much a testing-stage
thing, not even an alpha release.
Questions on a postcard to either Mark or myself. Feedback most welcome.
cheers,
Richard
Richard Holland
Bioinformatics Specialist
Genome Institute of Singapore
60 Biopolis Street, #02-01 Genome, Singapore 138672
Tel: (65) 6478 8000 DID: (65) 6478 8199