[Biojava-l] BioJavaX ready for testing

Richard HOLLAND Tue, 01 Nov 2005 13:49:23 -0800

Hello people!

Mark is away so I'm taking the liberty of sneaking this one out... :)


I've cross-posted this to both BioJava and BioSQL as much of what is new in 
BioJavaX will probably be of interest to BioSQL users too.

We've been doing a lot of work recently on creating some extensions to BioJava 
called BioJavaX. Primarily the purpose of these extensions is to provide better 
interaction with BioSQL databases, which has been achieved using Hibernate 
(www.hibernate.org). You can now fully interact with every column of every 
table in BioSQL, using Hibernate's own query language, HQL, to construct queries 
that return sets of BioJavaX objects. Selects, inserts, updates, primary key 
assignment, foreign key relations, and deletes are all handled transparently by 
Hibernate, so BioJavaX itself does not need to contain any SQL at all.

As a side effect of constructing a Hibernate-compatible extension to the 
BioJava object model, we were required to define objects that hold much more 
detailed information about themselves. For instance, a Sequence object cannot 
tell you which namespace it belongs to in the BioSQL database, but our extension 
to it, RichSequence, can. As RichSequence extends Sequence rather than replacing 
it, you can pass the new objects to your existing code without any casting 
hassle.
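
To illustrate the "extends, doesn't replace" point, here is a minimal sketch. 
The types in it (SimpleSequence, SimpleRichSequence, DemoRichSequence) are 
hypothetical stand-ins invented for this example, not the real BioJava or 
BioJavaX interfaces, which have many more methods. It only shows why code 
written against the base type accepts the richer type without a cast.

```java
// Hypothetical stand-in for the base type; NOT the real BioJava Sequence.
interface SimpleSequence {
    String getName();
    String seqString();
}

// Hypothetical stand-in for the richer type: it adds information
// but is still a SimpleSequence.
interface SimpleRichSequence extends SimpleSequence {
    String getNamespace();
}

final class DemoRichSequence implements SimpleRichSequence {
    private final String namespace;
    private final String name;
    private final String residues;

    DemoRichSequence(String namespace, String name, String residues) {
        this.namespace = namespace;
        this.name = name;
        this.residues = residues;
    }

    public String getNamespace() { return namespace; }
    public String getName() { return name; }
    public String seqString() { return residues; }
}

class ExtensionDemo {
    // "Existing code": written against the base type only.
    static String describe(SimpleSequence seq) {
        return seq.getName() + " (" + seq.seqString().length() + " residues)";
    }

    public static void main(String[] args) {
        SimpleRichSequence rich = new DemoRichSequence("lcl", "seq1", "ACGT");
        // No cast needed: a SimpleRichSequence is a SimpleSequence.
        System.out.println(describe(rich));        // prints "seq1 (4 residues)"
        // The extra detail is only available through the richer type.
        System.out.println(rich.getNamespace());   // prints "lcl"
    }
}
```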

To be able to load information from files into these new RichSequence objects 
in a meaningful way, we had to create a more detailed SeqIOListener, called 
RichSeqIOListener. Then, we had to create new file parsers for the common file 
formats which were able to extract more detailed information than before in 
order to satisfy the RichSeqIOListener. 
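
The listener arrangement can be sketched roughly as follows. Everything here 
is an assumption made for illustration: BasicListener, DetailedListener, the 
toy "name|accession|description" header layout and the method names are all 
invented, and are not the real SeqIOListener or RichSeqIOListener API. The 
idea is simply that a parser feeding a more detailed listener has to extract 
more fields, and so has more ways to reject a record that lacks them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified callback interface; NOT the real SeqIOListener.
interface BasicListener {
    void setName(String name);
    void addSymbols(String residues);
}

// Hypothetical richer listener that expects extra fields from the parser.
interface DetailedListener extends BasicListener {
    void setAccession(String accession);
    void setDescription(String description);
}

// Records every callback it receives, in order.
class RecordingListener implements DetailedListener {
    final List<String> events = new ArrayList<>();

    public void setName(String name) { events.add("name=" + name); }
    public void setAccession(String accession) { events.add("accession=" + accession); }
    public void setDescription(String description) { events.add("description=" + description); }
    public void addSymbols(String residues) { events.add("symbols=" + residues); }
}

class ListenerDemo {
    // Toy parser for an invented header layout: name|accession|description
    static void parse(String header, String residues, DetailedListener listener) {
        String[] parts = header.split("\\|", 3);
        if (parts.length < 3) {
            // The detailed listener needs all three fields, so an
            // incomplete header is rejected rather than guessed at.
            throw new IllegalArgumentException("incomplete header: " + header);
        }
        listener.setName(parts[0]);
        listener.setAccession(parts[1]);
        listener.setDescription(parts[2]);
        listener.addSymbols(residues);
    }

    public static void main(String[] args) {
        RecordingListener listener = new RecordingListener();
        parse("seq1|AB000001|test record", "ACGT", listener);
        System.out.println(listener.events);
    }
}
```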

It's pretty safe to say that the file parsers in BioJavaX are leagues ahead of 
the existing ones in BioJava, even if I do say so myself. :P The downside of 
this extra detail though is that the parsers are much more sensitive and will 
not play well at all with incomplete or incorrectly formed files. If someone 
can edit them to be less sensitive whilst still retaining the level of detail 
required, that'd be great.

We've included parsers for FASTA, GenBank, EMBL, UniProt, INSDseq, EMBLxml, 
UniProtXML, and an extra one for parsing NCBI Taxonomy data.

Do note that BioJavaX cannot fully convert sequences created using the old 
BioJava model into the new BioJavaX model. It'll do its best, but the 
RichSequence object you'll end up with will have lots of properties set to null 
and a tonne of annotations instead, pretty much the same as the original 
Sequence object I suppose. So it's best to try to avoid conversions and deal 
with RichSequence objects from the ground up. This is particularly important to 
consider when converting a BioSQL database previously used with BioJava into 
one for use with BioJavaX. You'll also find that if you pass a converted 
old-style Sequence object to one of the new file parsers for writing, it may 
fail or produce output with lots of missing fields, as it will not find the 
information it is looking for in the places it expects. 
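
As a rough sketch of why converted objects behave this way (OldSeq, RichSeq, 
convert and missingFields are hypothetical names made up for this example, 
not BioJavaX classes or methods): the old model keeps most details as 
free-form annotations, so a conversion can only copy them across as-is, and 
the typed fields that a writer looks for stay null.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ConversionDemo {
    // Hypothetical old-style object: beyond the name, everything is an annotation.
    static final class OldSeq {
        final String name;
        final Map<String, String> annotations;

        OldSeq(String name, Map<String, String> annotations) {
            this.name = name;
            this.annotations = annotations;
        }
    }

    // Hypothetical new-style object with dedicated, typed fields.
    static final class RichSeq {
        String name;
        String description;   // stays null after conversion
        String division;      // stays null after conversion
        final Map<String, String> notes = new LinkedHashMap<>();
    }

    // Best-effort conversion: the annotations are copied verbatim, because
    // nothing says which free-form key corresponds to which typed field.
    static RichSeq convert(OldSeq old) {
        RichSeq rich = new RichSeq();
        rich.name = old.name;
        rich.notes.putAll(old.annotations);
        return rich;
    }

    // What a writer would find missing when it looks in the typed fields.
    static List<String> missingFields(RichSeq rich) {
        List<String> missing = new ArrayList<>();
        if (rich.description == null) missing.add("description");
        if (rich.division == null) missing.add("division");
        return missing;
    }

    public static void main(String[] args) {
        Map<String, String> annotations = new LinkedHashMap<>();
        annotations.put("DESCRIPTION", "test record");
        RichSeq rich = convert(new OldSeq("seq1", annotations));
        System.out.println("notes: " + rich.notes);
        System.out.println("missing: " + missingFields(rich));
    }
}
```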

The whole lot is specifically designed to mimic and be compatible with BioSQL, 
but you don't need to have a BioSQL database to use it. Everything is 
standalone and will work just fine without a backing data source. Also there is 
no reason why you couldn't create a new set of Hibernate mappings that map the 
BioJavaX object model to some other relational database schema of your choice.

The upshot of it all is the org.biojavax package, which you can find in the 
biojava-live branch on CVS. Development is pretty much complete, and it now 
needs some serious testing.

We need volunteers to:

        a) test the BioSQL interaction via Hibernate with the various database 
flavours supported (HSQL, Oracle, MySQL, PostgreSQL)
        b) test the various file formats, particularly looking for special-case 
exceptions which the parsers may not be aware of yet
        c) do some load-testing and help us find ways to improve it if it turns 
out to be too slow when under pressure

Documentation of the new features can be found in DocBook XML format in 
docs/docbook/BioJavaX.xml in the biojava-live branch of CVS. It's as detailed 
as I could make it without getting bored to death writing it. I've never been 
the world's best documentation writer, so if anyone would like to help improve 
it you're more than welcome.

Our plan is to make all this an official part of BioJava come the 1.5 release, 
whenever that may be. For now though it is very very much a testing-stage 
thing, not even an alpha release.

Questions on a postcard to either Mark or myself. Feedback most welcome.

cheers,
Richard


Richard Holland
Bioinformatics Specialist
Genome Institute of Singapore
60 Biopolis Street, #02-01 Genome, Singapore 138672
Tel: (65) 6478 8000   DID: (65) 6478 8199
Email: [EMAIL PROTECTED]


