BioHackathon

Over the last 6 weeks we have held the first "hackathon" where developers of the open-bio projects met. The hackathon was split over two sessions, the first at the O'Reilly Bioinformatics Technology Conference in Arizona and the second in Cape Town, South Africa, organised by Electric Genetics. As well as this key support from O'Reilly and Electric Genetics, the hackathon was additionally sponsored by Astra Zeneca and Dalke Scientific. All the code generated was immediately committed to the publicly accessible cvs system on open-bio (instructions at http://cvs.open-bio.org/).

The hackathon drew together 20 developers across a number of different open source projects. Our aim was to develop an infrastructure for accessing sequence databases transparently that scales from a small single computer in a molecular biology lab to a large scale pipeline project. This infrastructure can be transparently shared between the different language projects - e.g. building a sequence database in BioPerl but accessing it from BioJava. The hope is that we can both reduce the time it takes to build and test applications in different languages and, at the same time, reduce the overhead in managing and deploying sequence databases in bioinformatics installations.

Aware of the need for snazzy acronyms for standards to allow people to dazzle their managers/sales force/bosses, we have named this the "Open Bioinformatics Database Access" scheme (OBDA for short).

We settled on a standard set of 6 implementations to retrieve sequences, differing in their complexity, network requirements and throughput. In all cases we took an existing system from an open source project, and wherever possible we followed existing standards.
Having discussed the specifications of these methods, we then implemented the system in 5 languages - Perl, Java, Python, Ruby and C (not all languages got all implementations due to limited programming time, but Perl, Java and Python had a full suite). The implementations were then tested between different languages to check programmatic and data transfer capabilities. Finally the different methods were performance tested and a number of performance bottlenecks removed. There are more technical details at the end of this mail and a list of what each participant achieved.

At the same time a number of other projects were advanced. A framework for Bibliographic objects was discussed and Perl and Java code provided. The Genquire Perl GUI was adapted to work on top of aspects of the OBDA system. Bio::Graphics, a GIF drawing system for Perl, was integrated into BioPerl. The OmniGene project became more plug-and-play with BioJava.

One important corollary of our work was strengthening the common conceptual view of our data. For the last five years all the projects have by and large been sticking to a common core of EMBL/GenBank format information in their data model. It was unclear how to extend this model into other areas without losing cross-project interoperability. The requirement of all projects to read and write to a relational database (BioSQL) forced us to re-examine our common data model away from the perspective of a data format. The result was in fact closer cooperation and a clearer understanding of how to extend our data models in a cross-project compatible manner. In particular we have decided to make ontology integration an explicit option for our information, allowing more flexibility and richness in describing the additional data attached to sequences.

Finally, we had fun. Some of that fun was deliberately scheduled, such as the trip to the fast-food Mexican joint "chuys" in Tucson, where we acquired a stuffed toy (which became our mascot).
South Africa was a real eye opener for us, with incredible scenery, lovely people and real attention to detail from our hosts, Electric Genetics. But we are also hackers, and all of us got a kick out of simply being able to work together with few distractions and an open 802.11b network. Having a turnaround time of minutes in a Q/A session, rather than potentially days when people are working via email in different time zones, was sensational.

All the projects, and open-bio in general, were strengthened immeasurably by the hackathon. We'd like to thank our organisers (Electric Genetics and O'Reilly) and sponsors (Astra Zeneca and Dalke Scientific), and in general the support from open-bio over this time.

Person by Person report.
------------------------

Michele Clamp: Perl flatfile indexing works quite fast - into Ensembl production in next month.
Heikki Levashilo: BioFetch has been taking more work than expected. Server side has one outstanding bug in error reports; BioPerl implementation keeps changing (more generic) to use with wide variety of dbs. RefSeq, SP, EMBL are in.
James Cuff: testing arm; overview of all different language projects. Scalability and information transfer testing.
Steve Searle: C Berkeley DB & flatfile connectors.
Lincoln Stein: Berkeley DB/Perl implementation; very fast but not as fast as C. Bio::Graphics into BioPerl. File caching in BioPerl.
Martin Senger: Tie-ins from some languages to do biblio as web service; few more implementation and testing; BioPerl is most complete; Java is complete but needs commenting; Python has also made good progress.
Chris Mungal: BioSQL core is pretty much done; BioPerl DB both Postgres and MySQL. Also will be working on ontology module for both BioSQL and BioPerl.
Brian Gilman: BioJava hooks and BioSQL backended. Will put up DAS server at home that serves anything that is in BioSQL. Will make ER diagram from BioSQL DDL and put on web so people can see it.
Elia Stupka: Registry in BioPerl.
Promised world-accessible BioSQL server of EMBL will be put up when he gets home.
Ewan Birney: BioPerl CORBA, memory caching in BioPerl. Performance enhancement of parsers.
Katayama Toshiaki: BioRuby BioSQL, BioFetch client and server, and Registry.
Jason Staijch: BioPerl CORBA and prepping for 1.0 release of BioPerl. Make decisions on how to release. Will include registry, index, Medline, parser, etc. Hopefully people will hammer on it and will get feedback from people who aren't such good programmers.
Thomas Down: BioJava - all this stuff was cut off from 1.2. BioJava should release 1.3 in ~2 months or less to include this stuff. Testing on BioSQL, working nicely. Tidied up BioJava registry code, added access point for normal users. BioFetch client/server and CORBA interoperability.
Mark Wilkinson: GenQuire on Windows bugs have been squashed and put up for download. BioSQL server. GenQuire works with schema if only one contig is specified. All genes displayed in + strand. May not all be in GenQuire, might be in BioSQL adapter.
Chris Dadigidan: Some BioPerl; working BioSQL in Boston with GenBank. Love to start playing with client/server stuff cross-language. Documenting for system admin.
Andrew Dalke: Flatfile indexing with flat file and BerkeleyDB; regression testing.
Matthew Pocock: Java flatfile indexer; debugging and performance enhancing.
Brad Chapman: Registry in Python, BioSQL indexing, BioCORBA and regular http are all hooked in to get from one interface. BioSQL is all set with new schema.

Technical Details
-----------------

The OBDA specifications are available via anonymous cvs from cvs.bioperl.org, /home/repository/obf-common, cvs module obda-specs. We will have a web page off open-bio.org soon and we are hoping to publish a paper on OBDA this year.

In brief, the 6 implementations are:

(1) Flat file, raw index. This implementation requires no technology beyond reading files.
It works off sorted, fixed-length records with byte offsets into a flat file dump of sequences.

(2) Flat file, Berkeley DB. This implementation uses the same data model as the flat file index, but with Berkeley DB as the back-end store holding the byte offsets into a flat file dump of sequences.

(3) BioFetch. This is a simple protocol serving EMBL/GenBank/Fasta format over HTTP, where clients have to provide a suitably formatted query string and the server responds with the entry as an ASCII stream over HTTP.

(4) XEMBL. This is a SOAP protocol with the data format being one of Agave XML or BSML XML. We are debating how much stress we should put on this, as BioFetch currently seems to work more cleanly for us.

(5) BioCORBA. We use the BSANE/BioCORBA 0.5 spec and did cross-platform testing.

(6) BioSQL. A relational schema which we tested on both MySQL and Postgres. This was perhaps the project which stretched our conceptual understanding of the area the most, with gratifying results as we round-tripped information between the different projects.

Finally there is a simple discovery system (called the "Registry") which associates database namespaces (e.g. EMBL) with implementations (e.g. BioSQL at this location). The Registry is found by searching the path

  $HOME/.bioinformatics/seqdatabase.ini
  /etc/bioinformatics/seqdatabase.ini
  http://www.open-bio.org/registry/seqdatabase.ini

The aim here is to have a path of personal, local and internet-wide specifications for where databases can be found. The internet accessible registry will mean that just by installing BioPerl users will get transparent (if potentially a little slow) access to databases. We expect the "local" configuration mode to be the most widely used across bioinformatics installations. (The web accessible registry is currently just a testing version. Once we have built up the correct services worldwide we will replace it with a set of internet accessible services.)

We will soon be setting up (Chris - is it up already?)
a mailing list explicitly for cross-project work, in particular to allow the development of the common data/concept model to be put in place.

-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<[EMAIL PROTECTED]>.
-----------------------------------------------------------------
_______________________________________________
Biojava-l mailing list  -  [EMAIL PROTECTED]
http://biojava.org/mailman/listinfo/biojava-l
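
P.S. To make the Registry search path from the Technical Details concrete, here is a minimal Python sketch of how a client might pick which seqdatabase.ini to use. The three locations are the ones listed in this mail. Everything else is an assumption for illustration: the function names (registry_candidates, find_registry), the home and etc_root parameters, and the "first location found wins" rule. It does not fetch or parse the file, since the file format is not described here.

```python
import os
from pathlib import Path

# File name and internet-wide location as given in this mail.
REGISTRY_FILENAME = "seqdatabase.ini"
INTERNET_REGISTRY = "http://www.open-bio.org/registry/seqdatabase.ini"


def registry_candidates(home=None, etc_root="/etc"):
    """Return the registry locations in search order: personal, local, internet.

    `home` and `etc_root` are hypothetical parameters, added only so the
    lookup can be pointed at scratch directories for testing.
    """
    home_dir = Path(home) if home is not None else Path.home()
    return [
        str(home_dir / ".bioinformatics" / REGISTRY_FILENAME),
        str(Path(etc_root) / "bioinformatics" / REGISTRY_FILENAME),
        INTERNET_REGISTRY,
    ]


def find_registry(home=None, etc_root="/etc"):
    """Return the first usable registry location (assumed: first found wins).

    Local files are used only if they exist. The URL is the last entry and is
    returned unchecked as the fallback; no network access happens here.
    """
    for location in registry_candidates(home, etc_root):
        if location == INTERNET_REGISTRY or os.path.isfile(location):
            return location


if __name__ == "__main__":
    # Prints whichever location applies on the machine running this script.
    print(find_registry())
```

Called with no arguments, find_registry() checks the real home directory and /etc before falling back to the internet-wide registry, which matches the expectation that the "local" configuration is the common case.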
