Hi all,

This is a two-pronged change proposal - first to allow BioJava to make correct 
use of the bioentry_dbxref tables in BioSQL, and second to allow it to parse 
reference information correctly from EMBL, Genbank, Genpept, GenXML, and 
SwissProt records and store them within Sequence objects in a consistent manner.

Currently, references are loaded from only some of the above formats. Depending 
on the format, they are stored in different ways within the Sequence object. 

Genbank references are stored with each line of the record as a separate 
annotation, e.g. one annotation with the key REFERENCE and a value giving the 
location, another with the key AUTHOR and a value listing the authors, and so 
on. As simple String/String annotations, they get persisted to the 
bioentry_qualifier_value table in BioSQL. As multiple references are read, they 
get stored under the same keys, so you end up with annotations for these keys 
containing ArrayLists of potentially different lengths, depending on which of 
the original references included which optional fields (e.g. PUBMED or 
MEDLINE). This makes it impossible to accurately reconstruct the original 
reference information when exporting the sequence to a file.
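
To make the mismatch concrete, here is a minimal sketch in plain Java. It does 
not use any BioJava classes: the map only mimics how repeated String keys 
collect their values into a list, and the class name, method names and 
reference values are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FlattenedReferences {
    /** Adds a value under a key, collecting repeated keys into a list. */
    static void add(Map<String, List<String>> ann, String key, String value) {
        ann.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** Builds the flattened annotation for two made-up references. */
    static Map<String, List<String>> load() {
        Map<String, List<String>> ann = new LinkedHashMap<>();
        // First reference: no PUBMED line in the record.
        add(ann, "REFERENCE", "1 (bases 1 to 100)");
        add(ann, "AUTHOR", "Smith,J.");
        // Second reference: has the optional PUBMED line.
        add(ann, "REFERENCE", "2 (bases 1 to 200)");
        add(ann, "AUTHOR", "Jones,K.");
        add(ann, "PUBMED", "1234567");
        return ann;
    }

    public static void main(String[] args) {
        Map<String, List<String>> ann = load();
        System.out.println(ann.get("REFERENCE").size()); // prints 2
        System.out.println(ann.get("PUBMED").size());    // prints 1
    }
}
```

With two REFERENCE values but only one PUBMED value, nothing in the stored data 
says which reference the PubMed ID belongs to.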

EMBL/Swissprot references do almost the same thing, except that the parser 
gathers up the various reference tags from the file and wraps each set in its 
own ReferenceAnnotation object. This is just a map, which gets flattened out 
and persisted to bioentry_qualifier_value as String/String annotation pairs as 
above. When loaded back in from BioSQL, the ReferenceAnnotation objects are not 
recreated, so you end up with the same ArrayList problem as above and the same 
inability to export the sequence to a file correctly.

Another problem is that each exporter only understands the representation 
produced by the parser for its own file format. So, the Genbank exporter cannot 
export references that were loaded from EMBL/Swissprot, and vice versa.

Not good!

So, I propose the following:

        1) Change the file format parsers above to create, when reading 
sequences from file, an org.biojava.bibliography.BibRef object for each 
reference read. This object can then be stored against the Sequence as an 
annotation, with the key of BibRef.class. As with all other kinds of 
annotation, if multiple references are loaded then the value of the annotation 
should be an ArrayList of the various BibRef objects. If only one reference is 
loaded, then the value should be the single BibRef object itself.
        2) Change the file format parsers above to understand, when writing 
sequences to file, how to convert BibRef annotations into their own formats.
        3) There is no restriction on which of the established BibRef subtypes 
from org.biojava.bibliography.* you can actually use to annotate the sequence. 
Usually you'll be wanting a BiblioJournalArticle object. However, you MUST use 
certain fields as follows:
                a) use the 'identifier' field to store the PubMed or MedLine ID 
(purely the ID, not prefixed with anything).
                b) use the 'publisher' field to store a BiblioOrganisation 
object with name set to 'PUBMED' or 'MEDLINE' as appropriate (must be upper 
case - if not, it will get changed to upper case on persistence to BioSQL, so 
you might as well stick it in upper case to start with).
                c) use the 'type' field to store a TYPE_* value from 
BibRefSupport to indicate what sort of resource this reference refers to (in 
most cases you'll want TYPE_JOURNAL_ARTICLE).
        4) Alter BioSQLSequenceDB.persistBioentryProperty() to check for 
annotations with the key of BibRef.class or any of its established subtypes as 
above, and use special behaviour to persist these to the bioentry_dbxref table 
(and related tables as appropriate).
        5) Alter BioSQLSequenceAnnotation.initAnnotations() to check for and 
load the bioentry_dbxref data as BibRef.class annotations.
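
As a rough sketch of the conventions in points 1 and 3, the following plain 
Java shows the single-object-or-ArrayList rule and the three required fields. 
It deliberately avoids the real BioJava classes: Ref is a hypothetical stand-in 
for BibRef, a plain Map stands in for the Sequence annotation, and the type 
string is a placeholder for whatever the BibRefSupport TYPE_* constant holds.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class BibRefAnnotationSketch {
    /** Hypothetical stand-in for BibRef, holding only the fields from point 3. */
    static final class Ref {
        final String identifier; // bare PubMed/Medline ID, no prefix
        final String publisher;  // "PUBMED" or "MEDLINE"
        final String type;       // placeholder for a BibRefSupport TYPE_* value

        Ref(String identifier, String publisher, String type) {
            this.identifier = identifier;
            // Upper-case up front, since persistence would upper-case it anyway.
            this.publisher = publisher.toUpperCase(Locale.ROOT);
            this.type = type;
        }
    }

    /** Stores a reference under the Ref.class key: one object, or a list if several. */
    static void addReference(Map<Object, Object> annotation, Ref ref) {
        Object existing = annotation.get(Ref.class);
        if (existing == null) {
            annotation.put(Ref.class, ref);
        } else if (existing instanceof List) {
            @SuppressWarnings("unchecked")
            List<Ref> refs = (List<Ref>) existing;
            refs.add(ref);
        } else {
            // Second reference: promote the single value to a list.
            List<Ref> refs = new ArrayList<>();
            refs.add((Ref) existing);
            refs.add(ref);
            annotation.put(Ref.class, refs);
        }
    }

    /** Reads references back as a list, whichever shape was stored. */
    static List<Ref> references(Map<Object, Object> annotation) {
        Object value = annotation.get(Ref.class);
        List<Ref> out = new ArrayList<>();
        if (value instanceof List) {
            for (Object o : (List<?>) value) {
                out.add((Ref) o);
            }
        } else if (value != null) {
            out.add((Ref) value);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Object, Object> annotation = new HashMap<>();
        addReference(annotation, new Ref("1234567", "pubmed", "JournalArticle"));
        addReference(annotation, new Ref("7654321", "MEDLINE", "JournalArticle"));
        System.out.println(references(annotation).size()); // prints 2
    }
}
```

A file writer (point 2) or the BioSQL persistence code (point 4) could then 
call something like references() and never care whether one or many 
references were stored, while each reference keeps its own ID together with its 
other fields.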

Any suggestions/changes/volunteers/violent objections? I can manage steps 4 and 
5 myself quite easily, but will need help from everyone out there in updating 
the file parsers to use this proposed mechanism.

cheers,
Richard

Richard Holland
Bioinformatics Specialist
Genome Institute of Singapore
60 Biopolis Street, #02-01 Genome, Singapore 138672
Tel: (65) 6478 8000   DID: (65) 6478 8199
Email: [EMAIL PROTECTED]

