Hi Ulrik,

sorry for the slow response. The GenBank parser is a very recent feature (see https://github.com/biojava/biojava/pull/41). As such, I am not surprised that some details are still missing. The second issue you are running into is that our feature framework is not as well developed as it should be. Did you see this?
http://www.biojava.org/docs/api/org/biojava3/core/sequence/features/package-summary.html

So far it seems only the UniProt parser supports database cross-references. Perhaps we can extend the GenBank parser in a similar way?

Andreas

On Fri, Sep 13, 2013 at 1:04 PM, Ulrik Stervbo <[email protected]> wrote:

> Dear List,
>
> For a smaller project of mine I have written a GenBank parser to read and
> save GenBank files. I would like to share the code, but I am having a hard
> time finding my way around the BioJava source (I have no experience in
> larger software projects).
>
> I have noticed in GenbankSequenceParser.java that various GenBank
> entries are ignored. These are KEYWORDS, SOURCE, REFERENCE, and
> COMMENT. Is this true, or am I missing something? It further seems that the
> qualifiers for each feature are ignored. Again, I may be missing something.
>
> Is this because the Sequence object cannot handle this information?
>
> In general, it seems that the current GenBank parser is ignoring a lot of
> information: accession numbers other than the first one, the GI version, and
> the date of the submission (clever use of regex to parse the first line - I
> didn't think of that, but was inspired by the more crude approach of
> BioPerl, to be able to handle slightly malformed first lines).
>
> The parser I have written extracts all the information described in the
> GenBank format description (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt),
> even if the file is not well formed. Saving the GenBank file results in a
> well-formed GenBank file. The only requirement in my parser is that the 3
> blocks Annotation, Features and Sequence are in the correct order. My parser
> returns a list of SequenceObject, and is thus capable of handling several
> GenBank entries in a single file (as hinted in the GenBank format
> description).
>
> Like the current GenBank parser in BioJava, I have not implemented handling
> of the CONTIG element.
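To make the multi-entry handling described above concrete, here is a minimal sketch, not code from BioJava or from Ulrik's parser. It relies only on the fact that each entry in a GenBank flat file is terminated by a line containing just "//". The class name GenbankRecordSplitter and the method split are hypothetical, chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of BioJava: splits the lines of a GenBank
// flat file into one list of lines per entry.
final class GenbankRecordSplitter {

    private GenbankRecordSplitter() {
    }

    static List<List<String>> split(List<String> lines) {
        List<List<String>> records = new ArrayList<List<String>>();
        List<String> current = new ArrayList<String>();
        for (String line : lines) {
            if (line.trim().equals("//")) {
                // "//" terminates the current entry; the terminator is not stored.
                if (!current.isEmpty()) {
                    records.add(current);
                    current = new ArrayList<String>();
                }
            } else if (current.isEmpty() && line.trim().isEmpty()) {
                // Skip blank lines that appear before an entry starts.
                continue;
            } else {
                current.add(line);
            }
        }
        // Tolerate a file whose last entry lacks the closing "//".
        if (!current.isEmpty()) {
            records.add(current);
        }
        return records;
    }
}
```

Each inner list could then be handed to a per-entry parser, so a file with several entries yields several sequence objects.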
>
> My implementation is slightly different, and probably less efficient than
> the current one, as mine uses a lot of while loops. The advantage of this
> is that the assumptions are limited.
>
> The first block of the GenBank file is the most complex, consisting of
> several keywords which can occur several times and span several lines. For
> each of the recurring keywords, a List is generated, and for those (few)
> keywords which can occur only once a String or int is returned.
>
> The keywords SOURCE and REFERENCE are more complex, as they also
> contain subkeywords. I deal with this by storing them in a list
> of hashmaps.
>
> My parser reads locations in all their complexity, including joins with
> different accession ids. All qualifiers are stored in a LinkedHash. (I just
> realized this was a bad idea and will change it to a List, to keep
> the original order and allow repeated qualifier keys.)
>
> The writer looks for elements in a specific order and adds appropriate
> whitespace to generate a well-formed GenBank file. With all my example
> files, the output is an exact copy of the input (checked with the diff
> command).
>
> If I can get some pointers on how to integrate this in the current codebase,
> I would be happy to start adding.
>
> I have no idea what elements other file formats provide, and how this
> can be unified, but I am open for discussion.
>
> Cheers,
> Ulrik
>
> PS. My project also includes drawing linear and circular sequences with
> features. Is there a side project for these things running? I have seen
> some drawing of linear sequences, but could not get to the project. For my
> drawing of circular sequences, I would borrow and lift from PlasMapper,
> which cannot be directly utilized due to some design decisions in
> PlasMapper.
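On the qualifier storage mentioned above (a List instead of a map, so that order is kept and keys may repeat), a rough sketch of what that could look like follows. It is an assumption about the design, not Ulrik's actual code and not an existing BioJava class. The name QualifierList and its methods are hypothetical.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical container, not part of BioJava: keeps feature qualifiers in
// the order they were read and allows the same key to occur more than once.
final class QualifierList {

    private final List<Map.Entry<String, String>> entries =
            new ArrayList<Map.Entry<String, String>>();

    // Appends a qualifier; an earlier entry with the same key is kept.
    void add(String key, String value) {
        entries.add(new SimpleImmutableEntry<String, String>(key, value));
    }

    // Returns every value stored under the key, in input order.
    List<String> values(String key) {
        List<String> result = new ArrayList<String>();
        for (Map.Entry<String, String> entry : entries) {
            if (entry.getKey().equals(key)) {
                result.add(entry.getValue());
            }
        }
        return result;
    }

    // Read-only view of all qualifiers in input order, e.g. for a writer.
    List<Map.Entry<String, String>> all() {
        return Collections.unmodifiableList(entries);
    }
}
```

A feature carrying two /db_xref qualifiers would then keep both values, which a map keyed by qualifier name would collapse into one, and a writer iterating over all() could reproduce the original order.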
> _______________________________________________
> Biojava-l mailing list - [email protected]
> http://lists.open-bio.org/mailman/listinfo/biojava-l
