Dear List,

For a smaller project of mine I have written a GenBank parser to read and save GenBank files. I would like to share the code, but I am having a hard time finding my way around the BioJava source (I have no experience with larger software projects).
I have noticed that GenbankSequenceParser.java ignores various GenBank entries, namely KEYWORDS, SOURCE, REFERENCE, and COMMENT. Is this true, or am I missing something? It further seems that the qualifiers for each feature are ignored. Again, I may be missing something. Is this because the Sequence object cannot handle this information?

In general, it seems that the current GenBank parser ignores a lot of information: accession numbers other than the first one, the GI-version, and the date of the submission. (The use of a regex to parse the first line is clever. I didn't think of that; I was inspired by the cruder approach of BioPerl, so as to be able to handle slightly malformed first lines.)

The parser I have written extracts all the information described in the GenBank format description (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt), even if the file is not well formed. Saving the file results in a well-formed GenBank file. The only requirement of my parser is that the three blocks Annotation, Features and Sequence are in the correct order. My parser returns a list of SequenceObject, and is thus capable of handling several GenBank entries in a single file (as hinted in the GenBank format description). Like the current GenBank parser in BioJava, I have not implemented handling of the CONTIG element.

My implementation is slightly different from, and probably less efficient than, the current one, as mine uses a lot of while loops. The advantage is that it makes few assumptions about the input.

The first block of the GenBank file is the most complex, consisting of several keywords which can occur several times and span several lines. For each of the recurring keywords a List is generated, and for those (few) keywords which can occur only once a String or int is returned. The keywords SOURCE and REFERENCE are more complex, as they also contain sub-keywords. I deal with this by storing them in a list of hashmaps.
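The storage scheme described above might be sketched roughly as follows. This is not the parser's actual code. The class name GenBankHeaderSketch and the method parseHeader are hypothetical, and for brevity the sketch stores every keyword, not only SOURCE and REFERENCE, as a list of maps. It assumes the usual GenBank layout, in which the keyword occupies the first 12 columns, sub-keywords are indented, and continuation lines leave the keyword columns blank.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GenBankHeaderSketch {

    // Assumption: the keyword occupies the first 12 columns and the
    // value starts at column 13, as in a well-formed annotation block.
    private static final int VALUE_COLUMN = 12;

    /**
     * Reads the annotation block into keyword -> list of occurrences.
     * Each occurrence is a map holding the keyword's own value plus
     * the values of any sub-keywords that follow it.
     */
    public static Map<String, List<Map<String, String>>> parseHeader(List<String> lines) {
        Map<String, List<Map<String, String>>> header = new LinkedHashMap<>();
        Map<String, String> entry = null; // occurrence currently being read
        String key = null;                // keyword or sub-keyword receiving continuation lines

        for (String line : lines) {
            if (line.startsWith("FEATURES")) {
                break; // the annotation block ends where the feature table starts
            }
            if (line.trim().isEmpty()) {
                continue; // ignore blank lines
            }
            boolean hasValue = line.length() > VALUE_COLUMN;
            String tag = (hasValue ? line.substring(0, VALUE_COLUMN) : line).trim();
            String value = hasValue ? line.substring(VALUE_COLUMN).trim() : "";

            if (!Character.isWhitespace(line.charAt(0))) {
                // Top-level keyword: open a new occurrence so repeats are kept.
                entry = new LinkedHashMap<>();
                entry.put(tag, value);
                header.computeIfAbsent(tag, k -> new ArrayList<>()).add(entry);
                key = tag;
            } else if (entry == null) {
                continue; // indented line before any keyword: skipped in this sketch
            } else if (!tag.isEmpty()) {
                // Indented sub-keyword, e.g. ORGANISM or AUTHORS.
                entry.put(tag, value);
                key = tag;
            } else {
                // Continuation line: append to the most recent (sub-)keyword.
                entry.merge(key, value, (a, b) -> a.isEmpty() ? b : a + " " + b);
            }
        }
        return header;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "KEYWORDS    alpha; beta;",
            "            gamma.",
            "REFERENCE   1",
            "  AUTHORS   Doe,J.",
            "  TITLE     A first",
            "            title",
            "REFERENCE   2",
            "  AUTHORS   Roe,R.",
            "FEATURES             Location/Qualifiers");
        Map<String, List<Map<String, String>>> header = parseHeader(lines);
        System.out.println(header.get("REFERENCE").size());
        System.out.println(header.get("REFERENCE").get(0).get("TITLE"));
    }
}
```

Insertion-ordered maps are used so that a writer could emit keywords and sub-keywords in the order in which they were read.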
My parser reads locations in all their complexity, including joins with different accession IDs. All qualifiers are stored in a LinkedHash. (I just realized this was a bad idea and will change it to a List, to keep the original order and allow repeated qualifier keys.)

The writer looks for elements in a specific order and adds the appropriate whitespace to generate a well-formed GenBank file. With all my example files, the output is an exact copy of the input (checked with the diff command).

If I can get some pointers on how to integrate this into the current codebase, I would be happy to start adding it. I have no idea what elements other file formats provide, or how this can be unified, but I am open to discussion.

Cheers,
Ulrik

PS. My project also includes drawing linear and circular sequences with features. Is there a side project for these things running? I have seen some drawing of linear sequences, but could not find the project. For my drawing of circular sequences, I would borrow heavily from PlasMapper, which cannot be used directly due to some of its design decisions.

_______________________________________________
Biojava-l mailing list
http://lists.open-bio.org/mailman/listinfo/biojava-l
