>I have come across a problem with genbank files using the perl module >Bio::DB::GenBank. When I get the genbank sequence from NCBI and write the >sequence out to in genbank format the Locus line is missing the date.
Let me take a stab at answering this. What does it mean for something to be in the GenBank format? There are many answers. All start with the format definition given in the release notes, at ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt This document is very specific. For example, here is the current definition for the LOCUS line LOCUS 16Char_LocusName 99999999999 bp ss-snoRNA circular DIV DD-MMM-YYYY Positions Contents --------- -------- 01-05 LOCUS 06-12 spaces 13-28 Locus name 31-31 space 30-40 Length of sequence, right-justified 41-41 space 42-43 bp 44-44 space 45-47 spaces, ss- (single-stranded), ds- (double-stranded), or ms- (mixed-stranded) 48-53 NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA), mRNA (messenger RNA), uRNA (small nuclear RNA), snRNA, snoRNA. Left justified. 54-55 space 56-63 'linear' followed by two spaces, or 'circular' 64-64 space 65-67 The division code (see Section 3.3) 68-68 space 69-79 Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991) (Note that this is the new LOCUS line format as the format changed last month.) The document makes no statement that those fields are optional, so in that regard the Bioperl 'genbank' output is not in GenBank format. On the other hand, BioJava is incorrect for saying the problem with the LOCUS line. If you read the format specification, section 3.1 says a GenBank file starts with a few header lines The first line of the file contains the file name in character positions 1 to 9 and the full database name (Genetic Sequence Data Bank) starting in column 22. The brief names of the files in this release are listed in section 2.2. The second line contains the date of the current release in the form `day month year', beginning in position 27. The fourth line contains the current GenBank release number. so BioJava should complain that the first few lines of the file should look like GBBCT1.SEQ Genetic Sequence Data Bank 15 December 2001 Bioperl doesn't generate GenBank output but then again BioJava will try to make some sense from non-GenBank files. There are two difficulties here. Most people face the parsing problem while fewer need to deal with the writing problem. There are two aspects to the parsing problem: there are a lot of formats, and each contains many fields, of which only a few are needed. Few people enjoy writing parsers so odds are that when you (or your friendly neighborhood bio* dealers) need to support a new format, you'll only read the essential fields from a file and ignore the other fields. If the input file is in the expected format then this works fine. The other aspect of this problem is the input doesn't need to be in the right format to be parsed. For example, you can add new fields to a SWISS-PROT entry and the BioJava parser will ignore the fields. This is good and bad. The good part is it allows a certain amount of future compatibility. ExPASy can add new line types to the file and BioJava will still read it. But the change might be unexpected. For example, SWISS-PROT once only allowed a single AC line. That changed in release 39, as mentioned in the release notes at: http://www.expasy.ch/txt/old-rel/relnotes.38.txt That silently broke the Bioperl code, because it assumed only a single AC line, and it ended up only saving the accession numbers from the last line. The patch to fix that error was added last year and is at http://www.bioperl.org/pipermail/bioperl-guts-l/2000-December/002624.html For an example that affect the Biopython code, my genbank parser requires the old-style LOCUS line and the parser is column based, so the fields must be in exactly the correct location. It will break with the new-style LOCUS line mentioned above, but if I used a field based parser instead of a column based parser it wouldn't have broken. Until this release there was no guarantee in how the LOCUS line would change over time, so I believe that breaking loudly is better then breaking silently. People take advantage of the leniency of the various readers. Structure people have to work with PDB files. The PDB defines dozens of different card types but for structure visualization and manipulation you can start by supporting only the ATOM, HETATM, and possibly END and ENDMDL fields. This is well known, and a lot of software will work fine with only those fields defined. Indeed, generating a well-formed PDB file is a lot of work. There are perhaps a half-dozen programs capable of doing so, and they are all related to PDB submission (XPLOR, O, etc.) The leniency applies to the GenBank format as well. I just did a google search for "genbank file format" and came up with: http://www.icp.ucl.ac.be/~opperd/private/cox2seq.html which claims "COX2 sequences in GenBank format." But the entries look like this LOCUS cox2_leita 176 bp DEFINITION cox2_leita 421 bp, 176 bases, F21F24E0 checksum. ORIGIN 1 MAFILSFWMI FLLDSVIVLL SFVCFVCVWI CALLFSTVLL VSKLNNIYCT 51 WDFTASKFID VYWFTIGGMF SLGLLLRLCL LLYFGHLNFV SFDLCKVVGF 101 QWYWVYFIFG ETTIFSNLIL ESDYMIGDLR LLQCNHVLTL LSLVIYKLWL 151 SAVDVIHSFA ISSLGVKVEN LVAVMK // This isn't in GenBank format. Some parsers are lenient enough to read this. Others are not. (There are a few other aspects to the parsing problem I just missing. Some people extend a format, to add support for new data typs. For example, http://www.tigr.org/~jeisen/GDE/AppendixA.html The Genbank character set has also been extended in order to support these other data types. In other cases the format documentation is incorrect, or at least lacking, while in still others the database is in an obviously incorrect format.) The writing problem. Assume you've read the data from one format and want to generate it in another format. You don't necessarily have all the information needed for that. Either you have to say it can't be done or you have to lie and generate invalid data. You can make a direct lie and generate false information, and make a false date for the LOCUS line. Or you can lie through omission and not put anything there (eg, use spaces or generate a short line). The first produces syntactically valid output, but that false data may accidentally be treated as valid. The second produces incorrectly formatted data but is less likely to corrupt your other data. (I use "lie" somewhat provocatively. In reality the software should respond to a request for something in GenBank format by saying it can't generate a GenBank file but can generate something which is GenBank-like, and somehow describe the differences. Documenting this nuance and those differences is itself hard and tedious, which is why it is rarely done outside of pointers to source code.) In then happens that there's a de facto consensus amoung those in the know about what constitutes a "correct" GenBank file. This consensus isn't documented and is learned mostly from experience. (Eg, a common consensus is to skip everything in the file up to the LOCUS line and still call it a GenBank file. This supports both NCBI distributed GenBank data files and the "sequence of GenBank entries" format generated by most software.) I don't like the current situation. I think it makes it too hard for people to work with these files, especially since mastering file formats doesn't advance science one whit. So I've been working for the last couple years on ways to improve it, and think I've a solution, which I've been contributing to biopython. But that's going to be several long emails in its own right - I'm planning on presenting it publically for the first time at the Tuscon conferences. >Is the date a required element in the Locus line? Is there consensus >on what constitutes correct format? Has it changed recently? Yes. Mostly, but not really. No, but the rest of the line has. (Though that's not the problem.) >I also noticed that the biojava parser is very picky about the number of >spaces; delete a few spaces between DNA and linear and it dies too. Under the old-style definition that was an appropriate choice, since the fields in that line were defined by column positions. New-style also has column definitions, but says that a field based interpretation will always be valid for the future. Andrew [EMAIL PROTECTED] _______________________________________________ Biojava-l mailing list - [EMAIL PROTECTED] http://biojava.org/mailman/listinfo/biojava-l