That'd be nice, except the DTD has bugs in it! I've pointed this out to them already but no fixes have been made yet.
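For anyone who wants to try the "does it validate against the DTD" check anyway, here is a rough sketch using the validating SAX parser that ships with the JDK. The class name DtdCheck and the three-element TOY_DTD are made up for illustration; the toy DTD is hand-written and is not the real INSDSeq DTD. In practice the DOCTYPE would point at whichever DTD file you are testing against.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of BioJava.
class DtdCheck {

    // Toy DTD (an assumption for this demo): it knows nothing about
    // <INSDReference_position>, much like a 1.3-era DTD would not.
    static final String TOY_DTD =
        "<!DOCTYPE INSDReference [\n"
        + "  <!ELEMENT INSDReference (INSDReference_reference, INSDReference_journal)>\n"
        + "  <!ELEMENT INSDReference_reference (#PCDATA)>\n"
        + "  <!ELEMENT INSDReference_journal (#PCDATA)>\n"
        + "]>\n";

    /** Parses the XML with DTD validation on; returns the problems found (empty if valid). */
    static List<String> validate(String xml) {
        final List<String> problems = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // validate against the DTD in the DOCTYPE
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    // Validity errors arrive here; the default handler ignores them.
                    problems.add("line " + e.getLineNumber() + ": " + e.getMessage());
                }
            });
        } catch (Exception e) {
            // Not well-formed, or the parser could not be set up.
            problems.add("fatal: " + e.getMessage());
        }
        return problems;
    }

    public static void main(String[] args) {
        String doc = TOY_DTD
            + "<INSDReference>"
            + "<INSDReference_reference>1</INSDReference_reference>"
            + "<INSDReference_position>1..16732</INSDReference_position>"
            + "<INSDReference_journal>Unpublished</INSDReference_journal>"
            + "</INSDReference>";
        for (String problem : validate(doc)) {
            System.out.println(problem);
        }
    }
}
```

Of course this only tells you whether a file agrees with the DTD as published, so it inherits whatever bugs the DTD has.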
On Wed, 2006-06-07 at 17:09 +0800, [EMAIL PROTECTED] wrote:
> Presumably the XML it produces should validate against the DTD? It should
> also parse anything that validates against the DTD. I think that would be
> the baseline for the behaviour of the parser.
>
> Richard Holland <[EMAIL PROTECTED]>
> Sent by: [EMAIL PROTECTED]
> 06/07/2006 05:01 PM
>
> To: Seth Johnson <[EMAIL PROTECTED]>
> cc: [email protected], (bcc: Mark Schreiber/GP/Novartis)
> Subject: Re: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
>
> OK, I've updated INSDseqFormat to 1.4, or my interpretation of it based
> on what the guys next door told me. Please let me know if you have
> trouble running the XML it produces through any other parsers that can
> read it, or if it throws a wobbly whilst reading stuff you are 100% sure
> is valid.
>
> cheers,
> Richard
>
> On Mon, 2006-06-05 at 12:28 -0400, Seth Johnson wrote:
> > I agree with you on that one. However, the problem might be a little
> > deeper. The same '?' appears in the INSDseq format bounded by
> > <INSDReference_reference> tags and causes the following exception.
> > This tells me that the '?' are actually values that are being
> > incorrectly parsed. Further examination of the .dtd reveals that
> > INSDseqFormat.java is tailored towards INSDSeq v. 1.3, whereas the
> > files I obtain are in INSDSeq v. 1.4 (which, among other things,
> > contains a new tag <INSDReference_position>). Here are links to both
> > .dtd's:
> >
> > http://www.ebi.ac.uk/embl/Documentation/DTD/INSDSeq_v1.3.dtd.txt
> >
> > http://www.ebi.ac.uk/embl/Documentation/DTD/INSDC_V1.4.dtd.txt
> >
> > I think it might be worth accommodating changes for the INSDseq
> > format, not sure how that would affect the '?' in Genbank.
> >
> > Seth
> >
> > ======================
> > org.biojava.bio.BioException: Could not read sequence
> >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> >     at exonhit.parsers.GenBankParser.main(GenBankParser.java:348)
> > Caused by: org.biojava.bio.seq.io.ParseException: org.biojava.bio.seq.io.ParseException: Bad reference line found: ?
> >     at org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:250)
> >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> >     ... 1 more
> > Caused by: org.biojava.bio.seq.io.ParseException: Bad reference line found: ?
> >     at org.biojavax.bio.seq.io.INSDseqFormat$INSDseqHandler.endElement(INSDseqFormat.java:901)
> >     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:633)
> >     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1241)
> >     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1685)
> >     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368)
> >     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834)
> >     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
> >     at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
> >     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242)
> >     at javax.xml.parsers.SAXParser.parse(SAXParser.java:375)
> >     at org.biojavax.utils.XMLTools.readXMLChunk(XMLTools.java:97)
> >     at org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:246)
> >     ... 2 more
> > Java Result: -1
> > ======================
> >
> > ~~~~~~~~~~~~~~~~~~~~~~
> > <INSDSeq_references>
> >   <INSDReference>
> >     <INSDReference_reference>?</INSDReference_reference>
> >     <INSDReference_position>1..16732</INSDReference_position>
> >     <INSDReference_authors>
> >       <INSDAuthor>Bjornerfeldt,S.</INSDAuthor>
> >       <INSDAuthor>Webster,M.T.</INSDAuthor>
> >       <INSDAuthor>Vila,C.</INSDAuthor>
> >     </INSDReference_authors>
> >     <INSDReference_title>Relaxation of Selective Constraint on Dog Mitochondrial DNA Following Domestication</INSDReference_title>
> >     <INSDReference_journal>Unpublished</INSDReference_journal>
> >   </INSDReference>
> >   <INSDReference>
> >     <INSDReference_reference>?</INSDReference_reference>
> >     <INSDReference_position>1..16732</INSDReference_position>
> >     <INSDReference_authors>
> >       <INSDAuthor>Bjornerfeldt,S.</INSDAuthor>
> >       <INSDAuthor>Webster,M.T.</INSDAuthor>
> >       <INSDAuthor>Vila,C.</INSDAuthor>
> >     </INSDReference_authors>
> >     <INSDReference_journal>Submitted (06-APR-2006) to the EMBL/GenBank/DDBJ databases. Evolutionary Biology, Evolutionary Biology, Norbyvagen 18D, Uppsala 752 36, Sweden</INSDReference_journal>
> >   </INSDReference>
> > </INSDSeq_references>
> > ~~~~~~~~~~~~~~~~~~~~~~
> >
> > On 6/5/06, Richard Holland <[EMAIL PROTECTED]> wrote:
> > > Hmmm... interesting. I _could_ put in a special case that ignores the
> > > question marks, but that wouldn't be 'nice' really - this is more of a
> > > problem with the program that is producing the Genbank files than a
> > > problem with the parser trying to read them. '?' is not a valid tag in
> > > the official Genbank format, and has no meaning attached to it that I
> > > can work out, so I'm reluctant to make the parser recognise it.
> > >
> > > I'd suggest you contact the people who write the software you are using
> > > to produce the Genbank files and ask them if they could stick to the
> > > rules!
> > >
> > > In the meantime you could work around the problem by stripping the
> > > question marks in some kind of pre-processor before passing it onto
> > > BioJavaX for parsing.
> > >
> > > cheers,
> > > Richard
> > >
> > > On Mon, 2006-06-05 at 11:39 -0400, Seth Johnson wrote:
> > > > Removing '?' (or several of them in my case) avoids the following exception:
> > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > org.biojava.bio.BioException: Could not read sequence
> > > >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > > >     at exonhit.parsers.GenBankParser.main(GenBankParser.java:348)
> > > > Caused by: org.biojava.bio.seq.io.ParseException: DQ415957
> > > >     at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245)
> > > >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > > >     ... 1 more
> > > > Java Result: -1
> > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > I don't know where that previous tokenization problem came from since
> > > > I can no longer reproduce it. This time it's more or less
> > > > straightforward.
> > > > Here's the original file with question marks:
> > > > ============================
> > > > LOCUS       DQ415957                1437 bp    mRNA    linear   VRT 01-JUN-2006
> > > > DEFINITION  Danio rerio capillary morphogenesis protein 2A (cmg2a) mRNA,
> > > >             complete cds.
> > > > ACCESSION   DQ415957
> > > > VERSION     DQ415957.1  GI:89513612
> > > > KEYWORDS    .
> > > > SOURCE      Unknown.
> > > >   ORGANISM  Unknown.
> > > >             Unclassified.
> > > > ?
> > > > ?
> > > > FEATURES             Location/Qualifiers
> > > > ?
> > > >      gene            1..1437
> > > >                      /gene="cmg2a"
> > > >      CDS             1..1437
> > > >                      /gene="cmg2a"
> > > >                      /note="cell surface receptor; similar to anthrax toxin
> > > >                      receptor 2 (ANTXR2, ATR2, CMG2)"
> > > >                      /codon_start=1
> > > >                      /product="capillary morphogenesis protein 2A"
> > > >                      /protein_id="ABD74633.1"
> > > >                      /db_xref="GI:89513613"
> > > >                      /translation="MTKENLWSVATTATLFFCLCFSSFKAETPSCHGAYDLYFVLDRS
> > > >                      GSVSTDWSEIYDFVKNLTERFVSPNLRVSFIVFSSRAEIVLPLTGDRSEINKGLKTLS
> > > >                      EVNPAGETYMHEGIKLATEQMKKEPKKSSSIIVALTDGKLETYIHQLTIDEADSARKY
> > > >                      GARVYCVGVKDFDEEQLADVADSKEQVFPVKGGFQALKGIVNSILKQSCTEILTVEPS
> > > >                      SVCVNQSFDIVLRGNGFAVGRQTEGVICSFIVDGVTYKQKPTKVKIDYILCPAPVLYT
> > > >                      VGQQMEVLISLNSGTSYITSAFIITASSCSDGTVVAIVFLVLFLLLALALMWWFWPLC
> > > >                      CTVVIKDPPPQRPPPPPPKLEPDPEPKKKWPTVDASYYGGRGAGGIKRMEVRWGEKGS
> > > >                      TEEGARLEMAKNAVVSIQEESEEPMVKKPRAPAQTCHQSESKWYTPIRGRLDALWALL
> > > >                      RRQYDRVSVMRPTSADKGRCMNFSRTQH"
> > > > ORIGIN
> > > >         1 atgacaaagg aaaatctctg gagcgtggca accacggcga ctcttttctt ctgtttatgc
> > > >        61 ttttcatctt ttaaagcgga aaccccatct tgtcatggtg cctacgacct gtactttgtg
> > > >       121 ttggaccgat ctggaagtgt ttcgactgac tggagtgaaa tctatgactt tgtcaaaaat
> > > >       181 cttacagaga gatttgtgag tccaaatctg cgagtgtcct tcattgtttt ttcatcaaga
> > > >       241 gcagagattg tgttaccgct cactggagac aggtcagaaa ttaataaagg cctgaagacc
> > > >       301 ttaagtgagg tcaatccagc tggagaaaca tacatgcatg aaggaattaa attggcaact
> > > >       361 gaacaaatga aaaaagagcc taaaaagtcc tctagtatta ttgtggcctt gactgatgga
> > > >       421 aagcttgaaa cgtatatcca tcaactcact attgacgagg ctgattcagc aaggaagtat
> > > >       481 ggggctcgtg tgtactgtgt tggtgtaaaa gactttgatg aagaacagct agccgatgtg
> > > >       541 gctgattcca aggagcaagt gttcccagtc aaaggaggct ttcaggctct caaaggcatc
> > > >       601 gttaactcga tcctcaagca atcatgcacc gaaatcctaa cagtggaacc gtccagcgtc
> > > >       661 tgcgtgaacc agtcctttga cattgttttg agagggaacg ggttcgcagt ggggagacaa
> > > >       721 acagaaggag tcatctgcag tttcatagtg gatggagtta cttacaaaca aaaaccaacc
> > > >       781 aaagtgaaga ttgactacat cctatgtcct gctccagtgc tgtatacagt tggacagcaa
> > > >       841 atggaggttc tgatcagttt gaacagtgga acatcatata tcaccagtgc tttcatcatc
> > > >       901 actgcctctt catgttcgga cggcacagtg gtggccattg tgttcttggt gctttttctc
> > > >       961 ctgttggctt tggctctgat gtggtggttc tggcctctat gctgcactgt cgttattaaa
> > > >      1021 gacccacctc cacaaagacc tcctccacct ccacctaagc tagagccaga cccggaaccc
> > > >      1081 aagaagaagt ggccaactgt ggatgcatct tactatgggg gaagaggagc tggtggaatc
> > > >      1141 aaacgcatgg aggtccgttg gggagaaaaa gggtctacag aggaaggtgc aagactagag
> > > >      1201 atggctaaga atgcagtagt gtcaatacaa gaggaatcag aagaacccat ggtcaaaaag
> > > >      1261 ccaagagcac ctgcacaaac atgccatcaa tctgaatcca agtggtatac accaatcaga
> > > >      1321 ggccgtcttg acgcactgtg ggctcttttg cggcggcaat atgaccgagt ttcagttatg
> > > >      1381 cgaccaactt ctgcagataa gggtcgctgt atgaatttca gtcgcacgca gcattaa
> > > > //
> > > >
> > > > ============================
> > > >
> > > >
> > > > On 6/5/06, Richard Holland <[EMAIL PROTECTED]> wrote:
> > > > > Hi again.
> > > > >
> > > > > Could you remove the offending question mark from the GenBank file and
> > > > > try it again to see if that fixes it? The parser should just ignore it
> > > > > but apparently not. The error looks weird to me because the tokenization
> > > > > for a DNA GenBank file _does_ contain the letter 't'! Not sure what's
> > > > > going on here.
> > > > ...
> > > > >
> > > > > cheers,
> > > > > Richard
> > > > >
> > > > > On Mon, 2006-06-05 at 10:37 -0400, Seth Johnson wrote:
> > > > > > Hello again Richard,
> > > > > >
> > > > > > No sooner had I mentioned the fix for the last parsing exception than
> > > > > > another one came up with the Genbank format:
> > > > > > --------------------------------------
> > > > > > org.biojava.bio.seq.io.ParseException: DQ431065
> > > > > > org.biojava.bio.BioException: Could not read sequence
> > > > > >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > > > > >     at exonhit.parsers.GenBankParser.getGBSequences(GenBankParser.java:151)
> > > > > >     at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:246)
> > > > > >     at exonhit.parsers.GenBankParser.main(GenBankParser.java:326)
> > > > > > Caused by: org.biojava.bio.seq.io.ParseException: DQ431065
> > > > > >     at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245)
> > > > > >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > > > > >     ... 3 more
> > > > > > org.biojava.bio.seq.io.ParseException:
> > > > > > org.biojava.bio.symbol.IllegalSymbolException: This tokenization
> > > > > > doesn't contain character: 't'
> > > > > > ----------------------------------------
> > > > > > The Genbank file that caused it is as follows:
> > > > > > =========================================
> > > > > > LOCUS       DQ431065                 425 bp    DNA     linear   INV 01-JUN-2006
> > > > > > DEFINITION  Reticulitermes sp. ALS-2006c 16S ribosomal RNA gene, partial
> > > > > >             sequence; mitochondrial.
> > > > > > ACCESSION   DQ431065
> > > > > > VERSION     DQ431065.1  GI:90102206
> > > > > > KEYWORDS    .
> > > > > > SOURCE      Vaccinium corymbosum
> > > > > >   ORGANISM  Vaccinium corymbosum
> > > > > >             Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
> > > > > >             Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
> > > > > >             asterids; Ericales; Ericaceae; Vaccinioideae; Vaccinieae;
> > > > > >             Vaccinium.
> > > > > > ?
> > > > > > REFERENCE   2  (bases 1 to 425)
> > > > > >   AUTHORS   Naik,L.D. and Rowland,L.J.
> > > > > >   TITLE     Expressed Sequence Tags of cDNA clones from subtracted library of
> > > > > >             Vaccinium corymbosum
> > > > > >   JOURNAL   Unpublished (2005)
> > > > > > FEATURES             Location/Qualifiers
> > > > > >      source          1..425
> > > > > >                      /organism="Vaccinium corymbosum"
> > > > > >                      /mol_type="genomic DNA"
> > > > > >                      /cultivar="Bluecrop"
> > > > > >                      /db_xref="taxon:69266"
> > > > > >                      /tissue_type="Flower buds"
> > > > > >                      /clone_lib="Subtracted cDNA library of Vaccinium
> > > > > >                      corymbosum"
> > > > > >                      /dev_stage="399 hour chill unit exposure"
> > > > > >                      /note="Vector: pCR4TOPO; Site_1: Eco R I; Site_2: Eco R I"
> > > > > >      rRNA            <1..>425
> > > > > >                      /product="16S ribosomal RNA"
> > > > > > ORIGIN
> > > > > >         1 cgcctgttta tcaaaaacat cttttcttgt tagtttttga agtatggcct gcccgctgac
> > > > > >        61 tttagtgttg aagggccgcg gtattttgac cgtgcaaagg tagcatagtc attagttctt
> > > > > >       121 taattgtgat ctggtatgaa tggcttgacg aggcatgggc tgtcttaatt ttgaattgtt
> > > > > >       181 tattgaattt ggtctttgag ttaaaattct tagatgtttt tatgggacga gaagacccta
> > > > > >       241 tagagtttaa catttattat ggtccttttc tgtttgtgag ggctcactgg gccgtctaat
> > > > > >       301 atgttttgtt ggggtgatgg gagggaataa tttaacccct cctttttatt attatattta
> > > > > >       361 tttatattta tttgatccat ttattttgat tgtaagatta aattacctta gggataacag
> > > > > >       421 cgtaa
> > > > > > //
> > > > > > ==================================
> > > > > > I think it's the presence of the '?' at the beginning of the line?!?!
> > > > > > I'm not sure whether the information that was supposed to be present
> > > > > > instead of those question marks is absent from the original ASN.1
> > > > > > batch file or it's a bug in the NCBI ASN2GO software. It looks to me
> > > > > > that the former is the case, since the file from the NCBI website contains
> > > > > > much more information than the batch file. Just bringing this to
> > > > > > everyone's attention.
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best Regards,
> > > > > >
> > > > > >
> > > > > > Seth Johnson
> > > > > > Senior Bioinformatics Associate
> > > > > >
> > > > > > Ph: (202) 470-0900
> > > > > > Fx: (775) 251-0358
> > > > > >
> > > > > > On 6/2/06, Richard Holland <[EMAIL PROTECTED]> wrote:
> > > > > > > Hi Seth.
> > > > > > >
> > > > > > > Your second point, about the authors string not being read correctly in
> > > > > > > Genbank format, has been fixed (or should have been if I got the code
> > > > > > > right!). Could you check the latest version of biojava-live out of CVS
> > > > > > > and give it another go? Basically the parser did not recognise the
> > > > > > > CONSRTM tag, as it is not mentioned in the sample record provided by
> > > > > > > NCBI, which is what I based the parser on.
> > > > > > ...
> > > > > > >
> > > > > > > cheers,
> > > > > > > Richard
> > > > > > >
> > > > >
> > > > > --
> > > > > Richard Holland (BioMart Team)
> > > > > EMBL-EBI
> > > > > Wellcome Trust Genome Campus
> > > > > Hinxton
> > > > > Cambridge CB10 1SD
> > > > > UNITED KINGDOM
> > > > > Tel: +44-(0)1223-494416
> > > > >
> > >
> > > --
> > > Richard Holland (BioMart Team)
> > > EMBL-EBI
> > > Wellcome Trust Genome Campus
> > > Hinxton
> > > Cambridge CB10 1SD
> > > UNITED KINGDOM
> > > Tel: +44-(0)1223-494416
> > >

--
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416

_______________________________________________
Biojava-l mailing list - [email protected]
http://lists.open-bio.org/mailman/listinfo/biojava-l
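
The pre-processing workaround suggested earlier in the thread (stripping the stray '?' lines before the file reaches BioJavaX) could look roughly like the sketch below. QuestionMarkFilter is a made-up name and is not part of BioJava. It assumes the unwanted lines contain nothing but a single '?' and optional whitespace, as in the two GenBank records quoted above; it does not help with the INSDseq XML case, where the '?' is element content.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of BioJava.
class QuestionMarkFilter {

    /** Returns the input with every line that is just '?' (plus whitespace) dropped. */
    static String strip(Reader in) {
        StringBuilder out = new StringBuilder();
        try (BufferedReader br = new BufferedReader(in)) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.trim().equals("?")) {
                    continue; // stray placeholder line, skip it
                }
                out.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    /** Wraps the cleaned text in a BufferedReader for whatever parser comes next. */
    static BufferedReader filtered(Reader in) {
        return new BufferedReader(new StringReader(strip(in)));
    }

    public static void main(String[] args) {
        String record = "KEYWORDS    .\n?\nFEATURES             Location/Qualifiers\n?\n//\n";
        System.out.print(strip(new StringReader(record)));
    }
}
```

The reader returned by filtered() can then be handed to whichever BioJavaX reading call you already use. A '?' that appears inside a longer line, such as a qualifier value, is left alone.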
