Hi Mark,

Thank you for your suggestions. I followed them and it looks like I have found a bug that causes an exception in the readINSDseqDNA parser.
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=94481355

The problem with the above sequence in INSDseq format was caused by the presence of <INSDQualifier_name> tags without the corresponding <INSDQualifier_value> tags:

  <INSDQualifier>
    <INSDQualifier_name>environmental_sample</INSDQualifier_name>
  </INSDQualifier>

I have not checked whether it's handled correctly by other parsers when it is converted from the original NCBI ASN.1 format. Could the code be adjusted so that, if there is no <INSDQualifier_value> tag, it assumes the value to be 'null'?

Regards,
Seth

On 6/1/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Hi Seth -
>
> The BioJavaX parsers are still quite new and have not been heavily tested
> so your experiences can help us quite a lot. The parsers were initially
> designed to be quite strict and follow the GenBank etc specifications.
> However, there are often minor variations to those specs which cause
> things to break.
>
> To help us find the bugs can you make sure you are using the very latest
> version of biojava from CVS, for example I was under the impression that
> the author = null problem had been solved. In each case an example file
> and the full stack trace is very useful as well. In some cases you have
> provided these so we have a starting point.
>
> Also, if you have ideas on ways to fix the problems your suggestions would
> be greatly appreciated. We only have a very small team of active
> developers many of whom are unfortunately very busy just now.
>
> Hopefully we can get to this soon.
>
> - Mark
>
>
> "Seth Johnson" <[EMAIL PROTECTED]>
> Sent by: [EMAIL PROTECTED]
> 06/02/2006 06:03 AM
>
> To: [email protected]
> cc: (bcc: Mark Schreiber/GP/Novartis)
> Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
>
> Hi All,
>
> I'm a newbie to the whole BioJava(X) API and was hoping to get some
> clarification on several issues that I'm having.
> I am developing a parser that would take as input "NCBI Incremental
> ASN.1 Sequence Updates to Genbank" files (
> ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ) , gunzip them, and use the
> ASN2GB converter (
> ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert
> resulting sequences to a format parsable by BioJava(X) (
> http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where
> my problems start.
>
> ISSUE 1:
> I've tried to parse all of the formats that ASN2GB outputs ( GenBank
> (default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML),
> tiny seq (XML) ) using either BioJava or BioJavaX API. Only GenBank
> format is recognized by the
> "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with
> some exceptions that I'll describe in issue #2. This is the code that
> I'm using to parse, for example, the EMBL output:
>
> BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb"));
> Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
>         SimpleNamespace.class, new Object[]{"gbSpace"});
> RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf, gbNspace);
> while (gbSeqs.hasNext()) {
>     try {
>         RichSequence rs = gbSeqs.nextRichSequence();
>         // Further processing of RichSequence object from here
>     } catch (BioException be) {
>         be.printStackTrace();
>     }
> }
>
> The multi-sequence EMBL file looks like this:
> ---------------------------------------------------------------------------------
> ID DQ472184 standard; DNA; INV; 546 BP.
> XX
> AC DQ472184;
> XX
> SV DQ472184.1
> DT 15-MAY-2006
> XX
> DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene,
> DE complete cds.
> XX
> KW .
> XX
> OS Trypanosoma cruzi strain CL Brener
> OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
> OC Schizotrypanum.
> XX
> RN [1]
> RP 1-546
> RA De Melo L.D.B.;
> RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
> RL Unpublished.
> XX
> RN [2]
> RP 1-546
> RA De Melo L.D.B.;
> RT ;
> RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
> RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
> RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
> RL 21949-900, Brazil
> XX
> FH Key Location/Qualifiers
> FH
> FT source 1..546
> FT /organism="Trypanosoma cruzi strain CL Brener"
> FT /mol_type="genomic DNA"
> FT /strain="CL Brener"
> FT /db_xref="taxon:353153"
> FT gene <1..>546
> FT /gene="ARC21"
> FT /note="TcARC21"
> FT mRNA <1..>546
> FT /gene="ARC21"
> FT /product="actin-related protein 3"
> FT CDS 1..546
> FT /gene="ARC21"
> FT /note="actin-binding protein; ARPC3 21 kDa; putative
> FT member of Arp2/3 complex"
> FT /codon_start=1
> FT /product="actin-related protein 3"
> FT /protein_id="ABF13401.1"
> FT /db_xref="GI:93360014"
> FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG
> FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH
> FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL
> FT FPEKDGTGNKFWMAFAKRPFLASS"
> atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60
> cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120
> gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180
> cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240
> acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300
> tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360
> tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420
> aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480
> aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540
> agttag 546
> //
> ID DQ472185 standard; DNA; INV; 543 BP.
> XX
> AC DQ472185;
> XX
> SV DQ472185.1
> DT 15-MAY-2006
> XX
> DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene,
> DE complete cds.
> XX
> KW .
> XX
> OS Trypanosoma cruzi strain CL Brener
> OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
> OC Schizotrypanum.
> XX
> RN [1]
> RP 1-543
> RA De Melo L.D.B.;
> RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
> RL Unpublished.
> XX
> RN [2]
> RP 1-543
> RA De Melo L.D.B.;
> RT ;
> RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
> RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
> RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
> RL 21949-900, Brazil
> XX
> FH Key Location/Qualifiers
> FH
> FT source 1..543
> FT /organism="Trypanosoma cruzi strain CL Brener"
> FT /mol_type="genomic DNA"
> FT /strain="CL Brener"
> FT /db_xref="taxon:353153"
> FT gene <1..>543
> FT /gene="ARC20"
> FT /note="TcARC20"
> FT mRNA <1..>543
> FT /gene="ARC20"
> FT /product="actin-related protein 4"
> FT CDS 1..543
> FT /gene="ARC20"
> FT /note="actin-binding protein; ARPC4 20 kDa; putative
> FT member of Arp2/3 complex"
> FT /codon_start=1
> FT /product="actin-related protein 4"
> FT /protein_id="ABF13402.1"
> FT /db_xref="GI:93360016"
> FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH
> FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV
> FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA
> FT MKLNVNQRARRAAMEFFLALNFT"
> atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60
> tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120
> gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180
> cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240
> atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300
> ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360
> tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420
> attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480
> aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540
> tga 543
> //
> -----------------------------------------------------------------------
> I get an exception message "Could Not Read Sequence". Same thing
> happens if I use the readINSDSetDNA reader instead of readEMBLDNA one
> with the following INSDset file (beginning of the file):
>
> <?xml version="1.0"?>
> <!DOCTYPE INSDSeq PUBLIC "-//NCBI//INSD INSDSeq/EN" "INSD_INSDSeq.dtd">
> <INSDSeq>
>   <INSDSeq_locus>DQ022078</INSDSeq_locus>
>   <INSDSeq_length>16729</INSDSeq_length>
>   <INSDSeq_moltype>DNA</INSDSeq_moltype>
>   <INSDSeq_topology>linear</INSDSeq_topology>
>   <INSDSeq_division>ENV</INSDSeq_division>
>   <INSDSeq_update-date>15-MAY-2006</INSDSeq_update-date>
>   <INSDSeq_create-date>15-MAY-2006</INSDSeq_create-date>
>   <INSDSeq_definition>Uncultured bacterium WWRS-2005 putative
> aminoglycoside phosphotransferase (a3.001), putative oxidoreductase
> (a3.002), putative oxidoreductase (a3.003), putative beta-lactamase
> class C (estA3), putative permease (a3.005), putative transmembrane
> signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone
> acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative
> asparaginase (a3.010), hypothetical protein (a3.011), hypothetical
> protein (a3.012), putative membrane protease subunit (a3.013),
> putative haloalkane dehalogenase (a3.014), putative transcriptional
> regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and
> hypothetical protein (a3.017) genes, complete cds</INSDSeq_definition>
>   <INSDSeq_primary-accession>DQ022078</INSDSeq_primary-accession>
>   <INSDSeq_other-seqids>
>     <INSDSeqid>gb|DQ022078.1|</INSDSeqid>
>     <INSDSeqid>gi|71842722</INSDSeqid>
>   </INSDSeq_other-seqids>
>   <INSDSeq_keywords>
>     <INSDKeyword>ENV</INSDKeyword>
>   </INSDSeq_keywords>
>   <INSDSeq_references>
>     <INSDReference>
>       <INSDReference_reference>?</INSDReference_reference>
>       <INSDReference_position>1..16729</INSDReference_position>
>       <INSDReference_authors>
>         <INSDAuthor>Schmeisser,C.</INSDAuthor>
>         <INSDAuthor>Elend,C.</INSDAuthor>
>         <INSDAuthor>Streit,W.R.</INSDAuthor>
>       </INSDReference_authors>
>       <INSDReference_title>Isolation and biochemical characterization of two novel metagenome derived esterases</INSDReference_title>
>       <INSDReference_journal>Appl. Environ. Microbiol. 0:0-0 (2006)</INSDReference_journal>
>     </INSDReference>
>     <INSDReference>
>       <INSDReference_reference>?</INSDReference_reference>
>       <INSDReference_position>1..16729</INSDReference_position>
>       <INSDReference_authors>
>         <INSDAuthor>Schmeisser,C.</INSDAuthor>
>         <INSDAuthor>Elend,C.</INSDAuthor>
>         <INSDAuthor>Streit,W.R.</INSDAuthor>
>       </INSDReference_authors>
>       <INSDReference_journal>Submitted (29-APR-2005) to the
> EMBL/GenBank/DDBJ databases. Molekulare Enzymtechnologie, University
> Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057,
> Germany</INSDReference_journal>
>     </INSDReference>
>   </INSDSeq_references>
>
> So my question is whether the ASN2GB produces output that's
> incompatible with BioJava parsers or is there a problem with the
> sequence themselves or the problems with the majority of parsers?
> Could it be that I'm using the API wrongly for the above formats,
> although GenBank parser works as advertised with some exceptions
> below:
>
> ISSUE #2:
> When I try to parse GenBank files using the following code:
>
> BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb"));
> Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
>         SimpleNamespace.class, new Object[]{"gbSpace"});
> RichSequenceIterator gbSeqs = RichSequence.IOTools.readGenbankDNA(inBuf, gbNspace);
> while (gbSeqs.hasNext()) {
>     try {
>         RichSequence rs = gbSeqs.nextRichSequence();
>         // Further processing of RichSequence object from here
>     } catch (BioException be) {
>         be.printStackTrace();
>     }
> }
>
> Genbank file in question:
>
> LOCUS BC074905 838 bp mRNA linear PRI 15-APR-2006
> DEFINITION Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038
> IMAGE:30915482), complete cds.
> ACCESSION BC074905
> VERSION BC074905.2 GI:50959825
> KEYWORDS MGC.
> SOURCE Homo sapiens (human)
> ORGANISM Homo sapiens
> Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
> Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
> Catarrhini; Hominidae; Homo.
> REFERENCE 1 (bases 1 to 838)
> AUTHORS Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G.,
> Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D.,
> Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K.,
> Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F.,
> Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L.,
> Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L.,
> Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S.,
> Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J.,
> Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J.,
> McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S.,
> Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W.,
> Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A.,
> Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S.,
> Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y.,
> Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D.,
> Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M.,
> Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E.,
> Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A.
> CONSRTM Mammalian Gene Collection Program Team
> TITLE Generation and initial analysis of more than 15,000 full-length
> human and mouse cDNA sequences
> JOURNAL Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002)
> PUBMED 12477932
> REFERENCE 2 (bases 1 to 838)
> CONSRTM NIH MGC Project
> TITLE Direct Submission
> JOURNAL Submitted (25-JUN-2004) National Institutes of Health, Mammalian
> Gene Collection (MGC), Bethesda, MD 20892-2590, USA
> REMARK NIH-MGC Project URL: http://mgc.nci.nih.gov
> COMMENT On Aug 4, 2004 this sequence version replaced gi:49901832.
> Contact: MGC help desk
> Email: [EMAIL PROTECTED]
> Tissue Procurement: Genome Sequence Centre, British Columbia Cancer
> Center
> cDNA Library Preparation: British Columbia Cancer Research Center
> cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL)
> DNA Sequencing by: Genome Sequence Centre,
> BC Cancer Agency, Vancouver, BC, Canada
> [EMAIL PROTECTED]
> Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson
> Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen
> Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel
> Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave
> Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth
> Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao,
> Kim MacDonald, Mike R. Mayo, Josh Moran, Diana Palmquist, JR
> Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang,
> Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra.
>
> Clone distribution: MGC clone distribution information can be found
> through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov
> Series: IRBU Plate: 4 Row: C Column: 3.
>
> Differences found between this sequence and the human reference
> genome (build 36) are described in misc_difference features below.
> FEATURES Location/Qualifiers
> source 1..838
> /organism="Homo sapiens"
> /mol_type="mRNA"
> /db_xref="taxon:9606"
> /clone="MGC:104038 IMAGE:30915482"
> /tissue_type="Lung, PCR rescued clones"
> /clone_lib="NIH_MGC_273"
> /lab_host="DH10B"
> /note="Vector: pCR4 Topo TA with reversed insert"
> gene 1..838
> /gene="KLK14"
> /note="synonym: KLK-L6"
> /db_xref="GeneID:43847"
> /db_xref="HGNC:6362"
> /db_xref="IMGT/GENE-DB:6362"
> /db_xref="MIM:606135"
> CDS 49..804
> /gene="KLK14"
> /codon_start=1
> /product="KLK14 protein"
> /protein_id="AAH74905.1"
> /db_xref="GI:50959826"
> /db_xref="GeneID:43847"
> /db_xref="HGNC:6362"
> /db_xref="IMGT/GENE-DB:6362"
> /db_xref="MIM:606135"
> /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA
> GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN
> YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA
> SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV
> SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK"
> misc_difference 98
> /gene="KLK14"
> /note="'G' in cDNA is 'A' in the human genome; amino acid
> difference: 'R' in cDNA, 'Q' in the human genome."
> misc_difference 133
> /gene="KLK14"
> /note="'T' in cDNA is 'C' in the human genome; amino acid
> difference: 'Y' in cDNA, 'H' in the human genome."
> ORIGIN
> 1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg
> 61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag
> 121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg
> 181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact
> 241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg
> 301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac
> 361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg
> 421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga
> 481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc
> 541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg
> 601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct
> 661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc
> 721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt
> 781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc
> //
>
> I get the following exception:
>
> java.lang.IllegalArgumentException: Authors string cannot be null
> org.biojava.bio.BioException: Could not read sequence
>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
>     at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107)
>     at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258)
>     at exonhit.parsers.GenBankParser.main(GenBankParser.java:341)
> Caused by: java.lang.IllegalArgumentException: Authors string cannot be null
>     at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76)
>     at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356)
>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
>
> -----------------------------------------------------------------------
>
> I'm trying to see what could be the problem with this particular
> sequence. Looks to me like the AUTHORS portion is not getting parsed
> correctly. Any ideas would be greatly appreciated!
>
> --
> Best Regards,
>
> Seth Johnson
> Senior Bioinformatics Associate
>
> Ph: (202) 470-0900
> Fx: (775) 251-0358
> _______________________________________________
> Biojava-l mailing list - [email protected]
> http://lists.open-bio.org/mailman/listinfo/biojava-l

-- 
Best Regards,

Seth Johnson
Senior Bioinformatics Associate
Ph: (202) 470-0900
Fx: (775) 251-0358
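
P.S. To make the request about missing <INSDQualifier_value> tags concrete, here is a minimal sketch of the lenient behaviour I have in mind. It uses only the JDK's DOM parser, not the BioJavaX INSDseq code, so the names QualifierSketch, readQualifiers and childText are made up for illustration, and the small XML string in main is a hand-written fragment rather than real asn2gb output. I am only assuming the real parser could treat a missing value in a similar way.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of BioJava(X): collects qualifier names and values
// and tolerates qualifiers that carry a name but no value.
class QualifierSketch {

    // Returns qualifier name -> value; the value is null when
    // <INSDQualifier_value> is absent (e.g. environmental_sample).
    static Map<String, String> readQualifiers(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        Map<String, String> qualifiers = new LinkedHashMap<>();
        NodeList nodes = doc.getElementsByTagName("INSDQualifier");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element qualifier = (Element) nodes.item(i);
            String name = childText(qualifier, "INSDQualifier_name");
            String value = childText(qualifier, "INSDQualifier_value"); // may be null
            qualifiers.put(name, value);
        }
        return qualifiers;
    }

    // Text of the first descendant element with the given tag, or null if there is none.
    static String childText(Element parent, String tag) {
        NodeList found = parent.getElementsByTagName(tag);
        return found.getLength() == 0 ? null : found.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Illustrative fragment: the wrapper element and the second qualifier are assumptions.
        String xml =
            "<quals>"
          + "<INSDQualifier>"
          + "<INSDQualifier_name>environmental_sample</INSDQualifier_name>"
          + "</INSDQualifier>"
          + "<INSDQualifier>"
          + "<INSDQualifier_name>mol_type</INSDQualifier_name>"
          + "<INSDQualifier_value>genomic DNA</INSDQualifier_value>"
          + "</INSDQualifier>"
          + "</quals>";
        System.out.println(readQualifiers(xml));
    }
}
```

With this approach a value-less qualifier is still recorded under its name instead of aborting the whole record. Whether 'null' or an empty string is the better placeholder is something I would leave to whoever maintains the parser.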
