I think these properties should be going to the (Rich)Annotation bundle. - Mark
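Until the parser stores these tags, one stopgap is to read the SOURCE and ORGANISM lines straight from the flat-file text, which avoids a remote lookup by taxon ID. The following is only a minimal sketch under that assumption: the class `GenbankOrganismSketch` and its methods `tagValue` and `lineage` are hypothetical names made up for illustration and are not part of BioJava or BioJavaX. It also assumes the organism name fits on a single ORGANISM line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; NOT part of BioJava/BioJavaX.
class GenbankOrganismSketch {

    /** Returns the text after a header tag such as SOURCE or ORGANISM, or null if absent. */
    static String tagValue(String record, String tag) {
        for (String line : record.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.startsWith(tag + " ")) {
                return trimmed.substring(tag.length()).trim();
            }
            if (trimmed.startsWith("FEATURES")) {
                break; // the header section ends where the feature table starts
            }
        }
        return null;
    }

    /** Returns the taxonomic lineage listed on the indented lines below ORGANISM. */
    static List<String> lineage(String record) {
        StringBuilder joined = new StringBuilder();
        boolean inOrganism = false;
        for (String line : record.split("\\r?\\n")) {
            if (inOrganism) {
                // Continuation lines are indented; a new top-level tag is not.
                if (line.isEmpty() || !Character.isWhitespace(line.charAt(0))) {
                    break;
                }
                joined.append(line.trim()).append(' ');
            } else if (line.trim().startsWith("ORGANISM")) {
                inOrganism = true;
            }
        }
        List<String> taxa = new ArrayList<>();
        for (String part : joined.toString().split(";")) {
            String name = part.trim();
            if (name.endsWith(".")) {
                name = name.substring(0, name.length() - 1); // drop the final full stop
            }
            if (!name.isEmpty()) {
                taxa.add(name);
            }
        }
        return taxa;
    }

    public static void main(String[] args) {
        String record =
            "SOURCE      House mouse\n"
          + "  ORGANISM  Mus musculus\n"
          + "            Eukaryota; Metazoa; Chordata;\n"
          + "            Mammalia; Mus.\n"
          + "FEATURES             Location/Qualifiers\n";
        System.out.println(tagValue(record, "SOURCE"));
        System.out.println(tagValue(record, "ORGANISM"));
        System.out.println(lineage(record));
    }
}
```

Here `tagValue(record, "SOURCE")` gives the common name and `lineage(record)` gives the name hierarchy as a list, which is roughly the information asked about below.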
Morgane THOMAS-CHOLLIER <[EMAIL PROTECTED]>
Sent by: [EMAIL PROTECTED]
02/15/2006 04:56 PM

To: biojava-l@biojava.org
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: Re: [Biojava-l] Genbank parser error [biojavax]

Hello again,

I have continued using the Genbank parser, but this time with Genbank files coming from NCBI :)
I really appreciate the example from the documentation that converts a Genbank file into an EMBL file. I have to say, it is really easy to use.

I nevertheless have a question concerning the Organism and Source tags. Indeed, it is clear in the documentation that they are ignored by the parser. But I do not really understand why.

When I used the Genbank file of the accession numbers: AC147788 and DQ158013, I was unable to get the common name of the organism or use getNameHierarchy(), but I can get the taxon ID for both.

Is there a way to get the common name of the organism, without using a remote call to the NCBI with the taxonID?

Thanks for your help,

Morgane.

Morgane THOMAS-CHOLLIER wrote:
> Hello Mark,
>
> My file is indeed too large to be posted.
> So I have exported a smaller sequence from Ensembl that I tested with the parser. The behavior is the same.
> You will find below this "Genbank" formatted file enclosed.
>
> Thanks for your help,
>
> Morgane.
>
> LOCUS       6 3498 bp DNA HTG 14-FEB-2006
> DEFINITION  Mus musculus chromosome 6 NCBIM34 partial sequence
>             52305503..52309000 reannotated via EnsEMBL
> ACCESSION   chromosome:NCBIM34:6:52305503:52309000:1
> VERSION     chromosome:NCBIM34:6:52305503:52309000:1
> KEYWORDS    .
> SOURCE      House mouse
>   ORGANISM  Mus musculus
>             Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
>             Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
>             Sciurognathi; Muridae; Murinae; Mus.
> COMMENT     This sequence was annotated by the Ensembl system. Please visit the
>             Ensembl web site, http://www.ensembl.org/ for more information.
> COMMENT     All feature locations are relative to the first (5') base of the
>             sequence in this file. The sequence presented is always the
>             forward strand of the assembly. Features that lie outside of the
>             sequence contained in this file have clonal location coordinates in
>             the format: .:..
> COMMENT     The /gene indicates a unique id for a gene,
>             /note="transcript_id=..." a unique id for a transcript, /protein_id
>             a unique id for a peptide and note="exon_id=..." a unique id for an
>             exon. These ids are maintained wherever possible between versions.
> COMMENT     All the exons and transcripts in Ensembl are confirmed by
>             similarity to either protein or cDNA sequences.
> FEATURES             Location/Qualifiers
>      source          1..3498
>                      /organism="Mus musculus"
>                      /db_xref="taxon:10090"
>      gene            complement(506..2826)
>                      /gene=ENSMUSG00000014704
>      mRNA            join(complement(2261..2826),complement(506..1620))
>                      /gene="ENSMUSG00000014704"
>                      /note="transcript_id=ENSMUST00000014848"
>      CDS             join(complement(2261..2639),complement(881..1620))
>                      /gene="ENSMUSG00000014704"
>                      /protein_id="ENSMUSP00000014848"
>                      /note="transcript_id=ENSMUST00000014848"
>                      /db_xref="MarkerSymbol:Hoxa2"
>                      /db_xref="Uniprot/SWISSPROT:HXA2_MOUSE"
>                      /db_xref="RefSeq_peptide:NP_034581.1"
>                      /db_xref="RefSeq_dna:NM_010451.1"
>                      /db_xref="Uniprot/SPTREMBL:Q3UYP9_MOUSE"
>                      /db_xref="Uniprot/SPTREMBL:Q920T7_MOUSE"
>                      /db_xref="Uniprot/SPTREMBL:Q920T9_MOUSE"
>                      /db_xref="Uniprot/SPTREMBL:Q920U0_MOUSE"
>                      /db_xref="Uniprot/SPTREMBL:Q920U1_MOUSE"
>                      /db_xref="Uniprot/SPTREMBL:Q920U2_MOUSE"
>                      /db_xref="Uniprot/SPTREMBL:Q920U3_MOUSE"
>                      /db_xref="Uniprot/SPTREMBL:Q920U4_MOUSE"
>                      /db_xref="Uniprot/SPTREMBL:Q920U5_MOUSE"
>                      /db_xref="EntrezGene:15399"
>                      /db_xref="AgilentProbe:A_51_P501803"
>                      /db_xref="EMBL:AB039184"
>                      /db_xref="EMBL:AB039185"
>                      /db_xref="EMBL:AB039186"
>                      /db_xref="EMBL:AB039187"
>                      /db_xref="EMBL:AB039188"
>                      /db_xref="EMBL:AB039189"
>                      /db_xref="EMBL:AB039190"
>                      /db_xref="EMBL:AB039191"
>                      /db_xref="EMBL:AB039192"
>                      /db_xref="EMBL:AK134501"
>                      /db_xref="EMBL:M87801"
>                      /db_xref="EMBL:M93148"
>                      /db_xref="EMBL:M93292"
>                      /db_xref="EMBL:M95599"
>                      /db_xref="GO:GO:0003700"
>                      /db_xref="GO:GO:0005634"
>                      /db_xref="GO:GO:0006355"
>                      /db_xref="GO:GO:0007275"
>                      /db_xref="IPI:IPI00132242.1"
>                      /db_xref="UniGene:Mm.131"
>                      /db_xref="protein_id:AAA37827.1"
>                      /db_xref="protein_id:AAA37834.1"
>                      /db_xref="protein_id:AAA37835.1"
>                      /db_xref="protein_id:AAA37836.1"
>                      /db_xref="protein_id:BAB68708.1"
>                      /db_xref="protein_id:BAB68709.1"
>                      /db_xref="protein_id:BAB68710.1"
>                      /db_xref="protein_id:BAB68711.1"
>                      /db_xref="protein_id:BAB68712.1"
>                      /db_xref="protein_id:BAB68713.1"
>                      /db_xref="protein_id:BAB68714.1"
>                      /db_xref="protein_id:BAB68715.1"
>                      /db_xref="protein_id:BAB68716.1"
>                      /db_xref="protein_id:BAE22163.1"
>                      /db_xref="AFFY_MG_U74Av2:102643_at"
>                      /db_xref="AFFY_MG_U74Cv2:171063_at"
>                      /db_xref="AFFY_Mouse430A_2:1419602_at"
>                      /db_xref="AFFY_Mouse430_2:1419602_at"
>                      /translation="MNYEFEREIGFINSQPSLAECLTSFPPVADTFQSSSIKTSTLSH
>                      STLIPPPFEQTIPSLNPGSHPRHGAGVGGRPKSSPAGSRGSPVPAGALQPPEYPWMKE
>                      KKAAKKTALPPAAASTGPACLGHKESLEIADGSGGGSRRLRTAYTNTQLLELEKEFHF
>                      NKYLCRPRRVEIAALLDLTERQVKVWFQNRRMKHKRQTQCKENQNSEGKFKNLEDSDK
>                      VEEDEEEKSLFEQALSVSGALLEREGYTFQQNALSQQQAPNGHNGDSQTFPVSPLTSN
>                      EKNLKHFQHQSPTVPNCLSTMGQNCGAGLNNDSPEAIEVPSLQDFNVFSTDSCLQLSD
>                      ALSPSLPGSLDSPVDISADSFDFFTDTLTTIDLQHLNY"
>      exon            complement(506..1620)
>                      /note="exon_id=ENSMUSE00000387033"
>      exon            complement(2261..2826)
>                      /note="exon_id=ENSMUSE00000193269"
> BASE COUNT  938 a 815 c 882 g 863 t
> ORIGIN
>         1 AGGAAGAGTT GGAACGTAGA TGTTTGAAAC AAATGTGTAT AAATAAATGA ATTTTTGATA
>        61 ACTCCGTTAT TGACCTAGAA ACTAGCAGCT TGGTAAGGGA ACTCCATTCC ACTCCACTCG
>       121 TCCTAGAACT GGAAGTTTTT GTAGGCACTT TTCCTCTCCA CACTCAAAAG CTTGGGCTAG
>       181 GGCCAACTCA GGCTGCCCAA GCCCATTTCT ATTACTAATG TAACTCTATG GCCTGAGTCT
>       241 CAACACTGAA AACCAAATTC ATTCCCTTAG GGGGGAAAAA TCCAAAAAAA AAAAAAAAAA
>       301 AAGTCTTGCC AGAAGCCCTA GCACTTTCTG GTTTTCTTCT TTGTTGCTGT TTGTTGCAGG
>       361 CTTTGAACAT GCCACCCTAA TAAAATATAT TAAGATTGAA AAGTAAATTG TGACCAGACT
>       421 TTTATTTACC ATGTTAGACT AAAAGAAGTA TAAGAAATCA GTATGAGTCT TGAGAAAGAG
>       481 GGGAAGAAAA AAATAAGAAA GCTACTTATA GCAAAGGAGA ATTTATTCTA CCAAAAATAC
>       541 GCATGACAAT GCATTCTAAT GTGGTACAAA AATAAACAGA AAGTGACAAG ACAATTTATG
>       601 GTCACTTTCT TGCAGGCCTC CTGTTTTGTT TTTCAGGAAA ATCACATAGA AGCTTGTTGG
>       661 GTTCTGTGTA AAAACCACTT AGAACGCCAA CATAATTTGC AAGAGATGGC TTTAAAACTG
>       721 TGTCAGGGGA GAACATTAAA CGGAAAGTCC TCAACATTTG AGAGAGTAGG GGTAGATCAA
>       781 GAAGAAACTA AAACGAAAAT CAACTCCCAG AATAAAAGAA GGCAAAGCCA CCTGGTCAAA
>       841 GGCGTTTTGT TTTGTGAAGC TTTGTTTTGC TTTAATGTTC TTAGTAATTC AGATGCTGTA
>       901 GGTCGATTGT GGTGAGTGTG TCTGTAAAAA AGTCAAAGCT GTCAGCTGAG ATATCTACAG
>       961 GACTGTCCAG GGAGCCAGGC AAGCTGGGCG ACAGTGCATC TGAAAGCTGC AGGCAGGAAT
>      1021 CTGTGGAGAA AACATTGAAG TCCTGCAAAG AGGGGACCTC GATGGCCTCG GGACTGTCAT
>      1081 TGTTTAGGCC AGCTCCACAG TTCTGGCCCA TTGTTGACAA GCAGTTAGGA ACAGTGGGTG
>      1141 ACTGGTGCTG AAAATGTTTC AAATTTTTCT CATTGCTGGT TAAAGGCGAA ACTGGGAAAG
>      1201 TTTGGGAGTC GCCATTGTGT CCATTGGGAG CCTGCTGTTG AGAGAGCGCA TTTTGCTGAA
>      1261 AAGTGTACCC TTCCCTCTCC AGAAGGGCCC CGGAGACACT GAGGGCTTGC TCAAAGAGTG
>      1321 ACTTCTCTTC CTCGTCTTCC TCCACTTTGT CCGAGTCCTC CAGGTTTTTA AATTTCCCTT
>      1381 CGCTGTTTTG GTTCTCCTTG CACTGGGTTT GCCTCTTATG CTTCATTCTC CGGTTCTGAA
>      1441 ACCACACTTT CACTTGTCTC TCGGTCAAAT CCAGCAGCGC GGCGATTTCC ACCCTGCGGG
>      1501 GTCTGCAAAG GTACTTGTTG AAATGAAATT CCTTTTCCAG CTCCAAAAGC TGAGTGTTGG
>      1561 TGTACGCGGT TCTCAGACGC CTGGATCCCC CGCCGCTGCC ATCAGCTATT TCCAGGGATT
>      1621 CTGCAGAAAG GGAAACCAAC AAGAGACACA CATACAGTTG AAGGTGGAAG GGTCCGAGCA
>      1681 GGGTTATTCC ATTGGAGCAT AAATACAGCA GAAAAGATCA ACTGCAACAA AATGGCCGCC
>      1741 CCTGGATGCA GTGCAGCTAT TGTGCTGCCC TTCCTGGGAG CCCAGCCCGG GGAAGCCCAG
>      1801 TCTCTTCCAC CTCCATCAAA TTCCTGCCTG TGGCTTCCCC CAACCTCTTC ATCCGGGAGC
>      1861 AAACTTTATA TTAGCTACAA CACAATTTAT AATTAATGCA TCAGCTGCTT AGCTGAGCAA
>      1921 GAGCGGTCTA TCACTCTTCA TTACTGTCAA AAAGCCAAAC TCTAGGACAA CTAGACAAGA
>      1981 GGAGGTCAGT TCCAACTCAA ATAAATCATC CTACATTACA CAAGTTAGGG AAAGTGCCCC
>      2041 CCCTCCTCAA AATATATATG TCTCATTGTG GGACTCGGGA TCTATTTTCC CCTCCACCAA
>      2101 ACCCACTCCT GAGACCACAG GGGCATGAGA CCCGCCACCA GGCATCTCTC TCTCTCCCCC
>      2161 TTCCCTCGAA GCTCATGGTC CCCTCCCCCA CAACCGCTCC TAGGGAAGCC CGGAGGGGGA
>      2221 CAAGGGTCCC CGAGACCTGG GGCCAAGTCT CCGGACTGAC CTTTGTGGCC GAGGCAGGCA
>      2281 GGGCCCGTGG AGGCGGCGGC GGGCGGCAGC GCGGTTTTCT TGGCCGCCTT CTTCTCCTTC
>      2341 ATCCAGGGAT ACTCAGGCGG CTGCAGGGCG CCGGCAGGCA CCGGGCTGCC GCGACTGCCC
>      2401 GCGGGGCTCG ACTTGGGGCG GCCGCCAACG CCAGCGCCGT GGCGAGGGTG ACTGCCCGGG
>      2461 TTCAGGCTGG GAATGGTCTG CTCAAAAGGA GGAGGAATCA GTGTCGAGTG TGAAAGCGTC
>      2521 GAGGTCTTGA TTGATGAACT TTGAAATGTA TCAGCGACAG GGGGAAAAGA TGTCAGGCAC
>      2581 TCAGCGAGCG ACGGCTGGCT ATTGATAAAA CCAATCTCTC GCTCAAATTC GTAATTCATG
>      2641 GCCTTCTCCT TGGAGCCCCC TCGGAGGAAA AGTTCCCTCT TTTGGAGGGG CTTTGGGGGG
>      2701 GCAAGGCCCA GGAAAAAGGC GAGCGCGAAG GAAAAAAAAA TCTATCATAG AAGATCGCTG
>      2761 CTGGGGTGTT TTTTTTCTAA TTCACTGATT ACAGCCGTAT GGGGACCGCG CTACTATTAA
>      2821 ACTATTGAAT TCATGGAGAC AAGGTTGAAA TTGGACCGAA TTGGCTGTCA CATGATTGCT
>      2881 TCTGCCCAAT GACAATTTGG GCTTTAATCA AAAGAAGCCA CTGTCTGTTT GATTGATCCA
>      2941 AAAAAGTCAG AAAGGAACGC CTCATTGGGG GCCATCGAGG CTTTATTTAC ACTTTTTTTC
>      3001 AGGGCAAAAA TACATATATG TGGGTGTGGA TGGCAATGCC CCGGGAGTGC GTGGGGGGCG
>      3061 AGAGTGCCTG TTTGCCTCCT GATCTGCAAG GATCTAGTGT GCTCCCTGGA GTGTGTGTGT
>      3121 GAGTGTGTGC GTGTGAGCCC TGCTGCCGTC CCGCCAGTGG CTGCCCTCTG CCTCCCCCGC
>      3181 ACACTCCGCG CATTGTTTGG GACTGTCGGG AAGACGCCTC GCACCTCACA AATCATTTAA
>      3241 GCACCTCAGC CTGACGCCTG CAGTCATTAA CAAAGTAATC CATTAATCTT CAAAGTTTTG
>      3301 ACACCCCAGG GCCCTGCATC TCAGCCACAT AAGTTCTGCT AAGGCAAGAG AAAGGAGCAG
>      3361 AGTGGGAGAG AGAGAGGAGA GAGGGAGAGA GGGAGAGAGG GAGAGAGAGA GAGAGAGAGA
>      3421 GAGAGAGAGA GAGAGAGAGA GAGAGAATGA ATATTGGGGT TCACCTTTCC TCTTCCTCCT
>      3481 CTTTTTCCAA AATCAGTT
> //
>
> [EMAIL PROTECTED] wrote:
>
>> Hi Morgane -
>>
>> I have to say that doesn't look much like Genbank : )
>>
>> The biojavax parsers are possibly a bit brittle due to their use of
>> regexps to recognize key elements. It should be fixable, I think the
>> problem is that the parser expects a word after LOCUS not a number.
>> This may not be the only problem though. Could you post the entire
>> file? Or if it is large then a representative file of smaller size.
>>
>> - Mark
>>
>> Morgane THOMAS-CHOLLIER <[EMAIL PROTECTED]>
>> Sent by: [EMAIL PROTECTED]
>> 02/14/2006 04:36 AM
>>
>> To: biojava-l@biojava.org
>> cc: (bcc: Mark Schreiber/GP/Novartis)
>> Subject: [Biojava-l] Genbank parser error [biojavax]
>>
>> Hello,
>>
>> I have tried biojavax today with a view to using the Genbank file parser.
>>
>> My test file is a Genbank formatted file which has been produced by
>> the Ensembl export system.
>>
>> The head of the file is as follows:
>>
>> LOCUS       6 489671 bp DNA HTG 13-FEB-2006
>> DEFINITION  Mus musculus chromosome 6 NCBIM34 partial sequence
>>             52296503..52786173 reannotated via EnsEMBL
>> ACCESSION   chromosome:NCBIM34:6:52296503:52786173:1
>> VERSION     chromosome:NCBIM34:6:52296503:52786173:1
>>
>> I used the code provided in the biojavax docbook to parse this file.
>> I get the following error:
>>
>> Exception in thread "main" org.biojava.bio.BioException: Could not read sequence
>>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:111)
>>     at org.embnet.be.biojavax.tryout.GenbankParseTest.main(GenbankParseTest.java:31)
>> Caused by: org.biojava.bio.seq.io.ParseException: Bad locus line found: 6 489671 bp DNA HTG 13-FEB-2006
>>     at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:229)
>>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:108)
>>     ... 1 more
>>
>> I had a look at GenbankFormat.java, and I guess the problem comes
>> from the regular expression, which does not recognize the LOCUS line
>> as a standard Genbank file LOCUS tag.
>>
>> Am I wrong? Has the biojavax Genbank parser been tested on Ensembl
>> exported files?
>>
>> Morgane.
>>
>

--
**********************************************************
Morgane THOMAS-CHOLLIER, PHD Student ([EMAIL PROTECTED])
Vrije Universiteit Brussels (VUB)
Laboratory of Cell Genetics
Pleinlaan 2
1050 Brussels
Belgium
_______________________________________________
Biojava-l mailing list - Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l
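
The LOCUS-line diagnosis earlier in this thread (a regexp that expects a word, not a number, after LOCUS) can be illustrated with a small stand-alone sketch. Both patterns below are hypothetical and written only for this example; they are not the regular expression used in GenbankFormat.java, and the class name `LocusLineSketch` is likewise made up. The input is the text reported in the "Bad locus line found" message above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; NOT the pattern used by BioJavaX.
class LocusLineSketch {

    // Assumed strict form: the sequence name must begin with a letter.
    static final Pattern STRICT =
        Pattern.compile("^([A-Za-z]\\S*)\\s+(\\d+)\\s+bp\\b.*$");

    // Assumed lenient form: any run of non-whitespace is accepted as the name.
    static final Pattern LENIENT =
        Pattern.compile("^(\\S+)\\s+(\\d+)\\s+bp\\b.*$");

    /** Returns {name, length} from the text after the LOCUS keyword, or null if no match. */
    static String[] parse(Pattern pattern, String locusBody) {
        Matcher m = pattern.matcher(locusBody.trim());
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String ensembl = "6 489671 bp DNA HTG 13-FEB-2006";
        System.out.println("strict matches:  " + (parse(STRICT, ensembl) != null));
        System.out.println("lenient matches: " + (parse(LENIENT, ensembl) != null));
    }
}
```

Under these assumptions the Ensembl line, whose name is the bare chromosome number 6, is rejected by the strict pattern and accepted by the lenient one. As noted in the thread, this may not be the only difference between Ensembl exports and NCBI files.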