Hi Morgane - Turned out to be a problem with a greedy regexp parsing the LOCUS tag. This is fixed in CVS. Let me know if something else is a problem.
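A rough sketch of how a greedy quantifier can mis-split a locus line like the one in this thread. This is not the actual GenbankFormat code: the class name `LocusRegexDemo` and both patterns are made up for illustration, and the real fix in CVS may look quite different.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of BioJava and not the real parser pattern.
class LocusRegexDemo {
    // Greedy: (.+) takes as much as it can, leaving only one digit for the length.
    static final Pattern GREEDY = Pattern.compile("^(.+)\\s*(\\d+)\\s+bp");
    // Reluctant: (.+?) stops at the first point where the rest of the pattern matches.
    static final Pattern RELUCTANT = Pattern.compile("^(.+?)\\s*(\\d+)\\s+bp");

    // Returns {name, length}, or null when the line does not match.
    static String[] split(Pattern p, String locusLine) {
        Matcher m = p.matcher(locusLine);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        // Locus line as the parser sees it, with the LOCUS keyword already removed.
        String line = "6 3498 bp DNA HTG 14-FEB-2006";
        String[] g = split(GREEDY, line);
        String[] r = split(RELUCTANT, line);
        System.out.println("greedy:    name='" + g[0] + "' length=" + g[1]); // name='6 349' length=8
        System.out.println("reluctant: name='" + r[0] + "' length=" + r[1]); // name='6' length=3498
    }
}
```

With a purely numeric name such as "6", the greedy group runs straight into the sequence length, which is the kind of failure an Ensembl export can trigger even though a typical Genbank record parses fine.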
- Mark

Morgane THOMAS-CHOLLIER <[EMAIL PROTECTED]>
Sent by: [EMAIL PROTECTED]
02/14/2006 09:33 PM

To: biojava-l@biojava.org
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: Re: [Biojava-l] Genbank parser error [biojavax]

Hello Mark,

My file is indeed too large to be posted. So I have exported a smaller sequence from Ensembl that I tested with the parser. The behavior is the same.
You will find below this "Genbank" formatted file enclosed.

Thanks for your help,

Morgane.

LOCUS 6 3498 bp DNA HTG 14-FEB-2006
DEFINITION Mus musculus chromosome 6 NCBIM34 partial sequence 52305503..52309000 reannotated via EnsEMBL
ACCESSION chromosome:NCBIM34:6:52305503:52309000:1
VERSION chromosome:NCBIM34:6:52305503:52309000:1
KEYWORDS .
SOURCE House mouse
ORGANISM Mus musculus
 Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
COMMENT This sequence was annotated by the Ensembl system. Please visit the Ensembl web site, http://www.ensembl.org/ for more information.
COMMENT All feature locations are relative to the first (5') base of the sequence in this file. The sequence presented is always the forward strand of the assembly. Features that lie outside of the sequence contained in this file have clonal location coordinates in the format: .:..
COMMENT The /gene indicates a unique id for a gene, /note="transcript_id=..." a unique id for a transcript, /protein_id a unique id for a peptide and note="exon_id=..." a unique id for an exon. These ids are maintained wherever possible between versions.
COMMENT All the exons and transcripts in Ensembl are confirmed by similarity to either protein or cDNA sequences.
FEATURES Location/Qualifiers source 1..3498 /organism="Mus musculus" /db_xref="taxon:10090" gene complement(506..2826) /gene=ENSMUSG00000014704 mRNA join(complement(2261..2826),complement(506..1620)) /gene="ENSMUSG00000014704" /note="transcript_id=ENSMUST00000014848" CDS join(complement(2261..2639),complement(881..1620)) /gene="ENSMUSG00000014704" /protein_id="ENSMUSP00000014848" /note="transcript_id=ENSMUST00000014848" /db_xref="MarkerSymbol:Hoxa2" /db_xref="Uniprot/SWISSPROT:HXA2_MOUSE" /db_xref="RefSeq_peptide:NP_034581.1" /db_xref="RefSeq_dna:NM_010451.1" /db_xref="Uniprot/SPTREMBL:Q3UYP9_MOUSE" /db_xref="Uniprot/SPTREMBL:Q920T7_MOUSE" /db_xref="Uniprot/SPTREMBL:Q920T9_MOUSE" /db_xref="Uniprot/SPTREMBL:Q920U0_MOUSE" /db_xref="Uniprot/SPTREMBL:Q920U1_MOUSE" /db_xref="Uniprot/SPTREMBL:Q920U2_MOUSE" /db_xref="Uniprot/SPTREMBL:Q920U3_MOUSE" /db_xref="Uniprot/SPTREMBL:Q920U4_MOUSE" /db_xref="Uniprot/SPTREMBL:Q920U5_MOUSE" /db_xref="EntrezGene:15399" /db_xref="AgilentProbe:A_51_P501803" /db_xref="EMBL:AB039184" /db_xref="EMBL:AB039185" /db_xref="EMBL:AB039186" /db_xref="EMBL:AB039187" /db_xref="EMBL:AB039188" /db_xref="EMBL:AB039189" /db_xref="EMBL:AB039190" /db_xref="EMBL:AB039191" /db_xref="EMBL:AB039192" /db_xref="EMBL:AK134501" /db_xref="EMBL:M87801" /db_xref="EMBL:M93148" /db_xref="EMBL:M93292" /db_xref="EMBL:M95599" /db_xref="GO:GO:0003700" /db_xref="GO:GO:0005634" /db_xref="GO:GO:0006355" /db_xref="GO:GO:0007275" /db_xref="IPI:IPI00132242.1" /db_xref="UniGene:Mm.131" /db_xref="protein_id:AAA37827.1" /db_xref="protein_id:AAA37834.1" /db_xref="protein_id:AAA37835.1" /db_xref="protein_id:AAA37836.1" /db_xref="protein_id:BAB68708.1" /db_xref="protein_id:BAB68709.1" /db_xref="protein_id:BAB68710.1" /db_xref="protein_id:BAB68711.1" /db_xref="protein_id:BAB68712.1" /db_xref="protein_id:BAB68713.1" /db_xref="protein_id:BAB68714.1" /db_xref="protein_id:BAB68715.1" /db_xref="protein_id:BAB68716.1" /db_xref="protein_id:BAE22163.1" /db_xref="AFFY_MG_U74Av2:102643_at" 
/db_xref="AFFY_MG_U74Cv2:171063_at" /db_xref="AFFY_Mouse430A_2:1419602_at" /db_xref="AFFY_Mouse430_2:1419602_at" /translation="MNYEFEREIGFINSQPSLAECLTSFPPVADTFQSSSIKTSTLSH STLIPPPFEQTIPSLNPGSHPRHGAGVGGRPKSSPAGSRGSPVPAGALQPPEYPWMKE KKAAKKTALPPAAASTGPACLGHKESLEIADGSGGGSRRLRTAYTNTQLLELEKEFHF NKYLCRPRRVEIAALLDLTERQVKVWFQNRRMKHKRQTQCKENQNSEGKFKNLEDSDK VEEDEEEKSLFEQALSVSGALLEREGYTFQQNALSQQQAPNGHNGDSQTFPVSPLTSN EKNLKHFQHQSPTVPNCLSTMGQNCGAGLNNDSPEAIEVPSLQDFNVFSTDSCLQLSD ALSPSLPGSLDSPVDISADSFDFFTDTLTTIDLQHLNY" exon complement(506..1620) /note="exon_id=ENSMUSE00000387033" exon complement(2261..2826) /note="exon_id=ENSMUSE00000193269" BASE COUNT 938 a 815 c 882 g 863 t ORIGIN 1 AGGAAGAGTT GGAACGTAGA TGTTTGAAAC AAATGTGTAT AAATAAATGA ATTTTTGATA 61 ACTCCGTTAT TGACCTAGAA ACTAGCAGCT TGGTAAGGGA ACTCCATTCC ACTCCACTCG 121 TCCTAGAACT GGAAGTTTTT GTAGGCACTT TTCCTCTCCA CACTCAAAAG CTTGGGCTAG 181 GGCCAACTCA GGCTGCCCAA GCCCATTTCT ATTACTAATG TAACTCTATG GCCTGAGTCT 241 CAACACTGAA AACCAAATTC ATTCCCTTAG GGGGGAAAAA TCCAAAAAAA AAAAAAAAAA 301 AAGTCTTGCC AGAAGCCCTA GCACTTTCTG GTTTTCTTCT TTGTTGCTGT TTGTTGCAGG 361 CTTTGAACAT GCCACCCTAA TAAAATATAT TAAGATTGAA AAGTAAATTG TGACCAGACT 421 TTTATTTACC ATGTTAGACT AAAAGAAGTA TAAGAAATCA GTATGAGTCT TGAGAAAGAG 481 GGGAAGAAAA AAATAAGAAA GCTACTTATA GCAAAGGAGA ATTTATTCTA CCAAAAATAC 541 GCATGACAAT GCATTCTAAT GTGGTACAAA AATAAACAGA AAGTGACAAG ACAATTTATG 601 GTCACTTTCT TGCAGGCCTC CTGTTTTGTT TTTCAGGAAA ATCACATAGA AGCTTGTTGG 661 GTTCTGTGTA AAAACCACTT AGAACGCCAA CATAATTTGC AAGAGATGGC TTTAAAACTG 721 TGTCAGGGGA GAACATTAAA CGGAAAGTCC TCAACATTTG AGAGAGTAGG GGTAGATCAA 781 GAAGAAACTA AAACGAAAAT CAACTCCCAG AATAAAAGAA GGCAAAGCCA CCTGGTCAAA 841 GGCGTTTTGT TTTGTGAAGC TTTGTTTTGC TTTAATGTTC TTAGTAATTC AGATGCTGTA 901 GGTCGATTGT GGTGAGTGTG TCTGTAAAAA AGTCAAAGCT GTCAGCTGAG ATATCTACAG 961 GACTGTCCAG GGAGCCAGGC AAGCTGGGCG ACAGTGCATC TGAAAGCTGC AGGCAGGAAT 1021 CTGTGGAGAA AACATTGAAG TCCTGCAAAG AGGGGACCTC GATGGCCTCG GGACTGTCAT 1081 TGTTTAGGCC AGCTCCACAG TTCTGGCCCA TTGTTGACAA GCAGTTAGGA 
ACAGTGGGTG 1141 ACTGGTGCTG AAAATGTTTC AAATTTTTCT CATTGCTGGT TAAAGGCGAA ACTGGGAAAG 1201 TTTGGGAGTC GCCATTGTGT CCATTGGGAG CCTGCTGTTG AGAGAGCGCA TTTTGCTGAA 1261 AAGTGTACCC TTCCCTCTCC AGAAGGGCCC CGGAGACACT GAGGGCTTGC TCAAAGAGTG 1321 ACTTCTCTTC CTCGTCTTCC TCCACTTTGT CCGAGTCCTC CAGGTTTTTA AATTTCCCTT 1381 CGCTGTTTTG GTTCTCCTTG CACTGGGTTT GCCTCTTATG CTTCATTCTC CGGTTCTGAA 1441 ACCACACTTT CACTTGTCTC TCGGTCAAAT CCAGCAGCGC GGCGATTTCC ACCCTGCGGG 1501 GTCTGCAAAG GTACTTGTTG AAATGAAATT CCTTTTCCAG CTCCAAAAGC TGAGTGTTGG 1561 TGTACGCGGT TCTCAGACGC CTGGATCCCC CGCCGCTGCC ATCAGCTATT TCCAGGGATT 1621 CTGCAGAAAG GGAAACCAAC AAGAGACACA CATACAGTTG AAGGTGGAAG GGTCCGAGCA 1681 GGGTTATTCC ATTGGAGCAT AAATACAGCA GAAAAGATCA ACTGCAACAA AATGGCCGCC 1741 CCTGGATGCA GTGCAGCTAT TGTGCTGCCC TTCCTGGGAG CCCAGCCCGG GGAAGCCCAG 1801 TCTCTTCCAC CTCCATCAAA TTCCTGCCTG TGGCTTCCCC CAACCTCTTC ATCCGGGAGC 1861 AAACTTTATA TTAGCTACAA CACAATTTAT AATTAATGCA TCAGCTGCTT AGCTGAGCAA 1921 GAGCGGTCTA TCACTCTTCA TTACTGTCAA AAAGCCAAAC TCTAGGACAA CTAGACAAGA 1981 GGAGGTCAGT TCCAACTCAA ATAAATCATC CTACATTACA CAAGTTAGGG AAAGTGCCCC 2041 CCCTCCTCAA AATATATATG TCTCATTGTG GGACTCGGGA TCTATTTTCC CCTCCACCAA 2101 ACCCACTCCT GAGACCACAG GGGCATGAGA CCCGCCACCA GGCATCTCTC TCTCTCCCCC 2161 TTCCCTCGAA GCTCATGGTC CCCTCCCCCA CAACCGCTCC TAGGGAAGCC CGGAGGGGGA 2221 CAAGGGTCCC CGAGACCTGG GGCCAAGTCT CCGGACTGAC CTTTGTGGCC GAGGCAGGCA 2281 GGGCCCGTGG AGGCGGCGGC GGGCGGCAGC GCGGTTTTCT TGGCCGCCTT CTTCTCCTTC 2341 ATCCAGGGAT ACTCAGGCGG CTGCAGGGCG CCGGCAGGCA CCGGGCTGCC GCGACTGCCC 2401 GCGGGGCTCG ACTTGGGGCG GCCGCCAACG CCAGCGCCGT GGCGAGGGTG ACTGCCCGGG 2461 TTCAGGCTGG GAATGGTCTG CTCAAAAGGA GGAGGAATCA GTGTCGAGTG TGAAAGCGTC 2521 GAGGTCTTGA TTGATGAACT TTGAAATGTA TCAGCGACAG GGGGAAAAGA TGTCAGGCAC 2581 TCAGCGAGCG ACGGCTGGCT ATTGATAAAA CCAATCTCTC GCTCAAATTC GTAATTCATG 2641 GCCTTCTCCT TGGAGCCCCC TCGGAGGAAA AGTTCCCTCT TTTGGAGGGG CTTTGGGGGG 2701 GCAAGGCCCA GGAAAAAGGC GAGCGCGAAG GAAAAAAAAA TCTATCATAG AAGATCGCTG 2761 CTGGGGTGTT TTTTTTCTAA TTCACTGATT ACAGCCGTAT GGGGACCGCG CTACTATTAA 
2821 ACTATTGAAT TCATGGAGAC AAGGTTGAAA TTGGACCGAA TTGGCTGTCA CATGATTGCT 2881 TCTGCCCAAT GACAATTTGG GCTTTAATCA AAAGAAGCCA CTGTCTGTTT GATTGATCCA 2941 AAAAAGTCAG AAAGGAACGC CTCATTGGGG GCCATCGAGG CTTTATTTAC ACTTTTTTTC 3001 AGGGCAAAAA TACATATATG TGGGTGTGGA TGGCAATGCC CCGGGAGTGC GTGGGGGGCG 3061 AGAGTGCCTG TTTGCCTCCT GATCTGCAAG GATCTAGTGT GCTCCCTGGA GTGTGTGTGT 3121 GAGTGTGTGC GTGTGAGCCC TGCTGCCGTC CCGCCAGTGG CTGCCCTCTG CCTCCCCCGC 3181 ACACTCCGCG CATTGTTTGG GACTGTCGGG AAGACGCCTC GCACCTCACA AATCATTTAA 3241 GCACCTCAGC CTGACGCCTG CAGTCATTAA CAAAGTAATC CATTAATCTT CAAAGTTTTG 3301 ACACCCCAGG GCCCTGCATC TCAGCCACAT AAGTTCTGCT AAGGCAAGAG AAAGGAGCAG 3361 AGTGGGAGAG AGAGAGGAGA GAGGGAGAGA GGGAGAGAGG GAGAGAGAGA GAGAGAGAGA 3421 GAGAGAGAGA GAGAGAGAGA GAGAGAATGA ATATTGGGGT TCACCTTTCC TCTTCCTCCT 3481 CTTTTTCCAA AATCAGTT // [EMAIL PROTECTED] wrote: >Hi Morgane - > >I have to say that doesn't look much like Genbank : ) > >The biojavax parser are possibly a bit brittle due to their use of regexps >to recognize key elements. It should be fixable, I think the problem is >that the parser expects a word after LOCUS not a number. This may not be >the only problem though. Could you post the entire file? Or if it is large >then a representative file of smaller size. > >- Mark > > > > > >Morgane THOMAS-CHOLLIER <[EMAIL PROTECTED]> >Sent by: [EMAIL PROTECTED] >02/14/2006 04:36 AM > > > To: biojava-l@biojava.org > cc: (bcc: Mark Schreiber/GP/Novartis) > Subject: [Biojava-l] Genbank parser error [biojavax] > > >Hello, > >I have tried biojavax today with a view to use the Genbank file parser. > >My test file is a Genbank formatted file which has been produced by >Ensembl export system. 
>
>The head of the file is as follows:
>
>LOCUS 6 489671 bp DNA HTG 13-FEB-2006
>DEFINITION Mus musculus chromosome 6 NCBIM34 partial sequence
> 52296503..52786173 reannotated via EnsEMBL
>ACCESSION chromosome:NCBIM34:6:52296503:52786173:1
>VERSION chromosome:NCBIM34:6:52296503:52786173:1
>
>I used the code provided in the biojavax docbook to parse this file.
>I get the following error:
>
>Exception in thread "main" org.biojava.bio.BioException: Could not read sequence
>    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:111)
>    at org.embnet.be.biojavax.tryout.GenbankParseTest.main(GenbankParseTest.java:31)
>Caused by: org.biojava.bio.seq.io.ParseException: Bad locus line found: 6 489671 bp DNA HTG 13-FEB-2006
>    at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:229)
>    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:108)
>    ... 1 more
>
>I had a look at GenbankFormat.java, and I guess the problem comes from
>the regular expression, which does not recognize this LOCUS line as a
>standard Genbank LOCUS tag.
>
>Am I wrong? Has the biojavax Genbank parser been tested on
>Ensembl-exported files?
>
>Morgane.
>

--
**********************************************************
Morgane THOMAS-CHOLLIER, PHD Student ([EMAIL PROTECTED])
Vrije Universiteit Brussels (VUB)
Laboratory of Cell Genetics
Pleinlaan 2
1050 Brussels
Belgium
Tel : +32 2 629 15 22
**********************************************************
Stop Using Internet Explorer, choose FIREFOX !
http://emmanuel.clement.free.fr/navigateurs/comparatif.htm
_______________________________________________
Biojava-l mailing list - Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l