Hello Mark,

My file is indeed too large to post.
So I exported a smaller sequence from Ensembl and tested it with the parser. The behavior is the same.
You will find this "Genbank"-formatted file below.

Thanks for your help,

Morgane.

LOCUS       6 3498 bp DNA HTG 14-FEB-2006
DEFINITION  Mus musculus chromosome 6 NCBIM34 partial sequence
           52305503..52309000 reannotated via EnsEMBL
ACCESSION   chromosome:NCBIM34:6:52305503:52309000:1
VERSION     chromosome:NCBIM34:6:52305503:52309000:1
KEYWORDS    .
SOURCE      House mouse
 ORGANISM  Mus musculus
           Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
           Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
           Sciurognathi; Muridae; Murinae; Mus.
COMMENT     This sequence was annotated by the Ensembl system. Please visit the
           Ensembl web site, http://www.ensembl.org/ for more information.
COMMENT     All feature locations are relative to the first (5') base of the
           sequence in this file.  The sequence presented is always the
           forward strand of the assembly. Features that lie outside of the
           sequence contained in this file have clonal location coordinates in
           the format: .:..
COMMENT     The /gene indicates a unique id for a gene,
           /note="transcript_id=..." a unique id for a transcript, /protein_id
           a unique id for a peptide and note="exon_id=..." a unique id for an
           exon. These ids are maintained wherever possible between versions.
COMMENT     All the exons and transcripts in Ensembl are confirmed by
           similarity to either protein or cDNA sequences.
FEATURES             Location/Qualifiers
    source          1..3498
                    /organism="Mus musculus"
                    /db_xref="taxon:10090"
    gene            complement(506..2826)
                    /gene=ENSMUSG00000014704
    mRNA            join(complement(2261..2826),complement(506..1620))
                    /gene="ENSMUSG00000014704"
                    /note="transcript_id=ENSMUST00000014848"
    CDS             join(complement(2261..2639),complement(881..1620))
                    /gene="ENSMUSG00000014704"
                    /protein_id="ENSMUSP00000014848"
                    /note="transcript_id=ENSMUST00000014848"
                    /db_xref="MarkerSymbol:Hoxa2"
                    /db_xref="Uniprot/SWISSPROT:HXA2_MOUSE"
                    /db_xref="RefSeq_peptide:NP_034581.1"
                    /db_xref="RefSeq_dna:NM_010451.1"
                    /db_xref="Uniprot/SPTREMBL:Q3UYP9_MOUSE"
                    /db_xref="Uniprot/SPTREMBL:Q920T7_MOUSE"
                    /db_xref="Uniprot/SPTREMBL:Q920T9_MOUSE"
                    /db_xref="Uniprot/SPTREMBL:Q920U0_MOUSE"
                    /db_xref="Uniprot/SPTREMBL:Q920U1_MOUSE"
                    /db_xref="Uniprot/SPTREMBL:Q920U2_MOUSE"
                    /db_xref="Uniprot/SPTREMBL:Q920U3_MOUSE"
                    /db_xref="Uniprot/SPTREMBL:Q920U4_MOUSE"
                    /db_xref="Uniprot/SPTREMBL:Q920U5_MOUSE"
                    /db_xref="EntrezGene:15399"
                    /db_xref="AgilentProbe:A_51_P501803"
                    /db_xref="EMBL:AB039184"
                    /db_xref="EMBL:AB039185"
                    /db_xref="EMBL:AB039186"
                    /db_xref="EMBL:AB039187"
                    /db_xref="EMBL:AB039188"
                    /db_xref="EMBL:AB039189"
                    /db_xref="EMBL:AB039190"
                    /db_xref="EMBL:AB039191"
                    /db_xref="EMBL:AB039192"
                    /db_xref="EMBL:AK134501"
                    /db_xref="EMBL:M87801"
                    /db_xref="EMBL:M93148"
                    /db_xref="EMBL:M93292"
                    /db_xref="EMBL:M95599"
                    /db_xref="GO:GO:0003700"
                    /db_xref="GO:GO:0005634"
                    /db_xref="GO:GO:0006355"
                    /db_xref="GO:GO:0007275"
                    /db_xref="IPI:IPI00132242.1"
                    /db_xref="UniGene:Mm.131"
                    /db_xref="protein_id:AAA37827.1"
                    /db_xref="protein_id:AAA37834.1"
                    /db_xref="protein_id:AAA37835.1"
                    /db_xref="protein_id:AAA37836.1"
                    /db_xref="protein_id:BAB68708.1"
                    /db_xref="protein_id:BAB68709.1"
                    /db_xref="protein_id:BAB68710.1"
                    /db_xref="protein_id:BAB68711.1"
                    /db_xref="protein_id:BAB68712.1"
                    /db_xref="protein_id:BAB68713.1"
                    /db_xref="protein_id:BAB68714.1"
                    /db_xref="protein_id:BAB68715.1"
                    /db_xref="protein_id:BAB68716.1"
                    /db_xref="protein_id:BAE22163.1"
                    /db_xref="AFFY_MG_U74Av2:102643_at"
                    /db_xref="AFFY_MG_U74Cv2:171063_at"
                    /db_xref="AFFY_Mouse430A_2:1419602_at"
                    /db_xref="AFFY_Mouse430_2:1419602_at"
                    /translation="MNYEFEREIGFINSQPSLAECLTSFPPVADTFQSSSIKTSTLSH
                    STLIPPPFEQTIPSLNPGSHPRHGAGVGGRPKSSPAGSRGSPVPAGALQPPEYPWMKE
                    KKAAKKTALPPAAASTGPACLGHKESLEIADGSGGGSRRLRTAYTNTQLLELEKEFHF
                    NKYLCRPRRVEIAALLDLTERQVKVWFQNRRMKHKRQTQCKENQNSEGKFKNLEDSDK
                    VEEDEEEKSLFEQALSVSGALLEREGYTFQQNALSQQQAPNGHNGDSQTFPVSPLTSN
                    EKNLKHFQHQSPTVPNCLSTMGQNCGAGLNNDSPEAIEVPSLQDFNVFSTDSCLQLSD
                    ALSPSLPGSLDSPVDISADSFDFFTDTLTTIDLQHLNY"
    exon            complement(506..1620)
                    /note="exon_id=ENSMUSE00000387033"
    exon            complement(2261..2826)
                    /note="exon_id=ENSMUSE00000193269"
BASE COUNT  938 a 815 c 882 g 863 t
ORIGIN
       1 AGGAAGAGTT GGAACGTAGA TGTTTGAAAC AAATGTGTAT AAATAAATGA ATTTTTGATA
      61 ACTCCGTTAT TGACCTAGAA ACTAGCAGCT TGGTAAGGGA ACTCCATTCC ACTCCACTCG
     121 TCCTAGAACT GGAAGTTTTT GTAGGCACTT TTCCTCTCCA CACTCAAAAG CTTGGGCTAG
     181 GGCCAACTCA GGCTGCCCAA GCCCATTTCT ATTACTAATG TAACTCTATG GCCTGAGTCT
     241 CAACACTGAA AACCAAATTC ATTCCCTTAG GGGGGAAAAA TCCAAAAAAA AAAAAAAAAA
     301 AAGTCTTGCC AGAAGCCCTA GCACTTTCTG GTTTTCTTCT TTGTTGCTGT TTGTTGCAGG
     361 CTTTGAACAT GCCACCCTAA TAAAATATAT TAAGATTGAA AAGTAAATTG TGACCAGACT
     421 TTTATTTACC ATGTTAGACT AAAAGAAGTA TAAGAAATCA GTATGAGTCT TGAGAAAGAG
     481 GGGAAGAAAA AAATAAGAAA GCTACTTATA GCAAAGGAGA ATTTATTCTA CCAAAAATAC
     541 GCATGACAAT GCATTCTAAT GTGGTACAAA AATAAACAGA AAGTGACAAG ACAATTTATG
     601 GTCACTTTCT TGCAGGCCTC CTGTTTTGTT TTTCAGGAAA ATCACATAGA AGCTTGTTGG
     661 GTTCTGTGTA AAAACCACTT AGAACGCCAA CATAATTTGC AAGAGATGGC TTTAAAACTG
     721 TGTCAGGGGA GAACATTAAA CGGAAAGTCC TCAACATTTG AGAGAGTAGG GGTAGATCAA
     781 GAAGAAACTA AAACGAAAAT CAACTCCCAG AATAAAAGAA GGCAAAGCCA CCTGGTCAAA
     841 GGCGTTTTGT TTTGTGAAGC TTTGTTTTGC TTTAATGTTC TTAGTAATTC AGATGCTGTA
     901 GGTCGATTGT GGTGAGTGTG TCTGTAAAAA AGTCAAAGCT GTCAGCTGAG ATATCTACAG
     961 GACTGTCCAG GGAGCCAGGC AAGCTGGGCG ACAGTGCATC TGAAAGCTGC AGGCAGGAAT
    1021 CTGTGGAGAA AACATTGAAG TCCTGCAAAG AGGGGACCTC GATGGCCTCG GGACTGTCAT
    1081 TGTTTAGGCC AGCTCCACAG TTCTGGCCCA TTGTTGACAA GCAGTTAGGA ACAGTGGGTG
    1141 ACTGGTGCTG AAAATGTTTC AAATTTTTCT CATTGCTGGT TAAAGGCGAA ACTGGGAAAG
    1201 TTTGGGAGTC GCCATTGTGT CCATTGGGAG CCTGCTGTTG AGAGAGCGCA TTTTGCTGAA
    1261 AAGTGTACCC TTCCCTCTCC AGAAGGGCCC CGGAGACACT GAGGGCTTGC TCAAAGAGTG
    1321 ACTTCTCTTC CTCGTCTTCC TCCACTTTGT CCGAGTCCTC CAGGTTTTTA AATTTCCCTT
    1381 CGCTGTTTTG GTTCTCCTTG CACTGGGTTT GCCTCTTATG CTTCATTCTC CGGTTCTGAA
    1441 ACCACACTTT CACTTGTCTC TCGGTCAAAT CCAGCAGCGC GGCGATTTCC ACCCTGCGGG
    1501 GTCTGCAAAG GTACTTGTTG AAATGAAATT CCTTTTCCAG CTCCAAAAGC TGAGTGTTGG
    1561 TGTACGCGGT TCTCAGACGC CTGGATCCCC CGCCGCTGCC ATCAGCTATT TCCAGGGATT
    1621 CTGCAGAAAG GGAAACCAAC AAGAGACACA CATACAGTTG AAGGTGGAAG GGTCCGAGCA
    1681 GGGTTATTCC ATTGGAGCAT AAATACAGCA GAAAAGATCA ACTGCAACAA AATGGCCGCC
    1741 CCTGGATGCA GTGCAGCTAT TGTGCTGCCC TTCCTGGGAG CCCAGCCCGG GGAAGCCCAG
    1801 TCTCTTCCAC CTCCATCAAA TTCCTGCCTG TGGCTTCCCC CAACCTCTTC ATCCGGGAGC
    1861 AAACTTTATA TTAGCTACAA CACAATTTAT AATTAATGCA TCAGCTGCTT AGCTGAGCAA
    1921 GAGCGGTCTA TCACTCTTCA TTACTGTCAA AAAGCCAAAC TCTAGGACAA CTAGACAAGA
    1981 GGAGGTCAGT TCCAACTCAA ATAAATCATC CTACATTACA CAAGTTAGGG AAAGTGCCCC
    2041 CCCTCCTCAA AATATATATG TCTCATTGTG GGACTCGGGA TCTATTTTCC CCTCCACCAA
    2101 ACCCACTCCT GAGACCACAG GGGCATGAGA CCCGCCACCA GGCATCTCTC TCTCTCCCCC
    2161 TTCCCTCGAA GCTCATGGTC CCCTCCCCCA CAACCGCTCC TAGGGAAGCC CGGAGGGGGA
    2221 CAAGGGTCCC CGAGACCTGG GGCCAAGTCT CCGGACTGAC CTTTGTGGCC GAGGCAGGCA
    2281 GGGCCCGTGG AGGCGGCGGC GGGCGGCAGC GCGGTTTTCT TGGCCGCCTT CTTCTCCTTC
    2341 ATCCAGGGAT ACTCAGGCGG CTGCAGGGCG CCGGCAGGCA CCGGGCTGCC GCGACTGCCC
    2401 GCGGGGCTCG ACTTGGGGCG GCCGCCAACG CCAGCGCCGT GGCGAGGGTG ACTGCCCGGG
    2461 TTCAGGCTGG GAATGGTCTG CTCAAAAGGA GGAGGAATCA GTGTCGAGTG TGAAAGCGTC
    2521 GAGGTCTTGA TTGATGAACT TTGAAATGTA TCAGCGACAG GGGGAAAAGA TGTCAGGCAC
    2581 TCAGCGAGCG ACGGCTGGCT ATTGATAAAA CCAATCTCTC GCTCAAATTC GTAATTCATG
    2641 GCCTTCTCCT TGGAGCCCCC TCGGAGGAAA AGTTCCCTCT TTTGGAGGGG CTTTGGGGGG
    2701 GCAAGGCCCA GGAAAAAGGC GAGCGCGAAG GAAAAAAAAA TCTATCATAG AAGATCGCTG
    2761 CTGGGGTGTT TTTTTTCTAA TTCACTGATT ACAGCCGTAT GGGGACCGCG CTACTATTAA
    2821 ACTATTGAAT TCATGGAGAC AAGGTTGAAA TTGGACCGAA TTGGCTGTCA CATGATTGCT
    2881 TCTGCCCAAT GACAATTTGG GCTTTAATCA AAAGAAGCCA CTGTCTGTTT GATTGATCCA
    2941 AAAAAGTCAG AAAGGAACGC CTCATTGGGG GCCATCGAGG CTTTATTTAC ACTTTTTTTC
    3001 AGGGCAAAAA TACATATATG TGGGTGTGGA TGGCAATGCC CCGGGAGTGC GTGGGGGGCG
    3061 AGAGTGCCTG TTTGCCTCCT GATCTGCAAG GATCTAGTGT GCTCCCTGGA GTGTGTGTGT
    3121 GAGTGTGTGC GTGTGAGCCC TGCTGCCGTC CCGCCAGTGG CTGCCCTCTG CCTCCCCCGC
    3181 ACACTCCGCG CATTGTTTGG GACTGTCGGG AAGACGCCTC GCACCTCACA AATCATTTAA
    3241 GCACCTCAGC CTGACGCCTG CAGTCATTAA CAAAGTAATC CATTAATCTT CAAAGTTTTG
    3301 ACACCCCAGG GCCCTGCATC TCAGCCACAT AAGTTCTGCT AAGGCAAGAG AAAGGAGCAG
    3361 AGTGGGAGAG AGAGAGGAGA GAGGGAGAGA GGGAGAGAGG GAGAGAGAGA GAGAGAGAGA
    3421 GAGAGAGAGA GAGAGAGAGA GAGAGAATGA ATATTGGGGT TCACCTTTCC TCTTCCTCCT
    3481 CTTTTTCCAA AATCAGTT
//




[EMAIL PROTECTED] wrote:

Hi Morgane -

I have to say that doesn't look much like Genbank : )

The biojavax parsers are possibly a bit brittle due to their use of regexps to recognize key elements. It should be fixable; I think the problem is that the parser expects a word after LOCUS, not a number. This may not be the only problem, though. Could you post the entire file? Or, if it is large, a smaller representative file?
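To illustrate the kind of failure I suspect, here is a minimal sketch. Both patterns are hypothetical and written only for this example; neither is the actual expression used in GenbankFormat.java, and the field layout is simplified to match the Ensembl LOCUS line. The strict pattern assumes the locus name starts with a letter, the lenient one accepts any non-whitespace token.

```java
import java.util.regex.Pattern;

public class LocusLineDemo {
    // Hypothetical strict pattern: the locus name must begin with a letter,
    // so a bare chromosome number such as "6" is rejected.
    static final Pattern STRICT = Pattern.compile(
        "^LOCUS\\s+([A-Za-z]\\S*)\\s+(\\d+)\\s+bp\\s+(\\S+)\\s+(\\S+)\\s+(\\d{2}-[A-Z]{3}-\\d{4})$");

    // Hypothetical lenient pattern: any non-whitespace token is accepted
    // as the locus name, including a purely numeric one.
    static final Pattern LENIENT = Pattern.compile(
        "^LOCUS\\s+(\\S+)\\s+(\\d+)\\s+bp\\s+(\\S+)\\s+(\\S+)\\s+(\\d{2}-[A-Z]{3}-\\d{4})$");

    // Returns true when the whole line matches the given pattern.
    static boolean matches(Pattern p, String line) {
        return p.matcher(line).matches();
    }

    public static void main(String[] args) {
        // LOCUS line as exported by Ensembl (taken from this thread).
        String ensembl = "LOCUS       6 3498 bp DNA HTG 14-FEB-2006";
        System.out.println("strict:  " + matches(STRICT, ensembl));
        System.out.println("lenient: " + matches(LENIENT, ensembl));
    }
}
```

Running this prints "strict:  false" and "lenient: true" for the Ensembl line. If the real parser has a similar restriction, relaxing the locus-name group in the same way might be enough, although other fields of the line could also differ from what the parser expects.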

- Mark





Morgane THOMAS-CHOLLIER <[EMAIL PROTECTED]>
Sent by: [EMAIL PROTECTED]
02/14/2006 04:36 AM


       To:     biojava-l@biojava.org
       cc:     (bcc: Mark Schreiber/GP/Novartis)
       Subject:        [Biojava-l] Genbank  parser error [biojavax]


Hello,

I tried biojavax today with a view to using the Genbank file parser.

My test file is a Genbank-formatted file produced by the Ensembl export system.

The head of the file is as follows:

LOCUS       6 489671 bp DNA HTG 13-FEB-2006
DEFINITION  Mus musculus chromosome 6 NCBIM34 partial sequence
           52296503..52786173 reannotated via EnsEMBL
ACCESSION   chromosome:NCBIM34:6:52296503:52786173:1
VERSION     chromosome:NCBIM34:6:52296503:52786173:1

I used the code provided in the biojavax docbook to parse this file.
I get the following error:

Exception in thread "main" org.biojava.bio.BioException: Could not read sequence
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:111)
    at org.embnet.be.biojavax.tryout.GenbankParseTest.main(GenbankParseTest.java:31)
Caused by: org.biojava.bio.seq.io.ParseException: Bad locus line found: 6 489671 bp DNA HTG 13-FEB-2006
    at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:229)
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:108)
    ... 1 more

I had a look at GenbankFormat.java, and I guess the problem comes from the regular expression, which does not recognize this LOCUS line as a standard Genbank LOCUS tag.

Am I wrong? Has the biojavax Genbank parser been tested on Ensembl-exported files?

Morgane.


--
**********************************************************
Morgane THOMAS-CHOLLIER, PHD Student ([EMAIL PROTECTED])

Vrije Universiteit Brussels (VUB)
Laboratory of Cell Genetics
Pleinlaan 2
1050 Brussels
Belgium
Tel : +32 2 629 15 22
**********************************************************
Stop Using Internet Explorer, choose FIREFOX !
http://emmanuel.clement.free.fr/navigateurs/comparatif.htm

_______________________________________________
Biojava-l mailing list  -  Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l
