Hello Mark,
Thank you very much for your quick reply.
However, I could not find out how to get the organism information via
the (Rich)Annotation.
Would it be possible for you to post a piece of code showing how I could
retrieve the common name of the organism?
Sorry for insisting, but I really need this parser for my work, and I
also really need to retrieve the organism info from the file :)
Thank you for your help,
Morgane.
[EMAIL PROTECTED] wrote:
I think these properties should be going to the (Rich)Annotation bundle.
- Mark
Morgane THOMAS-CHOLLIER <[EMAIL PROTECTED]>
Sent by: [EMAIL PROTECTED]
02/15/2006 04:56 PM
To: biojava-l@biojava.org
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: Re: [Biojava-l] Genbank parser error [biojavax]
Hello again,
I have continued using the Genbank parser, but this time with Genbank
files coming from NCBI :)
I really appreciate the example from the documentation that converts a
Genbank file into an EMBL file. I have to say, it is really easy to use.
I nevertheless have a question concerning the Organism and Source tags.
Indeed, it is clear in the documentation that they are ignored by the
parser.
But I do not really understand why.
When I used the Genbank files for accession numbers AC147788 and
DQ158013, I was unable to get the common name of the organism or use
getNameHierarchy(), but I could get the taxon ID for both.
Is there a way to get the common name of the organism without making a
remote call to NCBI with the taxon ID?
Thanks for your help,
Morgane.
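One way to get the organism text without a remote call is to read it from the flat-file header itself. The following is a minimal sketch using only the Java standard library. It does not use the biojavax API, and the class and method names (OrganismHeaderSketch, read) are hypothetical. It assumes the SOURCE and ORGANISM keywords each start their own line and that the taxon ID appears as a /db_xref="taxon:..." qualifier, as in the file quoted in this thread. Note that NCBI records usually write the SOURCE line as the scientific name followed by the common name in parentheses, so the returned text may need further trimming.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OrganismHeaderSketch {

    /** Plain holder for the three organism-related header values. */
    static final class Organism {
        final String source;    // text of the SOURCE line, returned verbatim
        final String organism;  // text of the ORGANISM line (scientific name)
        final Integer taxonId;  // number from /db_xref="taxon:...", or null

        Organism(String source, String organism, Integer taxonId) {
            this.source = source;
            this.organism = organism;
            this.taxonId = taxonId;
        }
    }

    private static final Pattern TAXON =
            Pattern.compile("/db_xref=\"taxon:(\\d+)\"");

    /** Scans header lines up to ORIGIN or "//" and keeps the first match of each field. */
    static Organism read(Iterable<String> lines) {
        String source = null;
        String organism = null;
        Integer taxonId = null;
        for (String line : lines) {
            String t = line.trim();
            if (t.startsWith("ORIGIN") || t.equals("//")) {
                break; // the sequence data starts here, nothing more to read
            }
            if (source == null && t.startsWith("SOURCE ")) {
                source = t.substring("SOURCE ".length()).trim();
            } else if (organism == null && t.startsWith("ORGANISM ")) {
                organism = t.substring("ORGANISM ".length()).trim();
            } else if (taxonId == null) {
                Matcher m = TAXON.matcher(t);
                if (m.find()) {
                    taxonId = Integer.valueOf(m.group(1));
                }
            }
        }
        return new Organism(source, organism, taxonId);
    }

    public static void main(String[] args) {
        // Header lines copied from the Ensembl export quoted in this thread.
        Organism o = read(Arrays.asList(
                "SOURCE House mouse",
                "ORGANISM Mus musculus",
                "source 1..3498",
                "/organism=\"Mus musculus\"",
                "/db_xref=\"taxon:10090\"",
                "ORIGIN"));
        System.out.println(o.source + " | " + o.organism + " | " + o.taxonId);
    }
}
```

For a file on disk, the lines could be supplied with java.nio.file.Files.readAllLines(path). The match on "SOURCE " is case-sensitive, so the lowercase "source" feature key in the FEATURES table is not mistaken for the header keyword.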
Morgane THOMAS-CHOLLIER wrote:
Hello Mark,
My file is indeed too large to be posted.
So I have exported a smaller sequence from Ensembl that I tested with
the parser. The behavior is the same.
You will find this "Genbank"-formatted file enclosed below.
Thanks for your help,
Morgane.
LOCUS 6 3498 bp DNA HTG 14-FEB-2006
DEFINITION Mus musculus chromosome 6 NCBIM34 partial sequence
52305503..52309000 reannotated via EnsEMBL
ACCESSION chromosome:NCBIM34:6:52305503:52309000:1
VERSION chromosome:NCBIM34:6:52305503:52309000:1
KEYWORDS .
SOURCE House mouse
ORGANISM Mus musculus
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
Sciurognathi; Muridae; Murinae; Mus.
COMMENT This sequence was annotated by the Ensembl system. Please visit the
Ensembl web site, http://www.ensembl.org/ for more information.
COMMENT All feature locations are relative to the first (5') base of the
sequence in this file. The sequence presented is always the
forward strand of the assembly. Features that lie outside of the
sequence contained in this file have clonal location coordinates in
the format: .:..
COMMENT The /gene indicates a unique id for a gene,
/note="transcript_id=..." a unique id for a transcript, /protein_id
a unique id for a peptide and note="exon_id=..." a unique id for an
exon. These ids are maintained wherever possible between versions.
COMMENT All the exons and transcripts in Ensembl are confirmed by
similarity to either protein or cDNA sequences.
FEATURES Location/Qualifiers
source 1..3498
/organism="Mus musculus"
/db_xref="taxon:10090"
gene complement(506..2826)
/gene=ENSMUSG00000014704
mRNA join(complement(2261..2826),complement(506..1620))
/gene="ENSMUSG00000014704"
/note="transcript_id=ENSMUST00000014848"
CDS join(complement(2261..2639),complement(881..1620))
/gene="ENSMUSG00000014704"
/protein_id="ENSMUSP00000014848"
/note="transcript_id=ENSMUST00000014848"
/db_xref="MarkerSymbol:Hoxa2"
/db_xref="Uniprot/SWISSPROT:HXA2_MOUSE"
/db_xref="RefSeq_peptide:NP_034581.1"
/db_xref="RefSeq_dna:NM_010451.1"
/db_xref="Uniprot/SPTREMBL:Q3UYP9_MOUSE"
/db_xref="Uniprot/SPTREMBL:Q920T7_MOUSE"
/db_xref="Uniprot/SPTREMBL:Q920T9_MOUSE"
/db_xref="Uniprot/SPTREMBL:Q920U0_MOUSE"
/db_xref="Uniprot/SPTREMBL:Q920U1_MOUSE"
/db_xref="Uniprot/SPTREMBL:Q920U2_MOUSE"
/db_xref="Uniprot/SPTREMBL:Q920U3_MOUSE"
/db_xref="Uniprot/SPTREMBL:Q920U4_MOUSE"
/db_xref="Uniprot/SPTREMBL:Q920U5_MOUSE"
/db_xref="EntrezGene:15399"
/db_xref="AgilentProbe:A_51_P501803"
/db_xref="EMBL:AB039184"
/db_xref="EMBL:AB039185"
/db_xref="EMBL:AB039186"
/db_xref="EMBL:AB039187"
/db_xref="EMBL:AB039188"
/db_xref="EMBL:AB039189"
/db_xref="EMBL:AB039190"
/db_xref="EMBL:AB039191"
/db_xref="EMBL:AB039192"
/db_xref="EMBL:AK134501"
/db_xref="EMBL:M87801"
/db_xref="EMBL:M93148"
/db_xref="EMBL:M93292"
/db_xref="EMBL:M95599"
/db_xref="GO:GO:0003700"
/db_xref="GO:GO:0005634"
/db_xref="GO:GO:0006355"
/db_xref="GO:GO:0007275"
/db_xref="IPI:IPI00132242.1"
/db_xref="UniGene:Mm.131"
/db_xref="protein_id:AAA37827.1"
/db_xref="protein_id:AAA37834.1"
/db_xref="protein_id:AAA37835.1"
/db_xref="protein_id:AAA37836.1"
/db_xref="protein_id:BAB68708.1"
/db_xref="protein_id:BAB68709.1"
/db_xref="protein_id:BAB68710.1"
/db_xref="protein_id:BAB68711.1"
/db_xref="protein_id:BAB68712.1"
/db_xref="protein_id:BAB68713.1"
/db_xref="protein_id:BAB68714.1"
/db_xref="protein_id:BAB68715.1"
/db_xref="protein_id:BAB68716.1"
/db_xref="protein_id:BAE22163.1"
/db_xref="AFFY_MG_U74Av2:102643_at"
/db_xref="AFFY_MG_U74Cv2:171063_at"
/db_xref="AFFY_Mouse430A_2:1419602_at"
/db_xref="AFFY_Mouse430_2:1419602_at"
/translation="MNYEFEREIGFINSQPSLAECLTSFPPVADTFQSSSIKTSTLSH
STLIPPPFEQTIPSLNPGSHPRHGAGVGGRPKSSPAGSRGSPVPAGALQPPEYPWMKE
KKAAKKTALPPAAASTGPACLGHKESLEIADGSGGGSRRLRTAYTNTQLLELEKEFHF
NKYLCRPRRVEIAALLDLTERQVKVWFQNRRMKHKRQTQCKENQNSEGKFKNLEDSDK
VEEDEEEKSLFEQALSVSGALLEREGYTFQQNALSQQQAPNGHNGDSQTFPVSPLTSN
EKNLKHFQHQSPTVPNCLSTMGQNCGAGLNNDSPEAIEVPSLQDFNVFSTDSCLQLSD
ALSPSLPGSLDSPVDISADSFDFFTDTLTTIDLQHLNY"
exon complement(506..1620)
/note="exon_id=ENSMUSE00000387033"
exon complement(2261..2826)
/note="exon_id=ENSMUSE00000193269"
BASE COUNT 938 a 815 c 882 g 863 t
ORIGIN
1 AGGAAGAGTT GGAACGTAGA TGTTTGAAAC AAATGTGTAT AAATAAATGA
ATTTTTGATA
61 ACTCCGTTAT TGACCTAGAA ACTAGCAGCT TGGTAAGGGA ACTCCATTCC
ACTCCACTCG
121 TCCTAGAACT GGAAGTTTTT GTAGGCACTT TTCCTCTCCA CACTCAAAAG
CTTGGGCTAG
181 GGCCAACTCA GGCTGCCCAA GCCCATTTCT ATTACTAATG TAACTCTATG
GCCTGAGTCT
241 CAACACTGAA AACCAAATTC ATTCCCTTAG GGGGGAAAAA TCCAAAAAAA
AAAAAAAAAA
301 AAGTCTTGCC AGAAGCCCTA GCACTTTCTG GTTTTCTTCT TTGTTGCTGT
TTGTTGCAGG
361 CTTTGAACAT GCCACCCTAA TAAAATATAT TAAGATTGAA AAGTAAATTG
TGACCAGACT
421 TTTATTTACC ATGTTAGACT AAAAGAAGTA TAAGAAATCA GTATGAGTCT
TGAGAAAGAG
481 GGGAAGAAAA AAATAAGAAA GCTACTTATA GCAAAGGAGA ATTTATTCTA
CCAAAAATAC
541 GCATGACAAT GCATTCTAAT GTGGTACAAA AATAAACAGA AAGTGACAAG
ACAATTTATG
601 GTCACTTTCT TGCAGGCCTC CTGTTTTGTT TTTCAGGAAA ATCACATAGA
AGCTTGTTGG
661 GTTCTGTGTA AAAACCACTT AGAACGCCAA CATAATTTGC AAGAGATGGC
TTTAAAACTG
721 TGTCAGGGGA GAACATTAAA CGGAAAGTCC TCAACATTTG AGAGAGTAGG
GGTAGATCAA
781 GAAGAAACTA AAACGAAAAT CAACTCCCAG AATAAAAGAA GGCAAAGCCA
CCTGGTCAAA
841 GGCGTTTTGT TTTGTGAAGC TTTGTTTTGC TTTAATGTTC TTAGTAATTC
AGATGCTGTA
901 GGTCGATTGT GGTGAGTGTG TCTGTAAAAA AGTCAAAGCT GTCAGCTGAG
ATATCTACAG
961 GACTGTCCAG GGAGCCAGGC AAGCTGGGCG ACAGTGCATC TGAAAGCTGC
AGGCAGGAAT
1021 CTGTGGAGAA AACATTGAAG TCCTGCAAAG AGGGGACCTC GATGGCCTCG
GGACTGTCAT
1081 TGTTTAGGCC AGCTCCACAG TTCTGGCCCA TTGTTGACAA GCAGTTAGGA
ACAGTGGGTG
1141 ACTGGTGCTG AAAATGTTTC AAATTTTTCT CATTGCTGGT TAAAGGCGAA
ACTGGGAAAG
1201 TTTGGGAGTC GCCATTGTGT CCATTGGGAG CCTGCTGTTG AGAGAGCGCA
TTTTGCTGAA
1261 AAGTGTACCC TTCCCTCTCC AGAAGGGCCC CGGAGACACT GAGGGCTTGC
TCAAAGAGTG
1321 ACTTCTCTTC CTCGTCTTCC TCCACTTTGT CCGAGTCCTC CAGGTTTTTA
AATTTCCCTT
1381 CGCTGTTTTG GTTCTCCTTG CACTGGGTTT GCCTCTTATG CTTCATTCTC
CGGTTCTGAA
1441 ACCACACTTT CACTTGTCTC TCGGTCAAAT CCAGCAGCGC GGCGATTTCC
ACCCTGCGGG
1501 GTCTGCAAAG GTACTTGTTG AAATGAAATT CCTTTTCCAG CTCCAAAAGC
TGAGTGTTGG
1561 TGTACGCGGT TCTCAGACGC CTGGATCCCC CGCCGCTGCC ATCAGCTATT
TCCAGGGATT
1621 CTGCAGAAAG GGAAACCAAC AAGAGACACA CATACAGTTG AAGGTGGAAG
GGTCCGAGCA
1681 GGGTTATTCC ATTGGAGCAT AAATACAGCA GAAAAGATCA ACTGCAACAA
AATGGCCGCC
1741 CCTGGATGCA GTGCAGCTAT TGTGCTGCCC TTCCTGGGAG CCCAGCCCGG
GGAAGCCCAG
1801 TCTCTTCCAC CTCCATCAAA TTCCTGCCTG TGGCTTCCCC CAACCTCTTC
ATCCGGGAGC
1861 AAACTTTATA TTAGCTACAA CACAATTTAT AATTAATGCA TCAGCTGCTT
AGCTGAGCAA
1921 GAGCGGTCTA TCACTCTTCA TTACTGTCAA AAAGCCAAAC TCTAGGACAA
CTAGACAAGA
1981 GGAGGTCAGT TCCAACTCAA ATAAATCATC CTACATTACA CAAGTTAGGG
AAAGTGCCCC
2041 CCCTCCTCAA AATATATATG TCTCATTGTG GGACTCGGGA TCTATTTTCC
CCTCCACCAA
2101 ACCCACTCCT GAGACCACAG GGGCATGAGA CCCGCCACCA GGCATCTCTC
TCTCTCCCCC
2161 TTCCCTCGAA GCTCATGGTC CCCTCCCCCA CAACCGCTCC TAGGGAAGCC
CGGAGGGGGA
2221 CAAGGGTCCC CGAGACCTGG GGCCAAGTCT CCGGACTGAC CTTTGTGGCC
GAGGCAGGCA
2281 GGGCCCGTGG AGGCGGCGGC GGGCGGCAGC GCGGTTTTCT TGGCCGCCTT
CTTCTCCTTC
2341 ATCCAGGGAT ACTCAGGCGG CTGCAGGGCG CCGGCAGGCA CCGGGCTGCC
GCGACTGCCC
2401 GCGGGGCTCG ACTTGGGGCG GCCGCCAACG CCAGCGCCGT GGCGAGGGTG
ACTGCCCGGG
2461 TTCAGGCTGG GAATGGTCTG CTCAAAAGGA GGAGGAATCA GTGTCGAGTG
TGAAAGCGTC
2521 GAGGTCTTGA TTGATGAACT TTGAAATGTA TCAGCGACAG GGGGAAAAGA
TGTCAGGCAC
2581 TCAGCGAGCG ACGGCTGGCT ATTGATAAAA CCAATCTCTC GCTCAAATTC
GTAATTCATG
2641 GCCTTCTCCT TGGAGCCCCC TCGGAGGAAA AGTTCCCTCT TTTGGAGGGG
CTTTGGGGGG
2701 GCAAGGCCCA GGAAAAAGGC GAGCGCGAAG GAAAAAAAAA TCTATCATAG
AAGATCGCTG
2761 CTGGGGTGTT TTTTTTCTAA TTCACTGATT ACAGCCGTAT GGGGACCGCG
CTACTATTAA
2821 ACTATTGAAT TCATGGAGAC AAGGTTGAAA TTGGACCGAA TTGGCTGTCA
CATGATTGCT
2881 TCTGCCCAAT GACAATTTGG GCTTTAATCA AAAGAAGCCA CTGTCTGTTT
GATTGATCCA
2941 AAAAAGTCAG AAAGGAACGC CTCATTGGGG GCCATCGAGG CTTTATTTAC
ACTTTTTTTC
3001 AGGGCAAAAA TACATATATG TGGGTGTGGA TGGCAATGCC CCGGGAGTGC
GTGGGGGGCG
3061 AGAGTGCCTG TTTGCCTCCT GATCTGCAAG GATCTAGTGT GCTCCCTGGA
GTGTGTGTGT
3121 GAGTGTGTGC GTGTGAGCCC TGCTGCCGTC CCGCCAGTGG CTGCCCTCTG
CCTCCCCCGC
3181 ACACTCCGCG CATTGTTTGG GACTGTCGGG AAGACGCCTC GCACCTCACA
AATCATTTAA
3241 GCACCTCAGC CTGACGCCTG CAGTCATTAA CAAAGTAATC CATTAATCTT
CAAAGTTTTG
3301 ACACCCCAGG GCCCTGCATC TCAGCCACAT AAGTTCTGCT AAGGCAAGAG
AAAGGAGCAG
3361 AGTGGGAGAG AGAGAGGAGA GAGGGAGAGA GGGAGAGAGG GAGAGAGAGA
GAGAGAGAGA
3421 GAGAGAGAGA GAGAGAGAGA GAGAGAATGA ATATTGGGGT TCACCTTTCC
TCTTCCTCCT
3481 CTTTTTCCAA AATCAGTT
//
[EMAIL PROTECTED] wrote:
Hi Morgane -
I have to say that doesn't look much like Genbank : )
The biojavax parsers are possibly a bit brittle due to their use of
regexps to recognize key elements. It should be fixable; I think the
problem is that the parser expects a word after LOCUS, not a number.
This may not be the only problem though. Could you post the entire
file? Or if it is large then a representative file of smaller size.
- Mark
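To illustrate the kind of fix discussed here, the following is a minimal sketch of a LOCUS-line pattern that accepts any non-space token as the locus name, including a bare number such as "6". It is a hypothetical pattern written for this example and is not the expression used in GenbankFormat.java. The class name LocusLineSketch, the field names, and the assumption that topology and division are optional are all illustrative, and only nucleotide ("bp") records are handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LocusLineSketch {

    // Hypothetical pattern for illustration; NOT the one in GenbankFormat.java.
    private static final Pattern LOCUS = Pattern.compile(
            "^LOCUS\\s+(\\S+)"                     // 1: name, any non-space token
            + "\\s+(\\d+)\\s+bp"                   // 2: sequence length in base pairs
            + "\\s+(\\S+)"                         // 3: molecule type, e.g. DNA
            + "(?:\\s+(linear|circular))?"         // 4: topology, optional
            + "(?:\\s+([A-Z]{3}))?"                // 5: division, optional
            + "\\s+(\\d{2}-[A-Z]{3}-\\d{4})\\s*$"  // 6: date, e.g. 14-FEB-2006
    );

    /** Returns the LOCUS fields in order, or null when the line does not match. */
    static Map<String, String> parse(String line) {
        Matcher m = LOCUS.matcher(line);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", m.group(1));
        fields.put("length", m.group(2));
        fields.put("moleculeType", m.group(3));
        fields.put("topology", m.group(4));  // null when absent
        fields.put("division", m.group(5));  // null when absent
        fields.put("date", m.group(6));
        return fields;
    }

    public static void main(String[] args) {
        // LOCUS line from the Ensembl export quoted in this thread.
        System.out.println(parse("LOCUS 6 3498 bp DNA HTG 14-FEB-2006"));
    }
}
```

Returning null for a non-matching line lets the caller decide whether to raise a parse error or fall back to a looser rule. The stack trace in this thread shows the parser reporting the line with the LOCUS keyword already stripped, so a real fix would need to match whatever string GenbankFormat actually passes to its pattern.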
Morgane THOMAS-CHOLLIER <[EMAIL PROTECTED]>
Sent by: [EMAIL PROTECTED]
02/14/2006 04:36 AM
To: biojava-l@biojava.org
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: [Biojava-l] Genbank parser error [biojavax]
Hello,
I tried biojavax today with a view to using the Genbank file parser.
My test file is a Genbank-formatted file produced by the
Ensembl export system.
The head of the file is as follows:
LOCUS 6 489671 bp DNA HTG 13-FEB-2006
DEFINITION Mus musculus chromosome 6 NCBIM34 partial sequence
52296503..52786173 reannotated via EnsEMBL
ACCESSION chromosome:NCBIM34:6:52296503:52786173:1
VERSION chromosome:NCBIM34:6:52296503:52786173:1
I used the code provided in the biojavax docbook to parse this file.
I get the following error:
Exception in thread "main" org.biojava.bio.BioException: Could not read sequence
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:111)
    at org.embnet.be.biojavax.tryout.GenbankParseTest.main(GenbankParseTest.java:31)
Caused by: org.biojava.bio.seq.io.ParseException: Bad locus line found: 6 489671 bp DNA HTG 13-FEB-2006
    at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:229)
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:108)
    ... 1 more
I had a look at GenbankFormat.java, and I guess the problem comes
from the regular expression, which does not recognize this LOCUS line
as a standard Genbank LOCUS line.
Am I wrong? Has the biojavax Genbank parser been tested on
Ensembl-exported files?
Morgane.
--
**********************************************************
Morgane THOMAS-CHOLLIER, PHD Student ([EMAIL PROTECTED])
Vrije Universiteit Brussels (VUB)
Laboratory of Cell Genetics
Pleinlaan 2
1050 Brussels
Belgium
Tel : +32 2 629 15 22
**********************************************************