I see now! It looks like the ASN2GB converter is taking some liberties
with EMBL format. I'll try to experiment with command line options of
that software and if all else fails get hold of the NCBI developers.
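In the meantime, one stopgap would be to patch the converter output before it reaches the parser. Below is a rough sketch, assuming the only defects are the two Richard describes (no XX after the feature table, no SQ line before the sequence) and that the file is the raw converter output rather than a mail-wrapped copy. The class name EmblPatch and the bare "SQ   Sequence" header are my own placeholders; I have not verified that EMBLFormat accepts an SQ line without base counts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class EmblPatch {
    // A sequence row: groups of lowercase bases, optionally followed by a running count.
    private static final Pattern SEQ_ROW =
            Pattern.compile("^\\s*[a-z]+(\\s+[a-z]+)*\\s*\\d*\\s*$");

    /** Returns a copy of the lines with XX and SQ inserted where they are missing. */
    static List<String> patch(List<String> lines) {
        List<String> out = new ArrayList<>();
        boolean seenSq = false;
        for (String line : lines) {
            if (line.startsWith("SQ")) {
                seenSq = true;                 // record already has its sequence header
            } else if (line.startsWith("//")) {
                seenSq = false;                // end of record, reset for the next one
            } else if (!seenSq && SEQ_ROW.matcher(line).matches()) {
                // First sequence row reached without an SQ line in front of it.
                if (out.isEmpty() || !out.get(out.size() - 1).startsWith("XX")) {
                    out.add("XX");
                }
                out.add("SQ   Sequence");      // placeholder header (assumption, see above)
                seenSq = true;
            }
            out.add(line);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> broken = List.of("ID   DQ472184", "FT   source 1..6", "atgcac 6", "//");
        patch(broken).forEach(System.out::println);
    }
}
```

A record that already has both lines passes through unchanged, so the patch should be safe to run on every file.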
On 6/6/06, Richard Holland <[EMAIL PROTECTED]> wrote:
> The program used to generate that EMBL file is doing it incorrectly - it
> is missing the XX tag after the feature table, and is also missing the
> SQ tag before the sequence begins.
>
> If you generated it using BJX then that's my problem to fix so let me
> know ASAP if that is the case!
>
> cheers,
> Richard
>
> On Mon, 2006-06-05 at 11:53 -0400, Seth Johnson wrote:
> > :) I got another one for you:
> > =========================
> > org.biojava.bio.BioException: Could not read sequence
> >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> >     at exonhit.parsers.GenBankParser.main(GenBankParser.java:348)
> > Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -3
> >     at java.lang.String.substring(String.java:1768)
> >     at java.lang.String.substring(String.java:1735)
> >     at org.biojavax.bio.seq.io.EMBLFormat.readSection(EMBLFormat.java:672)
> >     at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:281)
> >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> >     ... 1 more
> > Java Result: -1
> > =========================
> > File used to produce the above:
> > ~~~~~~~~~~~~~~~~~~~~~~~~~
> > ID DQ472184 standard; DNA; INV; 546 BP.
> > XX
> > AC DQ472184;
> > XX
> > SV DQ472184.1
> > DT 15-MAY-2006
> > XX
> > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene,
> > DE complete cds.
> > XX
> > KW .
> > XX
> > OS Trypanosoma cruzi strain CL Brener
> > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
> > OC Schizotrypanum.
> > XX
> > RN [1]
> > RP 1-546
> > RA De Melo L.D.B.;
> > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
> > RL Unpublished.
> > XX
> > RN [2]
> > RP 1-546
> > RA De Melo L.D.B.;
> > RT ;
> > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
> > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
> > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
> > RL 21949-900, Brazil
> > XX
> > FH Key Location/Qualifiers
> > FH
> > FT source 1..546
> > FT /organism="Trypanosoma cruzi strain CL Brener"
> > FT /mol_type="genomic DNA"
> > FT /strain="CL Brener"
> > FT /db_xref="taxon:353153"
> > FT gene <1..>546
> > FT /gene="ARC21"
> > FT /note="TcARC21"
> > FT mRNA <1..>546
> > FT /gene="ARC21"
> > FT /product="actin-related protein 3"
> > FT CDS 1..546
> > FT /gene="ARC21"
> > FT /note="actin-binding protein; ARPC3 21 kDa; putative
> > FT member of Arp2/3 complex"
> > FT /codon_start=1
> > FT /product="actin-related protein 3"
> > FT /protein_id="ABF13401.1"
> > FT /db_xref="GI:93360014"
> > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG
> > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH
> > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL
> > FT FPEKDGTGNKFWMAFAKRPFLASS"
> > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60
> > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120
> > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180
> > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240
> > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300
> > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360
> > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420
> > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480
> > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540
> > agttag 546
> > //
> > ID DQ472185 standard; DNA; INV; 543 BP.
> > XX
> > AC DQ472185;
> > XX
> > SV DQ472185.1
> > DT 15-MAY-2006
> > XX
> > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene,
> > DE complete cds.
> > XX
> > KW .
> > XX
> > OS Trypanosoma cruzi strain CL Brener
> > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
> > OC Schizotrypanum.
> > XX
> > RN [1]
> > RP 1-543
> > RA De Melo L.D.B.;
> > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
> > RL Unpublished.
> > XX
> > RN [2]
> > RP 1-543
> > RA De Melo L.D.B.;
> > RT ;
> > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
> > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
> > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
> > RL 21949-900, Brazil
> > XX
> > FH Key Location/Qualifiers
> > FH
> > FT source 1..543
> > FT /organism="Trypanosoma cruzi strain CL Brener"
> > FT /mol_type="genomic DNA"
> > FT /strain="CL Brener"
> > FT /db_xref="taxon:353153"
> > FT gene <1..>543
> > FT /gene="ARC20"
> > FT /note="TcARC20"
> > FT mRNA <1..>543
> > FT /gene="ARC20"
> > FT /product="actin-related protein 4"
> > FT CDS 1..543
> > FT /gene="ARC20"
> > FT /note="actin-binding protein; ARPC4 20 kDa; putative
> > FT member of Arp2/3 complex"
> > FT /codon_start=1
> > FT /product="actin-related protein 4"
> > FT /protein_id="ABF13402.1"
> > FT /db_xref="GI:93360016"
> > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH
> > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV
> > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA
> > FT MKLNVNQRARRAAMEFFLALNFT"
> > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60
> > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120
> > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180
> > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240
> > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300
> > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360
> > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420
> > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480
> > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540
> > tga 543
> > //
> >
> > ~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > On 6/5/06, Richard Holland <[EMAIL PROTECTED]> wrote:
> > > Doh!
> > >
> > > I am in desperate need of coffee methinks... that's the second error in
> > > EMBLFormat directly related to me being stupid when I cut-and-pasted the
> > > stuff for the new 87+ ID line format...
> > >
> > > Should be fixed now in CVS (as of about 30 seconds ago).
> > >
> > > cheers,
> > > Richard
> > >
> > > On Mon, 2006-06-05 at 11:05 -0400, Seth Johnson wrote:
> > > > Hi Richard,
> > > >
> > > > I got another exception with the EMBL format:
> > > > =============================
> > > > org.biojava.bio.BioException: Could not read sequence
> > > >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > > >     at exonhit.parsers.GenBankParser.main(GenBankParser.java:347)
> > > > Caused by: java.lang.IllegalStateException: No match found
> > > >     at java.util.regex.Matcher.group(Matcher.java:461)
> > > >     at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:311)
> > > >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > > >     ... 1 more
> > > > Java Result: -1
> > > > =============================
> > > > I used the same file from the test directory (AY069118.em).
> > > >
> > > >
> > > > Seth
> > > >
> > > > On 6/5/06, Richard Holland <[EMAIL PROTECTED]> wrote:
> > > > > This one should be fixed in CVS now. Typo on my part - I put in code
> > > > > to make it work with both 87+ and pre-87 versions of EMBL, then got the
> > > > > regexes the wrong way round!!
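For anyone following along, here is a minimal sketch of how the two ID line layouts can be told apart. These are not the patterns in EMBLFormat; the class name EmblIdLine and both regexes are my own illustration. The pre-87 layout is the one in the file quoted in this thread ("ID DQ472184 standard; DNA; INV; 546 BP."), while the 87+ layout puts the accession, "SV n", topology, molecule type, data class, division and length in semicolon-separated fields. Calling Matcher.group() on a matcher that has not matched is what raises "IllegalStateException: No match found", so the sketch checks matches() before reading any group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmblIdLine {
    // Release 87+ layout: ID accession; SV n; topology; molecule; class; division; length BP.
    static final Pattern NEW_ID = Pattern.compile(
            "^ID\\s+(\\S+);\\s+SV\\s+(\\d+);\\s+\\w+;\\s+[^;]+;\\s+\\S+;\\s+\\S+;\\s+(\\d+)\\s+BP\\.$");

    // Pre-87 layout: ID name class; molecule; division; length BP.
    static final Pattern OLD_ID = Pattern.compile(
            "^ID\\s+(\\S+)\\s+\\S+;\\s+[^;]+;\\s+\\S+;\\s+(\\d+)\\s+BP\\.$");

    /** Returns {name, length}, or null if the line matches neither layout. */
    static String[] parse(String line) {
        Matcher m = NEW_ID.matcher(line);
        if (m.matches()) {
            return new String[] { m.group(1), m.group(3) };
        }
        m = OLD_ID.matcher(line);
        if (m.matches()) {
            return new String[] { m.group(1), m.group(2) };
        }
        return null; // caller decides how to report an unrecognised ID line
    }

    public static void main(String[] args) {
        String[] r = parse("ID DQ472184 standard; DNA; INV; 546 BP.");
        System.out.println(r[0] + " " + r[1]);
    }
}
```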
> > > > >
> > > > ...
> > > > >
> > > > > cheers,
> > > > > Richard
> > > > >
> > > > >
> > > > > On Fri, 2006-06-02 at 13:04 -0400, Seth Johnson wrote:
> > > > > > Hi Richard,
> > > > > >
> > > > > > I made sure I have the latest source code from CVS compiled
> > > > > > (EMBLFormat.java & GenbankFormat.java are from 05/24/06). I'm happy
> > > > > > to report that the GenBank issue is solved!!!!
> > > > > > As for EMBL parsing, I apologize for not providing the stack dump
> > > > > > for ISSUE #1. Here's the dump of the exception:
> > > > > > --------------------------------------------------------
> > > > > > org.biojava.bio.BioException: Could not read sequence
> > > > > >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > > > > >     at exonhit.parsers.GenBankParser.main(GenBankParser.java:359)
> > > > > > Caused by: java.lang.NumberFormatException: null
> > > > > >     at java.lang.Integer.parseInt(Integer.java:415)
> > > > > >     at java.lang.Integer.parseInt(Integer.java:497)
> > > > > >     at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:299)
> > > > > >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > > > > >     ... 1 more
> > > > > > Java Result: -1
> > > > > > -------------------------------------------------------
> > > > > > Here, again, is the code that I'm using to parse:
> > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > > > BufferedReader gbBR = null;
> > > > > > try {
> > > > > >     gbBR = new BufferedReader(new FileReader("C:\\Download\\ASN2BSML\\seth_06_02.emb"));
> > > > > > } catch (FileNotFoundException fnfe) {
> > > > > >     fnfe.printStackTrace();
> > > > > >     System.exit(-1);
> > > > > > }
> > > > > > Namespace gbNspace = (Namespace) RichObjectFactory.getObject(SimpleNamespace.class, new Object[]{"gbSpace"});
> > > > > > RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(gbBR, gbNspace);
> > > > > > while (gbSeqs.hasNext()) {
> > > > > >     try {
> > > > > >         RichSequence rs = gbSeqs.nextRichSequence();
> > > > > >         NCBITaxon myTaxon = rs.getTaxon();
> > > > > >     } catch (BioException be) {
> > > > > >         be.printStackTrace();
> > > > > >         System.exit(-1);
> > > > > >     }
> > > > > > }
> > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > > > And here's the EMBL file that I'm trying to parse:
> > > > > > +++++++++++++++++++++++++
> > > > > > [snip: the same two-record EMBL file (DQ472184, DQ472185) quoted in full above]
> > > > > > +++++++++++++++++++++++++++++++++
> > > > > >
> > > > > > It looks to me like there's some kind of problem with parsing the
> > > > > > sequence version number. I even tried the sequence from the test
> > > > > > directory (AY069118.em), with the same outcome.
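The "NumberFormatException: null" in the trace is consistent with Integer.parseInt being handed a null, for instance when an optional regex group did not match. As an illustration only (this is not the EMBLFormat code; the class name EmblVersion and the pattern are assumptions of mine), a guarded way to pull the version out of an SV line such as "SV DQ472184.1" could look like this:

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmblVersion {
    // SV line: tag, accession, a dot, then the numeric version (at most 9 digits).
    private static final Pattern SV = Pattern.compile("^SV\\s+\\S+\\.(\\d{1,9})\\s*$");

    /** Returns the version number, or empty if the line carries none. */
    static OptionalInt version(String svLine) {
        Matcher m = SV.matcher(svLine);
        if (!m.matches()) {
            return OptionalInt.empty(); // never pass a missing group to parseInt
        }
        return OptionalInt.of(Integer.parseInt(m.group(1)));
    }

    public static void main(String[] args) {
        System.out.println(version("SV DQ472184.1"));
        System.out.println(version("SV DQ472184"));
    }
}
```

Returning an empty value leaves it to the caller to pick a default version or reject the record, rather than failing inside the number parsing.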
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Seth
> > > > > >
> > > > > > On 6/2/06, Richard Holland <[EMAIL PROTECTED]> wrote:
> > > > > > > Hi Seth.
> > > > > > >
> > > > > > > Your second point, about the authors string not being read correctly in
> > > > > > > GenBank format, has been fixed (or should have been, if I got the code
> > > > > > > right!). Could you check the latest version of biojava-live out of CVS
> > > > > > > and give it another go? Basically the parser did not recognise the
> > > > > > > CONSRTM tag, as it is not mentioned in the sample record provided by
> > > > > > > NCBI, which is what I based the parser on.
> > > > > > >
> > > > > > > I've set it up now so that it reads the CONSRTM tag, but the value is
> > > > > > > merged with the authors tag with (consortium) appended. There will still
> > > > > > > be problems if the consortium value has commas in it - not sure how to
> > > > > > > fix this yet.
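To make the comma problem concrete, here is a small sketch. It assumes the merged string is later split on commas to recover individual names, which is a guess at the downstream behaviour rather than something stated above; the class ConsortiumMerge and both method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class ConsortiumMerge {
    /** Appends the consortium to the author list, tagged with "(consortium)". */
    static String merge(String authors, String consortium) {
        if (consortium == null || consortium.isEmpty()) {
            return authors;
        }
        String tagged = consortium + " (consortium)";
        return (authors == null || authors.isEmpty()) ? tagged : authors + ", " + tagged;
    }

    /** Naive comma split, standing in for whatever later separates the names. */
    static List<String> split(String merged) {
        List<String> names = new ArrayList<>();
        for (String part : merged.split(",")) {
            names.add(part.trim());
        }
        return names;
    }

    public static void main(String[] args) {
        // A consortium name without commas survives the round trip as one entry.
        System.out.println(split(merge("Smith J.", "Acme Genome Group")));
        // A name containing a comma is cut in two, and only the tail keeps the tag.
        System.out.println(split(merge("Smith J.", "Acme, Inc. Group")));
    }
}
```

One way around it might be to keep the consortium in its own field instead of merging, but that is a design question for the parser.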
> > > > > > >
> > > > > > > Your first point is harder to solve because you did not provide a
> > > > > > > complete stack trace for the exceptions you are getting. The complete
> > > > > > > stack trace would enable me to identify exactly where things are going
> > > > > > > wrong and give me a better chance of fixing them. Could you send the
> > > > > > > stack trace, and I'll see what I can do.
> > > > > > >
> > > > > > > cheers,
> > > > > > > Richard
> > > > > > >
> > > > > > >
> > > > > > > On Thu, 2006-06-01 at 18:03 -0400, Seth Johnson wrote:
> > > > > > > > Hi All,
> > > > > > > >
> > > > > > > > I'm a newbie to the whole BioJava(X) API and was hoping to get some
> > > > > > > > clarification on several issues that I'm having.
> > > > > > > > I am developing a parser that would take as input "NCBI Incremental
> > > > > > > > ASN.1 Sequence Updates to Genbank" files
> > > > > > > > ( ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ), gunzip them, and use the
> > > > > > > > ASN2GB converter
> > > > > > > > ( ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert
> > > > > > > > the resulting sequences to a format parsable by BioJava(X)
> > > > > > > > ( http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where
> > > > > > > > my problems start.
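The gunzip step of that pipeline can be done from Java with the standard library alone. The sketch below covers only decompression; the class name Gunzip is made up for illustration, and the asn2gb invocation is left out because its options are not given in this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class Gunzip {
    /** Streams the decompressed content of a gzip source into the sink. */
    static void gunzip(InputStream in, OutputStream out) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(in)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }

    /** In-memory convenience wrapper around the streaming version. */
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            gunzip(new ByteArrayInputStream(compressed), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Compresses bytes; only here so the example can round-trip without a file. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] packed = gzip("Seq-entry ::= set { }".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(gunzip(packed), StandardCharsets.UTF_8));
    }
}
```

For real files, the streaming method can be called with a FileInputStream on the downloaded update and a FileOutputStream on the path that is then handed to the converter.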
> > > > > > > >
> > > > > > > > ISSUE #1:
> > > > > > > > I've tried to parse all of the formats that ASN2GB outputs (GenBank
> > > > > > > > (default), EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML),
> > > > > > > > tiny seq (XML)) using either the BioJava or the BioJavaX API. Only the
> > > > > > > > GenBank format is recognized, by the
> > > > > > > > "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function, with
> > > > > > > > some exceptions that I'll describe in issue #2. This is the code that
> > > > > > > > I'm using to parse, for example, the EMBL output:
> > > > > > > >
> > > > > > > > BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb"));
> > > > > > > > Namespace gbNspace = (Namespace) RichObjectFactory.getObject(SimpleNamespace.class, new Object[]{"gbSpace"});
> > > > > > > > RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf, gbNspace);
> > > > > > > > while (gbSeqs.hasNext()) {
> > > > > > > >     try {
> > > > > > > >         RichSequence rs = gbSeqs.nextRichSequence();
> > > > > > > >         // Further processing of the RichSequence object from here
> > > > > > > >     } catch (BioException be) {
> > > > > > > >         be.printStackTrace();
> > > > > > > >     }
> > > > > > > > }
> > > > > > > >
> > > > > > > > The multi-sequence EMBL file looks like this:
> > > > > > > > ---------------------------------------------------------------------------------
> > > > > > > > [snip: the same two-record EMBL file (DQ472184, DQ472185) quoted in full above]
> > > > > > > > -----------------------------------------------------------------------
> > > > > > > > I get an exception message "Could Not Read Sequence". The same thing
> > > > > > > > happens if I use the readINSDSetDNA reader instead of the readEMBLDNA
> > > > > > > > one with the following INSDSet file (beginning of the file):
> > > > > > > >
> > > > > > > > <?xml version="1.0"?>
> > > > > > > > <!DOCTYPE INSDSeq PUBLIC "-//NCBI//INSD INSDSeq/EN"
> > > > > > > > "INSD_INSDSeq.dtd">
> > > > > > > > <INSDSeq>
> > > > > > > > <INSDSeq_locus>DQ022078</INSDSeq_locus>
> > > > > > > > <INSDSeq_length>16729</INSDSeq_length>
> > > > > > > > <INSDSeq_moltype>DNA</INSDSeq_moltype>
> > > > > > > > <INSDSeq_topology>linear</INSDSeq_topology>
> > > > > > > > <INSDSeq_division>ENV</INSDSeq_division>
> > > > > > > > <INSDSeq_update-date>15-MAY-2006</INSDSeq_update-date>
> > > > > > > > <INSDSeq_create-date>15-MAY-2006</INSDSeq_create-date>
> > > > > > > > <INSDSeq_definition>Uncultured bacterium WWRS-2005 putative
> > > > > > > > aminoglycoside phosphotransferase (a3.001), putative
> > > > > > > > oxidoreductase
> > > > > > > > (a3.002), putative oxidoreductase (a3.003), putative
> > > > > > > > beta-lactamase
> > > > > > > > class C (estA3), putative permease (a3.005), putative
> > > > > > > > transmembrane
> > > > > > > > signal peptide (a3.006), thiol-disulfide isomerase (a3.007),
> > > > > > > > histone
> > > > > > > > acetyltransferase HPA2 (a3.008), putative enzyme (a3.009),
> > > > > > > > putative
> > > > > > > > asparaginase (a3.010), hypothetical protein (a3.011),
> > > > > > > > hypothetical
> > > > > > > > protein (a3.012), putative membrane protease subunit (a3.013),
> > > > > > > > putative haloalkane dehalogenase (a3.014), putative
> > > > > > > > transcriptional
> > > > > > > > regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016),
> > > > > > > > and
> > > > > > > > hypothetical protein (a3.017) genes, complete
> > > > > > > > cds</INSDSeq_definition>
> > > > > > > >
> > > > > > > > <INSDSeq_primary-accession>DQ022078</INSDSeq_primary-accession>
> > > > > > > > <INSDSeq_other-seqids>
> > > > > > > > <INSDSeqid>gb|DQ022078.1|</INSDSeqid>
> > > > > > > > <INSDSeqid>gi|71842722</INSDSeqid>
> > > > > > > > </INSDSeq_other-seqids>
> > > > > > > > <INSDSeq_keywords>
> > > > > > > > <INSDKeyword>ENV</INSDKeyword>
> > > > > > > > </INSDSeq_keywords>
> > > > > > > > <INSDSeq_references>
> > > > > > > > <INSDReference>
> > > > > > > > <INSDReference_reference>?</INSDReference_reference>
> > > > > > > > <INSDReference_position>1..16729</INSDReference_position>
> > > > > > > > <INSDReference_authors>
> > > > > > > > <INSDAuthor>Schmeisser,C.</INSDAuthor>
> > > > > > > > <INSDAuthor>Elend,C.</INSDAuthor>
> > > > > > > > <INSDAuthor>Streit,W.R.</INSDAuthor>
> > > > > > > > </INSDReference_authors>
> > > > > > > > <INSDReference_title>Isolation and biochemical
> > > > > > > > characterization
> > > > > > > > of two novel metagenome derived esterases</INSDReference_title>
> > > > > > > > <INSDReference_journal>Appl. Environ. Microbiol. 0:0-0
> > > > > > > > (2006)</INSDReference_journal>
> > > > > > > > </INSDReference>
> > > > > > > > <INSDReference>
> > > > > > > > <INSDReference_reference>?</INSDReference_reference>
> > > > > > > > <INSDReference_position>1..16729</INSDReference_position>
> > > > > > > > <INSDReference_authors>
> > > > > > > > <INSDAuthor>Schmeisser,C.</INSDAuthor>
> > > > > > > > <INSDAuthor>Elend,C.</INSDAuthor>
> > > > > > > > <INSDAuthor>Streit,W.R.</INSDAuthor>
> > > > > > > > </INSDReference_authors>
> > > > > > > > <INSDReference_journal>Submitted (29-APR-2005) to the
> > > > > > > > EMBL/GenBank/DDBJ databases. Molekulare Enzymtechnologie,
> > > > > > > > University
> > > > > > > > Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057,
> > > > > > > > Germany</INSDReference_journal>
> > > > > > > > </INSDReference>
> > > > > > > > </INSDSeq_references>
> > > > > > > >
> > > > > > > > So my question is whether ASN2GB produces output that's
> > > > > > > > incompatible with the BioJava parsers, whether there is a
> > > > > > > > problem with the sequences themselves, or whether the problem
> > > > > > > > lies with the majority of the parsers.
> > > > > > > > Could it be that I'm using the API incorrectly for the above
> > > > > > > > formats? The GenBank parser does work as advertised, with the
> > > > > > > > exceptions described below:
> > > > > > > >
> > > > > > > > ISSUE #2:
> > > > > > > > When I try to parse GenBank files using the following code:
> > > > > > > >
> > > > > > > > BufferedReader inBuf = new BufferedReader(
> > > > > > > >     new FileReader("genbank_output.gb"));
> > > > > > > > Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
> > > > > > > >     SimpleNamespace.class, new Object[]{"gbSpace"});
> > > > > > > > RichSequenceIterator gbSeqs =
> > > > > > > >     RichSequence.IOTools.readGenbankDNA(inBuf, gbNspace);
> > > > > > > > while (gbSeqs.hasNext()) {
> > > > > > > >     try {
> > > > > > > >         RichSequence rs = gbSeqs.nextRichSequence();
> > > > > > > >         // Further processing of the RichSequence object from here
> > > > > > > >     } catch (BioException be) {
> > > > > > > >         be.printStackTrace();
> > > > > > > >     }
> > > > > > > > }
> > > > > > > >
> > > > > > > > Genbank file in question:
> > > > > > > >
> > > > > > > > LOCUS BC074905 838 bp mRNA linear
> > > > > > > > PRI 15-APR-2006
> > > > > > > > DEFINITION Homo sapiens kallikrein 14, mRNA (cDNA clone
> > > > > > > > MGC:104038
> > > > > > > > IMAGE:30915482), complete cds.
> > > > > > > > ACCESSION BC074905
> > > > > > > > VERSION BC074905.2 GI:50959825
> > > > > > > > KEYWORDS MGC.
> > > > > > > > SOURCE Homo sapiens (human)
> > > > > > > > ORGANISM Homo sapiens
> > > > > > > > Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
> > > > > > > > Euteleostomi;
> > > > > > > > Mammalia; Eutheria; Euarchontoglires; Primates;
> > > > > > > > Haplorrhini;
> > > > > > > > Catarrhini; Hominidae; Homo.
> > > > > > > > REFERENCE 1 (bases 1 to 838)
> > > > > > > > AUTHORS Strausberg,R.L., Feingold,E.A., Grouse,L.H.,
> > > > > > > > Derge,J.G.,
> > > > > > > > Klausner,R.D., Collins,F.S., Wagner,L.,
> > > > > > > > Shenmen,C.M., Schuler,G.D.,
> > > > > > > > Altschul,S.F., Zeeberg,B., Buetow,K.H.,
> > > > > > > > Schaefer,C.F., Bhat,N.K.,
> > > > > > > > Hopkins,R.F., Jordan,H., Moore,T., Max,S.I.,
> > > > > > > > Wang,J., Hsieh,F.,
> > > > > > > > Diatchenko,L., Marusina,K., Farmer,A.A.,
> > > > > > > > Rubin,G.M., Hong,L.,
> > > > > > > > Stapleton,M., Soares,M.B., Bonaldo,M.F.,
> > > > > > > > Casavant,T.L.,
> > > > > > > > Scheetz,T.E., Brownstein,M.J., Usdin,T.B.,
> > > > > > > > Toshiyuki,S.,
> > > > > > > > Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A.,
> > > > > > > > Peters,G.J.,
> > > > > > > > Abramson,R.D., Mullahy,S.J., Bosak,S.A.,
> > > > > > > > McEwan,P.J.,
> > > > > > > > McKernan,K.J., Malek,J.A., Gunaratne,P.H.,
> > > > > > > > Richards,S.,
> > > > > > > > Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J.,
> > > > > > > > Hulyk,S.W.,
> > > > > > > > Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X.,
> > > > > > > > Gibbs,R.A.,
> > > > > > > > Fahey,J., Helton,E., Ketteman,M., Madan,A.,
> > > > > > > > Rodrigues,S.,
> > > > > > > > Sanchez,A., Whiting,M., Madan,A., Young,A.C.,
> > > > > > > > Shevchenko,Y.,
> > > > > > > > Bouffard,G.G., Blakesley,R.W., Touchman,J.W.,
> > > > > > > > Green,E.D.,
> > > > > > > > Dickson,M.C., Rodriguez,A.C., Grimwood,J.,
> > > > > > > > Schmutz,J., Myers,R.M.,
> > > > > > > > Butterfield,Y.S., Krzywinski,M.I., Skalska,U.,
> > > > > > > > Smailus,D.E.,
> > > > > > > > Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A.
> > > > > > > > CONSRTM Mammalian Gene Collection Program Team
> > > > > > > > TITLE Generation and initial analysis of more than 15,000
> > > > > > > > full-length
> > > > > > > > human and mouse cDNA sequences
> > > > > > > > JOURNAL Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903
> > > > > > > > (2002)
> > > > > > > > PUBMED 12477932
> > > > > > > > REFERENCE 2 (bases 1 to 838)
> > > > > > > > CONSRTM NIH MGC Project
> > > > > > > > TITLE Direct Submission
> > > > > > > > JOURNAL Submitted (25-JUN-2004) National Institutes of
> > > > > > > > Health, Mammalian
> > > > > > > > Gene Collection (MGC), Bethesda, MD 20892-2590, USA
> > > > > > > > REMARK NIH-MGC Project URL: http://mgc.nci.nih.gov
> > > > > > > > COMMENT On Aug 4, 2004 this sequence version replaced
> > > > > > > > gi:49901832.
> > > > > > > > Contact: MGC help desk
> > > > > > > > Email: [EMAIL PROTECTED]
> > > > > > > > Tissue Procurement: Genome Sequence Centre, British
> > > > > > > > Columbia Cancer
> > > > > > > > Center
> > > > > > > > cDNA Library Preparation: British Columbia Cancer
> > > > > > > > Research Center
> > > > > > > > cDNA Library Arrayed by: The I.M.A.G.E. Consortium
> > > > > > > > (LLNL)
> > > > > > > > DNA Sequencing by: Genome Sequence Centre,
> > > > > > > > BC Cancer Agency, Vancouver, BC, Canada
> > > > > > > > [EMAIL PROTECTED]
> > > > > > > > Martin Hirst, Thomas Zeng, Ryan Morin, Michelle
> > > > > > > > Moksa, Johnson
> > > > > > > > Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric
> > > > > > > > Chuah, Allen
> > > > > > > > Delaney, Rob Kirkpatrick, Agnes Baross, Sarah
> > > > > > > > Barber, Mabel
> > > > > > > > Brown-John, Steve S. Chand, William Chow, Ryan
> > > > > > > > Babakaiff, Dave
> > > > > > > > Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson,
> > > > > > > > Luis delRio, Ruth
> > > > > > > > Featherstone, Malachi Griffith, Obi Griffith, Ran
> > > > > > > > Guin, Nancy Liao,
> > > > > > > > Kim MacDonald, Mike R. Mayo, Josh Moran, Diana
> > > > > > > > Palmquist, JR
> > > > > > > > Santos, Duane Smailus, Jeff Stott, Miranda Tsai,
> > > > > > > > George Yang,
> > > > > > > > Jacquie Schein, Asim Siddiqui,Steven Jones, Rob
> > > > > > > > Holt, Marco Marra.
> > > > > > > >
> > > > > > > > Clone distribution: MGC clone distribution
> > > > > > > > information can be found
> > > > > > > > through the I.M.A.G.E. Consortium/LLNL at:
> > > > > > > > http://image.llnl.gov
> > > > > > > > Series: IRBU Plate: 4 Row: C Column: 3.
> > > > > > > >
> > > > > > > > Differences found between this sequence and the
> > > > > > > > human reference
> > > > > > > > genome (build 36) are described in misc_difference
> > > > > > > > features below.
> > > > > > > > FEATURES Location/Qualifiers
> > > > > > > > source 1..838
> > > > > > > > /organism="Homo sapiens"
> > > > > > > > /mol_type="mRNA"
> > > > > > > > /db_xref="taxon:9606"
> > > > > > > > /clone="MGC:104038 IMAGE:30915482"
> > > > > > > > /tissue_type="Lung, PCR rescued clones"
> > > > > > > > /clone_lib="NIH_MGC_273"
> > > > > > > > /lab_host="DH10B"
> > > > > > > > /note="Vector: pCR4 Topo TA with reversed
> > > > > > > > insert"
> > > > > > > > gene 1..838
> > > > > > > > /gene="KLK14"
> > > > > > > > /note="synonym: KLK-L6"
> > > > > > > > /db_xref="GeneID:43847"
> > > > > > > > /db_xref="HGNC:6362"
> > > > > > > > /db_xref="IMGT/GENE-DB:6362"
> > > > > > > > /db_xref="MIM:606135"
> > > > > > > > CDS 49..804
> > > > > > > > /gene="KLK14"
> > > > > > > > /codon_start=1
> > > > > > > > /product="KLK14 protein"
> > > > > > > > /protein_id="AAH74905.1"
> > > > > > > > /db_xref="GI:50959826"
> > > > > > > > /db_xref="GeneID:43847"
> > > > > > > > /db_xref="HGNC:6362"
> > > > > > > > /db_xref="IMGT/GENE-DB:6362"
> > > > > > > > /db_xref="MIM:606135"
> > > > > > > >
> > > > > > > > /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA
> > > > > > > >
> > > > > > > > GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN
> > > > > > > >
> > > > > > > > YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA
> > > > > > > >
> > > > > > > > SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV
> > > > > > > > SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK"
> > > > > > > > misc_difference 98
> > > > > > > > /gene="KLK14"
> > > > > > > > /note="'G' in cDNA is 'A' in the human
> > > > > > > > genome; amino acid
> > > > > > > > difference: 'R' in cDNA, 'Q' in the human
> > > > > > > > genome."
> > > > > > > > misc_difference 133
> > > > > > > > /gene="KLK14"
> > > > > > > > /note="'T' in cDNA is 'C' in the human
> > > > > > > > genome; amino acid
> > > > > > > > difference: 'Y' in cDNA, 'H' in the human
> > > > > > > > genome."
> > > > > > > > ORIGIN
> > > > > > > > 1 atgtccctga gggtcttggg ctctgggacc tggccctcag
> > > > > > > > cccctaaaat gttcctcctg
> > > > > > > > 61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga
> > > > > > > > gccaagagga tgagaacaag
> > > > > > > > 121 ataattggtg gctatacgtg cacccggagc tcccagccgt
> > > > > > > > ggcaggcggc cctgctggcg
> > > > > > > > 181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt
> > > > > > > > caggccagtg ggtcatcact
> > > > > > > > 241 gctgctcact gcggccgccc gatccttcag gttgccctgg
> > > > > > > > gcaagcacaa cctgaggagg
> > > > > > > > 301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg
> > > > > > > > tgacgcaccc caactacaac
> > > > > > > > 361 tcccggaccc acgacaacga cctcatgctg ctgcagctac
> > > > > > > > agcagcccgc acggatcggg
> > > > > > > > 421 agggcagtca ggcccattga ggtcacccag gcctgtgcca
> > > > > > > > gccccgggac ctcctgccga
> > > > > > > > 481 gtgtcaggct ggggaactat atccagcccc atcgccaggt
> > > > > > > > accccgcctc tctgcaatgc
> > > > > > > > 541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg
> > > > > > > > cctatcctag aaccatcacg
> > > > > > > > 601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg
> > > > > > > > actcttgtca gggtgactct
> > > > > > > > 661 gggggacccc tggtgtgcag aggacagctc cagggcctcg
> > > > > > > > tgtcttgggg aatggagcgc
> > > > > > > > 721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt
> > > > > > > > gcaagtacag aagctggatt
> > > > > > > > 781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg
> > > > > > > > atggacctcg tcagctgc
> > > > > > > > //
> > > > > > > >
> > > > > > > > I get the following exception:
> > > > > > > >
> > > > > > > > java.lang.IllegalArgumentException: Authors string cannot be null
> > > > > > > > org.biojava.bio.BioException: Could not read sequence
> > > > > > > >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > > > > > > >     at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107)
> > > > > > > >     at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258)
> > > > > > > >     at exonhit.parsers.GenBankParser.main(GenBankParser.java:341)
> > > > > > > > Caused by: java.lang.IllegalArgumentException: Authors string cannot be null
> > > > > > > >     at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76)
> > > > > > > >     at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356)
> > > > > > > >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > > > > > > >
> > > > > > > > -----------------------------------------------------------------------
> > > > > > > >
> > > > > > > > I'm trying to see what could be the problem with this particular
> > > > > > > > sequence. Looks to me like the AUTHORS portion is not getting
> > > > > > > > parsed
> > > > > > > > correctly. Any ideas would be greatly appreciated!
> > > > > > > >
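One thing that stands out in the record above: REFERENCE 2 has a CONSRTM line but no AUTHORS line at all, which would be consistent with the "Authors string cannot be null" message. I have not confirmed that this is what GenbankFormat trips on. To find other records with the same shape before feeding them to the parser, a minimal sketch along these lines might help. It uses only the standard library; the class and method names are made up for illustration and are not part of BioJava. It also assumes that keywords can be matched after trimming leading whitespace, which is lenient enough for files whose indentation was mangled in transit.

```java
import java.util.ArrayList;
import java.util.List;

public class ReferenceAuthorCheck {

    /**
     * Returns the REFERENCE header line of every reference block that
     * contains no AUTHORS line. Hypothetical helper, not a BioJava API.
     */
    public static List<String> referencesWithoutAuthors(List<String> lines) {
        List<String> missing = new ArrayList<>();
        String current = null;      // header of the reference block being scanned
        boolean hasAuthors = false; // whether that block has shown an AUTHORS line

        for (String line : lines) {
            String t = line.trim();
            if (t.startsWith("REFERENCE")) {
                // A new block starts; close the previous one first.
                if (current != null && !hasAuthors) {
                    missing.add(current);
                }
                current = t;
                hasAuthors = false;
            } else if (current != null && t.startsWith("AUTHORS")) {
                hasAuthors = true;
            } else if (current != null
                    && (t.startsWith("COMMENT") || t.startsWith("FEATURES")
                        || t.startsWith("ORIGIN") || t.equals("//"))) {
                // Any of these keywords ends the run of reference blocks.
                if (!hasAuthors) {
                    missing.add(current);
                }
                current = null;
            }
        }
        // The input may end while a reference block is still open.
        if (current != null && !hasAuthors) {
            missing.add(current);
        }
        return missing;
    }

    public static void main(String[] args) {
        // Abbreviated copy of the two reference blocks from the record above.
        List<String> record = List.of(
            "LOCUS BC074905 838 bp mRNA linear PRI 15-APR-2006",
            "REFERENCE 1 (bases 1 to 838)",
            "  AUTHORS Strausberg,R.L., Feingold,E.A.",
            "  CONSRTM Mammalian Gene Collection Program Team",
            "REFERENCE 2 (bases 1 to 838)",
            "  CONSRTM NIH MGC Project",
            "  TITLE Direct Submission",
            "COMMENT On Aug 4, 2004 this sequence version replaced gi:49901832.",
            "//");
        System.out.println(referencesWithoutAuthors(record));
    }
}
```

Run against the full file, this only reports which references lack an AUTHORS line; whether the parser should tolerate such references is a separate question.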
> > > > > > > --
> > > > > > > Richard Holland (BioMart Team)
> > > > > > > EMBL-EBI
> > > > > > > Wellcome Trust Genome Campus
> > > > > > > Hinxton
> > > > > > > Cambridge CB10 1SD
> > > > > > > UNITED KINGDOM
> > > > > > > Tel: +44-(0)1223-494416
> > > > > > >
> > > > > > >
--
Best Regards,
Seth Johnson
Senior Bioinformatics Associate
Ph: (202) 470-0900
Fx: (775) 251-0358
_______________________________________________
Biojava-l mailing list - [email protected]
http://lists.open-bio.org/mailman/listinfo/biojava-l