I've run the sequence through the parser and it seems to work OK. I iterate through the features and then iterate through the annotations of each feature.
Based on the input....

FT   source          1..118
FT                   /organism="Triturus helveticus"
FT                   /mol_type="genomic DNA"
FT                   /clone="Thel.b9"
FT                   /db_xref="taxon:256425"
FT   gene            <1..>118
FT                   /gene="Hoxb9"
FT                   /note="Hoxb-9"
FT   mRNA            <1..>118
FT                   /gene="Hoxb9"
FT                   /product="HOXB9"
FT   CDS             <1..>118
FT                   /codon_start=2
FT                   /gene="Hoxb9"
FT                   /product="HOXB9"
FT                   /db_xref="UniProtKB/TrEMBL:Q2LK47"
FT                   /protein_id="ABA39736.1"
FT                   /translation="KYQTLELEKEFLFNMYLTRDRRHEVARLLNLSERQVKIW"

The output is....

========================================
Feature: (#0) lcl:DQ158013/DQ158013.1:source,EMBL(1..118)
Note: (#0) biojavax:mol_type: genomic DNA
Note: (#1) biojavax:clone: Thel.b9
========================================
Feature: (#1) lcl:DQ158013/DQ158013.1:gene,EMBL(<1..118>)
Note: (#2) biojavax:gene: Hoxb9
Note: (#3) biojavax:note: Hoxb-9
========================================
Feature: (#2) lcl:DQ158013/DQ158013.1:mRNA,EMBL(<1..118>)
Note: (#4) biojavax:gene: Hoxb9
Note: (#5) biojavax:product: HOXB9
========================================
Feature: (#3) lcl:DQ158013/DQ158013.1:CDS,EMBL(<1..118>)
Note: (#6) biojavax:codon_start: 2
Note: (#7) biojavax:gene: Hoxb9
Note: (#8) biojavax:product: HOXB9
Note: (#9) biojavax:protein_id: ABA39736.1
Note: (#10) biojavax:translation: KYQTLELEKEFLFNMYLTRDRRHEVARLLNLSERQVKIW
Note: (#11) biojavax:translation: KYQTLELEKEFLFNMYLTRDRRHEVARLLNLSERQVKIW
=============================================

This looks OK, the one thing I've just noticed is that the last piece of annotation of the last feature is assigned twice.

Jolyon

-----Original Message-----
From: Richard Holland [mailto:[EMAIL PROTECTED]
Sent: 20 April 2006 13:05
To: [EMAIL PROTECTED]
Cc: Jolyon Holdstock; [EMAIL PROTECTED]
Subject: Re: [Biojava-l] [biojavax] EMBL parser : features parsing[Scanned]

Hi.

I made some small changes to the code, although nothing that would fix
this kind of problem, committed it back to CVS, checked it out again,
compiled, and ran a test program that read in an EMBL file with the
feature table you describe below, and output it in EMBL format to
another file. I then compared the two files... and found no differences!
The split-on-equals problem didn't occur, and all notes appeared
alongside their correct features.

Could there be a problem maybe with the script you are using? I've
really no idea what the problem is as I can't reproduce it based on the
current CVS contents!

cheers,
Richard

On Thu, 2006-04-20 at 11:35 +0200, Morgane THOMAS-CHOLLIER wrote:
> Hi,
>
> I have tested today's version from CVS.
>
> Both EBI and Ensembl files now react the same way.
> The last annotation of a feature is nevertheless related to its
> immediate following feature.
> e.g. :
>
> FT   gene            <1..>118
> FT                   /gene="Hoxb9"
> FT                   /note="Hoxb-9"
> FT   mRNA            <1..>118
> FT                   /gene="Hoxb9"
> FT                   /product="HOXB9"
> FT   CDS             <1..>118
>
> /note="Hoxb-9" is related to mRNA
> /product="HOXB9" is related to CDS
>
> Concerning the split-on-equals problem, I still observe the problem :
>
> [(#2) biojavax:note: transcript_i]
>
> for this annotation : /note="transcript_id=ENSMUST00000048680"
>
> Thanks for helping,
>
> Cheers,
>
> Morgane.
>
> Richard Holland wrote:
> > I have committed an UNTESTED patch based on Jolyon's suggestion, and
> > also attempted to fix the split-on-equals problem Morgane observed.
> >
> > Please let me know if there are any problems with it.
> >
> > As this problem affected the UniProt parser in a similar manner (much of
> > the code is identical), the same fixes were applied there too.
> >
> > cheers,
> > Richard
> >
> > On Thu, 2006-04-13 at 17:42 +0100, Jolyon Holdstock wrote:
> >
> >> Hi Morgane,
> >>
> >> I have amended the EmblFormat readSection method as below and the
> >> parsing seems to work; please test it.
> >>
> >> I think that the last bit of annotation is carried over into the next
> >> feature so before adding the new feature I dump the annotation and reset
> >> currentTag and currentVal.
> >>
> >> if (!line.startsWith(" ")) {
> >>     //--------- new code starts ---------------------------
> >>     if (currentTag!=null) {
> >>         section.add(new String[]{currentTag,currentVal.toString()});
> >>         currentTag = null;
> >>         currentVal = null;
> >>     }
> >>     //--------- new code ends -----------------------------
> >>     // case 1 : word value - splits into key-value on its own
> >>     section.add(line.split("\\s+"));
> >> }
> >>
> >> Cheers,
> >>
> >> Jolyon
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: [EMAIL PROTECTED]
> >> [mailto:[EMAIL PROTECTED] On Behalf Of Morgane
> >> THOMAS-CHOLLIER
> >> Sent: 12 April 2006 09:35
> >> To: [EMAIL PROTECTED]
> >> Subject: [Biojava-l] [biojavax] EMBL parser : features parsing[Scanned]
> >>
> >> Hello again,
> >>
> >> I am currently using biojavax to parse EMBL files exported from Ensembl
> >> website.
> >>
> >> Compared to the EBI files I have, they show a difference in the Features
> >> lines :
> >>
> >> sometimes, only one "/word" is present. ie:
> >>
> >> EBI file :
> >>
> >> FT   gene            <1..>118
> >> FT                   /gene="Hoxb9"
> >> FT                   /note="Hoxb-9"
> >>
> >> Ensembl file;
> >>
> >> FT   gene            complement(1..3218)
> >> FT                   /gene="ENSMUSG00000038227"
> >>
> >> The problem I encounter is that the parser correctly convert the "/word"
> >> into a Note, but the Note is then in relation with the immediate
> >> following feature (ie: mRNA).
> >> The current gene feature thus has no annotation.
> >>
> >> This behavior is reproducible when removing one "/word" of an EBI file.
> >>
> >> Apart from this issue, I noted that Ensembl EMBL files uses "=" inside a
> >> feature (ie: /note="transcript_id=ENSMUST00000048680") which ends up
> >> with an incomplete Note, as the parser seems to split on "=" to separate
> >> the Key and the Value.
> >>
> >> Thanks for your help,
> >>
> >> Morgane.
> >>
> >>
>
--
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416

_______________________________________________
Biojava-l mailing list - [email protected]
http://lists.open-bio.org/mailman/listinfo/biojava-l
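
For anyone following the thread, the two fixes discussed here (dumping the pending annotation before a new feature starts, and splitting a qualifier on its first "=" only) can be illustrated with a small standalone sketch. This is not the BioJavaX EmblFormat code: the class FeatureTableSketch and its helper methods are hypothetical names made up for illustration, and the sketch assumes the "FT" prefix has already been stripped, so that feature lines start in column 0 and qualifier lines start with a space. Continuation lines are simply appended, which is a simplification. Whether a repeated flush is what causes the duplicated translation note reported at the top of this message is not something the sketch can confirm; it only shows a loop in which each pending qualifier is emitted exactly once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the BioJavaX implementation.
final class FeatureTableSketch {

    // Split "/key=value" on the FIRST '=' only, so '=' inside the value survives.
    static String[] splitOnFirstEquals(String qualifier) {
        String body = qualifier.startsWith("/") ? qualifier.substring(1) : qualifier;
        int eq = body.indexOf('=');
        if (eq < 0) {
            return new String[] {body, ""}; // qualifier without a value
        }
        return new String[] {body.substring(0, eq), body.substring(eq + 1)};
    }

    // Remove one pair of surrounding double quotes, if present.
    static String unquote(String value) {
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            return value.substring(1, value.length() - 1);
        }
        return value;
    }

    // Assumes the "FT" prefix was removed: feature lines start in column 0,
    // qualifier and continuation lines start with a space.
    static List<String[]> readSection(List<String> lines) {
        List<String[]> section = new ArrayList<>();
        String currentTag = null;
        StringBuilder currentVal = null;
        for (String line : lines) {
            String text = line.trim();
            if (text.isEmpty()) {
                continue;
            }
            if (!line.startsWith(" ")) {
                // New feature: dump the pending qualifier first so it stays
                // with the feature it belongs to, then reset the state.
                if (currentTag != null) {
                    section.add(new String[] {currentTag, unquote(currentVal.toString())});
                    currentTag = null;
                    currentVal = null;
                }
                section.add(text.split("\\s+"));
            } else if (text.startsWith("/")) {
                // New qualifier: dump the previous one, then start this one.
                if (currentTag != null) {
                    section.add(new String[] {currentTag, unquote(currentVal.toString())});
                }
                String[] kv = splitOnFirstEquals(text);
                currentTag = kv[0];
                currentVal = new StringBuilder(kv[1]);
            } else if (currentVal != null) {
                // Continuation of a multi-line value (appended without a separator).
                currentVal.append(text);
            }
        }
        // End of section: emit the last pending qualifier exactly once.
        if (currentTag != null) {
            section.add(new String[] {currentTag, unquote(currentVal.toString())});
        }
        return section;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "gene            complement(1..3218)",
            "                /gene=\"ENSMUSG00000038227\"",
            "mRNA            <1..>118",
            "                /note=\"transcript_id=ENSMUST00000048680\"");
        for (String[] entry : readSection(lines)) {
            System.out.println(Arrays.toString(entry));
        }
    }
}
```

The main method feeds in a gene feature with a single qualifier followed by an mRNA feature, which is the Ensembl-style case described in the original report, and prints each parsed entry in order so it is easy to see which feature each qualifier follows.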
