Hi, I've CC'd this to the BioSQL mailing list for cross-project discussion.

On Mon, Apr 12, 2010 at 7:57 AM, Richard Holland wrote:
> Thanks Deepak.
>
> I've had a look at the code and I believe its due to the
> different ways in which BioJava and BioPerl load the
> taxon table.
>
> BioJava sets the ncbi_taxon_id and parent_taxon_id
> columns based on the values from the NCBI taxonomy
> file. The taxon_id column in BioJava is a meaningless
> auto-generated value that is never used.
>
> BioPerl however is generating taxon_id values and
> linking them by setting parent_taxon_id to the
> generated value. The parent value from the NCBI
> taxonomy file is therefore replaced with the BioPerl
> generated parent ID, meaning that instead of linking
> from parent_taxon_id to ncbi_taxon_id as per BioJava,
> the link is to taxon_id instead. (I'm basing this
> comment on looking at load_ncbi_taxonomy.pl from
> the BioSQL archives.)

Note that old versions of load_ncbi_taxonomy.pl (which is part of
BioSQL, not part of BioPerl) would set taxon_id equal to
ncbi_taxon_id, see:
http://bugzilla.open-bio.org/show_bug.cgi?id=2470

With those old versions the two linking schemes produce identical
tables, which may help explain the confusion.

> I believe if you load the taxonomy table using BioJava,
> you should see BioJava giving correct behaviour.
> Likewise if you load it using BioPerl, BioPerl will
> behave correctly. But if you load with one then query
> with the other, you'll get incorrect results.
>
> This sounds like a case for discussion on both lists -
> a matter of standardisation between the two projects.
> Not quickly/easily solvable for now.

It's not just two projects (BioPerl & BioJava) (grin). It's at least
five: those two, plus BioSQL itself, BioRuby and Biopython.

I'm not sure about BioRuby's implementation, but currently I think
BioJava is the odd one out - BioPerl, Biopython, and BioSQL's
load_ncbi_taxonomy.pl all make entries in parent_taxon_id reference
the automatically generated taxon_id (please correct me if I am
wrong).

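To make the mismatch concrete, here is a rough Python sketch of the two loading conventions as I understand them from the discussion above. It is not code from any of the Bio* projects: the taxon table is cut down to three columns, the node IDs are made-up toy values rather than real NCBI data, and the function names (make_db, load_ncbi_style, load_internal_style, lineage) are hypothetical.

```python
import sqlite3

# Toy (ncbi_taxon_id, ncbi_parent_id) pairs; NOT real NCBI taxonomy data.
NODES = [(100, None), (200, 100), (300, 200)]


def make_db():
    """Create an in-memory database with a simplified taxon table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE taxon ("
        " taxon_id INTEGER PRIMARY KEY,"   # auto-generated: 1, 2, 3, ...
        " ncbi_taxon_id INTEGER UNIQUE,"
        " parent_taxon_id INTEGER)"
    )
    return conn


def load_ncbi_style(conn):
    """parent_taxon_id holds the parent's ncbi_taxon_id
    (the BioJava behaviour described above)."""
    conn.executemany(
        "INSERT INTO taxon (ncbi_taxon_id, parent_taxon_id) VALUES (?, ?)",
        NODES,
    )


def load_internal_style(conn):
    """parent_taxon_id holds the parent's auto-generated taxon_id
    (the load_ncbi_taxonomy.pl behaviour described above)."""
    conn.executemany(
        "INSERT INTO taxon (ncbi_taxon_id) VALUES (?)",
        [(ncbi_id,) for ncbi_id, _ in NODES],
    )
    # Map each NCBI ID to the taxon_id the database generated for it.
    generated = dict(conn.execute("SELECT ncbi_taxon_id, taxon_id FROM taxon"))
    for ncbi_id, ncbi_parent in NODES:
        if ncbi_parent is not None:
            conn.execute(
                "UPDATE taxon SET parent_taxon_id = ? WHERE ncbi_taxon_id = ?",
                (generated[ncbi_parent], ncbi_id),
            )


def lineage(conn, ncbi_id, parent_points_to):
    """Walk from ncbi_id up to the root, returning NCBI IDs along the way.

    parent_points_to names the column that the reader assumes
    parent_taxon_id references: "taxon_id" or "ncbi_taxon_id".
    """
    if parent_points_to not in ("taxon_id", "ncbi_taxon_id"):
        raise ValueError(parent_points_to)
    row = conn.execute(
        "SELECT ncbi_taxon_id, parent_taxon_id FROM taxon"
        " WHERE ncbi_taxon_id = ?",
        (ncbi_id,),
    ).fetchone()
    path = []
    while row is not None:
        path.append(row[0])
        if row[1] is None:
            break
        row = conn.execute(
            "SELECT ncbi_taxon_id, parent_taxon_id FROM taxon"
            f" WHERE {parent_points_to} = ?",
            (row[1],),
        ).fetchone()
    return path


if __name__ == "__main__":
    ncbi_db = make_db()
    load_ncbi_style(ncbi_db)
    internal_db = make_db()
    load_internal_style(internal_db)
    # Matching loader and reader recover the full lineage.
    print(lineage(ncbi_db, 300, "ncbi_taxon_id"))
    print(lineage(internal_db, 300, "taxon_id"))
    # Mismatched loader and reader lose the parents.
    print(lineage(internal_db, 300, "ncbi_taxon_id"))
```

In this toy case a reader that makes the same assumption as the loader walks all the way to the root, while a mismatched reader stops at the starting node because the parent lookup finds no row. With real data the failure could be worse, since a generated taxon_id may happen to coincide with some unrelated ncbi_taxon_id.
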
My personal view is that bioperl-db is the reference implementation
and should be followed in the event of any ambiguity within BioSQL.
In this particular case, there is actually a BioSQL script to check
against too (load_ncbi_taxonomy.pl).

Hopefully Hilmar can give us an official verdict...

Peter
