In my experience, we should use taxon_id here, because it is the unique
key within a local BioSQL instance. ncbi_taxon_id should be used for
mapping only, so that a person can map their local taxon_id to an
ncbi_taxon_id; otherwise it defeats the purpose of having taxon_id as the
primary key of the taxon table. As I understand it, the main goal when
BioSQL was designed was to make it independent of any other organization
such as GenBank or NCBI. ncbi_taxon_id is a feature that lets us map a
number given by a known authority to a local number (taxon_id).
Deepak Sheoran
On 4/15/2010 12:54 PM, Peter wrote:
Hi,
I've CC'd this to the BioSQL mailing list for cross project
discussion.
On Mon, Apr 12, 2010 at 7:57 AM, Richard Holland wrote:
Thanks Deepak.
I've had a look at the code and I believe it's due to the
different ways in which BioJava and BioPerl load the
taxon table.
BioJava sets the ncbi_taxon_id and parent_taxon_id
columns based on the values from the NCBI taxonomy
file. The taxon_id column in BioJava is a meaningless
auto-generated value that is never used.
BioPerl however is generating taxon_id values and
linking them by setting parent_taxon_id to the
generated value. The parent value from the NCBI
taxonomy file is therefore replaced with the BioPerl
generated parent ID, meaning that instead of linking
from parent_taxon_id to ncbi_taxon_id as per BioJava,
the link is to taxon_id instead. (I'm basing this
comment on looking at load_ncbi_taxonomy.pl from
the BioSQL archives.)
Note that old versions of load_ncbi_taxonomy.pl
(which is part of BioSQL, not part of BioPerl) would
set taxon_id equal to ncbi_taxon_id, see:
http://bugzilla.open-bio.org/show_bug.cgi?id=2470
This may help explain the confusion.
I believe if you load the taxonomy table using BioJava,
you should see BioJava giving correct behaviour.
Likewise if you load it using BioPerl, BioPerl will
behave correctly. But if you load with one then query
with the other, you'll get incorrect results.
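The mismatch described above can be sketched with a toy table. This is only an illustration under stated assumptions: the three-column taxon table is a simplification of the real BioSQL table, the function names (make_db, load_ncbi_style, load_local_style, parent_ncbi_id) are hypothetical, and neither loader reproduces actual BioJava or BioPerl code. It only mimics the two conventions for parent_taxon_id as they are described in this thread.

```python
import sqlite3

# Toy slice of the NCBI taxonomy as (ncbi_taxon_id, NCBI parent id).
# 9604 = Hominidae, 9605 = Homo, 9606 = Homo sapiens; 9604 is treated
# as the root here purely to keep the example small.
NCBI_NODES = [(9604, None), (9605, 9604), (9606, 9605)]


def make_db():
    """Create an in-memory, simplified stand-in for the taxon table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE taxon ("
        " taxon_id INTEGER PRIMARY KEY,"
        " ncbi_taxon_id INTEGER UNIQUE,"
        " parent_taxon_id INTEGER)"
    )
    return conn


def load_ncbi_style(conn):
    """parent_taxon_id stores the parent's ncbi_taxon_id."""
    for local_id, (ncbi_id, ncbi_parent) in enumerate(NCBI_NODES, start=1):
        conn.execute(
            "INSERT INTO taxon VALUES (?, ?, ?)",
            (local_id, ncbi_id, ncbi_parent),
        )


def load_local_style(conn):
    """parent_taxon_id stores the parent's generated taxon_id."""
    # Assign generated ids first so parents can be translated.
    local = {ncbi_id: i for i, (ncbi_id, _) in enumerate(NCBI_NODES, start=1)}
    for ncbi_id, ncbi_parent in NCBI_NODES:
        conn.execute(
            "INSERT INTO taxon VALUES (?, ?, ?)",
            (local[ncbi_id], ncbi_id, local.get(ncbi_parent)),
        )


def parent_ncbi_id(conn, ncbi_id, join_column):
    """Look up the parent's NCBI id, resolving parent_taxon_id against
    join_column ("taxon_id" or "ncbi_taxon_id")."""
    if join_column not in ("taxon_id", "ncbi_taxon_id"):
        raise ValueError("unexpected join column: %r" % join_column)
    row = conn.execute(
        "SELECT p.ncbi_taxon_id FROM taxon c "
        "JOIN taxon p ON c.parent_taxon_id = p." + join_column + " "
        "WHERE c.ncbi_taxon_id = ?",
        (ncbi_id,),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    for loader in (load_ncbi_style, load_local_style):
        conn = make_db()
        loader(conn)
        for column in ("ncbi_taxon_id", "taxon_id"):
            print(loader.__name__, column, parent_ncbi_id(conn, 9606, column))
```

In this toy case the mismatched combinations simply find no parent. In a full taxonomy a generated taxon_id could coincide with an unrelated ncbi_taxon_id, in which case the lookup would return a wrong parent rather than nothing.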
This sounds like a case for discussion on both lists -
a matter of standardisation between the two projects.
Not quickly/easily solvable for now.
It's not just two projects (BioPerl & BioJava) (grin).
It's at least five projects (BioSQL itself plus BioRuby
and Biopython).
I'm not sure about BioRuby's implementation, but
currently I think BioJava is the odd one out - BioPerl,
Biopython, and BioSQL's load_ncbi_taxonomy.pl
all make entries in parent_taxon_id reference the
automatically generated taxon_id (please correct
me if I am wrong).
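One way to check which convention a given database follows is to count how many parent_taxon_id values resolve against each candidate column. The sketch below is a heuristic only, written against a simplified three-column taxon table. The names demo_db, count_parent_matches and guess_convention are hypothetical and not part of any Bio* project.

```python
import sqlite3


def demo_db(rows):
    """Build an in-memory, simplified taxon table from
    (taxon_id, ncbi_taxon_id, parent_taxon_id) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE taxon ("
        " taxon_id INTEGER PRIMARY KEY,"
        " ncbi_taxon_id INTEGER UNIQUE,"
        " parent_taxon_id INTEGER)"
    )
    conn.executemany("INSERT INTO taxon VALUES (?, ?, ?)", rows)
    return conn


def count_parent_matches(conn):
    """Return (matches on taxon_id, matches on ncbi_taxon_id,
    rows with a non-NULL parent_taxon_id)."""
    total = conn.execute(
        "SELECT COUNT(*) FROM taxon WHERE parent_taxon_id IS NOT NULL"
    ).fetchone()[0]
    by_local = conn.execute(
        "SELECT COUNT(*) FROM taxon c WHERE EXISTS ("
        " SELECT 1 FROM taxon p WHERE p.taxon_id = c.parent_taxon_id)"
    ).fetchone()[0]
    by_ncbi = conn.execute(
        "SELECT COUNT(*) FROM taxon c WHERE EXISTS ("
        " SELECT 1 FROM taxon p WHERE p.ncbi_taxon_id = c.parent_taxon_id)"
    ).fetchone()[0]
    return by_local, by_ncbi, total


def guess_convention(conn):
    """Guess which column parent_taxon_id refers to."""
    by_local, by_ncbi, total = count_parent_matches(conn)
    if total == 0:
        return "unknown"
    if by_local == total and by_ncbi < total:
        return "taxon_id"
    if by_ncbi == total and by_local < total:
        return "ncbi_taxon_id"
    # Both resolve fully (e.g. taxon_id == ncbi_taxon_id) or neither does.
    return "ambiguous"


if __name__ == "__main__":
    local_style = demo_db([(1, 9604, None), (2, 9605, 1), (3, 9606, 2)])
    ncbi_style = demo_db([(1, 9604, None), (2, 9605, 9604), (3, 9606, 9605)])
    print(guess_convention(local_style))
    print(guess_convention(ncbi_style))
```

A table where taxon_id was set equal to ncbi_taxon_id, as described for the old load_ncbi_taxonomy.pl, satisfies both conventions at once, so this check reports it as ambiguous.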
My personal view is that bioperl-db is the reference
implementation and should be followed in the event
of any ambiguity within BioSQL. In this particular
case, there is actually a BioSQL script to check
against too (load_ncbi_taxonomy.pl).
Hopefully Hilmar can give us an official verdict...
Peter
_______________________________________________
Biojava-l mailing list
http://lists.open-bio.org/mailman/listinfo/biojava-l