Incidentally, BioJava's approach matches the description in the BioSQL docs at:

 http://biosql.org/wiki/Schema_Overview#TAXON.2C_TAXON_NAME

(first example SQL statement - find the taxon id of the parent taxon for 'Homo 
sapiens' using a self-join)

The BioPerl/BioSQL load_ncbi_taxonomy.pl script however does not match this 
description.
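
To make the difference concrete, here is a minimal sketch in Python with sqlite3. It is not BioJava or BioPerl code, and the SQL is not copied from the BioSQL docs. The rows are the ones Deepak posted (quoted below). The helper names build_db and lineage are made up for illustration, and for brevity the name is stored directly in the taxon table, whereas in BioSQL it lives in taxon_name. The only thing that changes between the two runs is which column parent_taxon_id is joined against.

```python
import sqlite3

# Rows copied from Deepak's table quoted later in this thread:
# (taxon_id, ncbi_taxon_id, parent_taxon_id, name)
ROWS = [
    (2901, 3609, 276240, "Rhamnus"),
    (3610, 4403, 3609, "Platanus occidentalis"),
    (29052, 48579, 4403, "Suillus placidus"),
    (114412, 143975, 48579, "Diadasia australis"),
    (143976, 176516, 143975, "Arnicastrum guerrerense"),
    (30680, 50447, 176516, "Labiduridae"),
    (254757, 301952, 50447, "Oreostemma alpigenum var. haydenii"),
    (9394, 11632, 17394, "Retroviridae"),
    (277861, 327045, 9394, "Orthoretrovirinae"),
    (122448, 153057, 277861, "Alpharetrovirus"),
    (301952, 353825, 122448, "unclassified Alpharetrovirus"),
    (9584, 11876, 301952, "Avian sarcoma virus"),
]


def build_db():
    """Create an in-memory table with only the columns needed here.

    Assumption: 'name' is folded into taxon to keep the sketch short.
    """
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY,"
        " ncbi_taxon_id INTEGER, parent_taxon_id INTEGER, name TEXT)"
    )
    db.executemany("INSERT INTO taxon VALUES (?, ?, ?, ?)", ROWS)
    return db


def lineage(db, ncbi_taxon_id, link_column):
    """Return ancestor names, nearest parent first.

    link_column is the column that parent_taxon_id is matched against:
    'ncbi_taxon_id' (how BioJava reads the table) or 'taxon_id' (how
    load_ncbi_taxonomy.pl writes it). The walk stops as soon as the
    parent row is not present in the table.
    """
    if link_column not in ("ncbi_taxon_id", "taxon_id"):
        raise ValueError(link_column)
    # Self-join: each child row is joined to the row its parent_taxon_id
    # points at, under the chosen interpretation.
    sql = (
        "SELECT parent.ncbi_taxon_id, parent.name FROM taxon AS child"
        f" JOIN taxon AS parent ON child.parent_taxon_id = parent.{link_column}"
        " WHERE child.ncbi_taxon_id = ?"
    )
    names = []
    current = ncbi_taxon_id
    while True:
        row = db.execute(sql, (current,)).fetchone()
        if row is None:
            return names
        current, name = row
        names.append(name)


if __name__ == "__main__":
    db = build_db()
    for column in ("taxon_id", "ncbi_taxon_id"):
        print(column, "->", "; ".join(lineage(db, 11876, column)))
```

With this data, joining on taxon_id walks up through the Alpharetrovirus branch to Retroviridae, while joining on ncbi_taxon_id wanders off through unrelated taxa, because the value 301952 happens to exist in both columns. Neither join is wrong in itself; each is only right for a table loaded under the same convention.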

cheers,
Richard

On 12 Apr 2010, at 07:57, Richard Holland wrote:

> Thanks Deepak. 
> 
> I've had a look at the code and I believe it's due to the different ways in 
> which BioJava and BioPerl load the taxon table. 
> 
> BioJava sets the ncbi_taxon_id and parent_taxon_id columns based on the 
> values from the NCBI taxonomy file. The taxon_id column in BioJava is a 
> meaningless auto-generated value that is never used.
> 
> BioPerl however is generating taxon_id values and linking them by setting 
> parent_taxon_id to the generated value. The parent value from the NCBI 
> taxonomy file is therefore replaced with the BioPerl generated parent ID, 
> meaning that instead of linking from parent_taxon_id to ncbi_taxon_id as per 
> BioJava, the link is to taxon_id instead. (I'm basing this comment on looking 
> at load_ncbi_taxonomy.pl from the BioSQL archives.)
> 
> I believe if you load the taxonomy table using BioJava, you should see 
> BioJava giving correct behaviour. Likewise if you load it using BioPerl, 
> BioPerl will behave correctly. But if you load with one then query with the 
> other, you'll get incorrect results.
> 
> This sounds like a case for discussion on both lists - a matter of 
> standardisation between the two projects. Not quickly/easily solvable for now.
> 
> cheers,
> Richard
> 
> On 11 Apr 2010, at 22:08, Deepak Sheoran wrote:
> 
>> I am using the same table with the BioJava and BioPerl taxon programs, and 
>> the output I get is below:
>> 
>> BioJava:
>> For example, for ncbi_taxon_id = 11876 (Avian sarcoma virus), the lineage I 
>> get is 
>>            Rhamnus; Platanus occidentalis; Suillus placidus; Diadasia 
>> australis; Arnicastrum guerrerense; Labiduridae; Oreostemma alpigenum var. 
>> haydenii. 
>> 
>> BioJava's process of finding names: 
>> 11876==>301952==>50447==>176516==>143975==>48579==>4403==>3609==>276240   
>> (wrong way of doing things)
>> 
>> BioPerl:    
>> For example, for ncbi_taxon_id = 11876 (Avian sarcoma virus), the lineage I 
>> get is 
>>          Retroviridae; Orthoretrovirinae; Alpharetrovirus; unclassified  
>> Alpharetrovirus.
>> 
>> BioPerl's process of finding names: 11876==>353825==>153057==>327045==>11632   
>> (right way of doing things)
>> 
>> Hint: BioJava searches the ncbi_taxon_id column with a value from 
>> parent_taxon_id, whereas BioPerl searches the taxon_id column with a value 
>> from parent_taxon_id.
>> 
>> Taxon and taxon_name table content relevant to this discussion:
>> 
>> taxon_id  ncbi_taxon_id  parent_taxon_id  node_rank  name                                name_class
>> 2901      3609           276240           genus      Rhamnus                             scientific name
>> 3610      4403           3609             species    Platanus occidentalis               scientific name
>> 29052     48579          4403             species    Suillus placidus                    scientific name
>> 114412    143975         48579            species    Diadasia australis                  scientific name
>> 143976    176516         143975           species    Arnicastrum guerrerense             scientific name
>> 30680     50447          176516           family     Labiduridae                         scientific name
>> 254757    301952         50447            varietas   Oreostemma alpigenum var. haydenii  scientific name
>> 9394      11632          17394            family     Retroviridae                        scientific name
>> 277861    327045         9394             subfamily  Orthoretrovirinae                   scientific name
>> 122448    153057         277861           genus      Alpharetrovirus                     scientific name
>> 301952    353825         122448           no rank    unclassified Alpharetrovirus        scientific name
>> 9584      11876          301952           species    Avian sarcoma virus                 scientific name
>> 
>> Thanks
>> Deepak 
>> 
>> On 4/11/2010 2:53 PM, Richard Holland wrote:
>>> I'm sorry but I don't understand your example. Could you provide a real 
>>> example of correct values for each column from a sample taxon entry in 
>>> NCBI, plus an example of what BioJava is doing wrong? (i.e. give a sample 
>>> record to use as reference, then point out the correct value of 
>>> parent_taxon_id, and point out what value BioJava is using instead).
>>> 
>>> thanks,
>>> Richard
>>> 
>>> On 11 Apr 2010, at 20:16, Deepak Sheoran wrote:
>>> 
>>> 
>>> 
>>>> Hi,
>>>> 
>>>> There is a very fundamental issue in the SimpleNCBITaxon class, because of 
>>>> which it is producing the wrong taxonomy hierarchy. I am explaining what I 
>>>> have found; let me know what you think of it and suggest how to fix it.
>>>> 
>>>> 1) Columns in the taxon table are (taxon_id, ncbi_taxon_id, 
>>>> parent_taxon_id, node_rank, genetic_code, mito_genetic_code, left_value, 
>>>> right_value).
>>>> 2) In the class SimpleNCBITaxon we assume "parent_taxon_id" holds the 
>>>> parent's ncbi_taxon_id for the current ncbi_taxon_id value, but that is 
>>>> not true. The value "parent_taxon_id" actually holds is the "taxon_id" of 
>>>> the row whose ncbi_taxon_id is the parent of the current ncbi_taxon_id.
>>>> 
>>>> <property name="NCBITaxID" column="ncbi_taxon_id" node="@NCBITaxId"/>
>>>> <property name="nodeRank" column="node_rank"/>
>>>> <property name="geneticCode" column="genetic_code"/>
>>>> <property name="mitoGeneticCode" column="mito_genetic_code"/>
>>>> <property name="leftValue" column="left_value"/>
>>>> <property name="rightValue" column="right_value"/>
>>>> <property name="parentNCBITaxID" column="parent_taxon_id"/>      ----- this 
>>>> is not correct: the parent_taxon_id column stores the taxon_id of the row 
>>>> that holds the parent's ncbi_taxon_id for the current entry
>>>> 
>>>> Thanks
>>>> Deepak Sheoran
>>>> 
>>>> 
>>>> 
>>>> 
>>> --
>>> Richard Holland, BSc MBCS
>>> Operations and Delivery Director, Eagle Genomics Ltd
>>> T: +44 (0)1223 654481 ext 3 | E: 
>>> [email protected]
>>> http://www.eaglegenomics.com/
>>> 
>>> 
>>> 
>>> 
>> 
> 
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: [email protected]
> http://www.eaglegenomics.com/
> 
> 
> _______________________________________________
> Biojava-l mailing list  -  [email protected]
> http://lists.open-bio.org/mailman/listinfo/biojava-l


