Re: [Denovoassembler-users] Question about proportions values in ray-meta profiling

Sébastien Boisvert Thu, 14 Feb 2013 11:57:45 -0800

[Please C.C. the mailing list]

Hi James,


On 02/14/2013 01:37 PM, James Vincent wrote:
> Sébastien,
>
> Please pardon me for being obtuse,

We call this science ;-)

I do have a good explanation for this: it is a known problem with Gene
Ontology, and you should stick to the proportions that are not tied to a
given depth, which are reported in
BiologicalAbundances/_GeneOntology/Terms.xml
(also available as Terms.tsv).

See below.

> but I do not see what you describe
> in output files for even one level. I have attached the GeneOntology
> output for one level of molecular function as an example.  When I sort
> this file by proportion the single largest proportion looks like this:
>
> #Identifier   Name    Proportion      Observations    Total
> GO:0016462    pyrophosphatase activity        4.69785 352616  75059

In the documentation, only these files are described:

In Documentation/GeneOntology.txt

         <RayOutput>/BiologicalAbundances/_GeneOntology/Terms.xml
         <RayOutput>/BiologicalAbundances/_GeneOntology/Terms.tsv


Each GO term has a depth (in fact, each term can have many paths to the
root, each with a possibly different depth).

If you take a taxonomy tree, k-mers are attached to leaves (genomes). That's
a nice thing because k-mers can be classified at different levels without
reusing biological signal.

In Gene Ontology, k-mers are attached to any term at any level.

There can be a given k-mer that is attached to all the terms from the root to
a particular term.

In the design of Gene Ontology, each term can have an arbitrary number of
parents in the directed acyclic graph.

If you use Terms.tsv or Terms.xml or 0.Profile.GeneOntologyDomain.*, you should 
not be getting this behavior.

But if you use the counts in the files for specific depths, then the count
is recursive. Very often, the ratio "recursive count / total" is meaningless
(that's why our documentation points to Terms.xml).

>
> The proportion number is 4.69, with 352,616 observations out of only
> 75,059 total. This is the part that is confusing.
>

I see.

For a given term, a set of k-mers is associated with it. But any child of
the term can also have the same k-mers attached to it.

Therefore, the recursive counts do not make sense in some cases, as the code
counts the same k-mers over and over again.
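
To illustrate why a recursive count can exceed the total, here is a toy
sketch in Python. It is not Ray's code: the four-term ontology, the k-mer
names, and the helpers direct_count and recursive_count are all hypothetical,
and the recursion rule (a term's own observations plus those of all its
children) is only an assumption about what "recursive" means here.

```python
from collections import defaultdict

# Hypothetical toy ontology (child -> parents). "C" has two parents,
# which Gene Ontology allows because it is a directed acyclic graph.
parents = {"root": [], "A": ["root"], "B": ["root"], "C": ["A", "B"]}

# Hypothetical k-mers attached to each term; the same k-mer can be
# attached to a term and to its children.
attached = {
    "root": {"k1"},
    "A": {"k1", "k2"},
    "B": {"k2"},
    "C": {"k1", "k2", "k3"},
}

# Hypothetical coverage depth (number of observations) of each k-mer.
observations = {"k1": 2, "k2": 3, "k3": 5}

# Invert the parent map so each term knows its children.
children = defaultdict(list)
for term, term_parents in parents.items():
    for parent in term_parents:
        children[parent].append(term)


def direct_count(term):
    """Observations of the k-mers attached directly to this term."""
    return sum(observations[kmer] for kmer in attached[term])


def recursive_count(term):
    """Assumed recursive rule: own count plus the counts of all children."""
    return direct_count(term) + sum(recursive_count(c) for c in children[term])


# Each distinct k-mer counted once.
total = sum(observations.values())

for term in parents:
    ratio = recursive_count(term) / total
    print(term, direct_count(term), recursive_count(term), round(ratio, 2))
```

In this toy case the direct ratio for term A is 5 / 10 = 0.5, while its
recursive ratio is 15 / 10 = 1.5. That is the same kind of effect as the
proportion 4.69785 (352616 / 75059) in the depth file quoted in this thread.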

In our paper in Genome Biology, we used Terms.xml, not the recursive counts.
We should probably remove the files at particular depths because sometimes
the ratios are not relevant.

The counts for each depth are more like an experimental feature.

You should use results from Terms.xml (or Terms.tsv).

From our Genome Biology paper:

"Gene ontology profiling -- The de Bruijn graph was colored with coding 
sequences from the EMBL nucleotide sequence database [48]
(EMBL CDS), which are mapped to gene ontology by transitivity using the uniprot 
mapping to gene ontology
[49]. For each ontology term, coverage depths of colored k-mers were added to 
obtain its total number of
k-mer observations."


Example of a term in the XML file:


<domain>molecular_function</domain>
<paths><count>1</count>
<path>
<geneOntologyTerm><identifier>GO:0003674</identifier><name>molecular_function</name></geneOntologyTerm>
<geneOntologyTerm><identifier>GO:0005488</identifier><name>binding</name></geneOntologyTerm>
</path>
</paths>
<modeKmerCoverage>2</modeKmerCoverage><meanKmerCoverage>3.34305</meanKmerCoverage>
<totalColoredKmerObservations>382976</totalColoredKmerObservations>
<proportion>0.022287</proportion>
<distribution>
#Coverage       Frequency
2       62170
3       37829
4       8136
5       3333
6       1159
7       540
8       329
9       147
10      160
11      89
12      79
13      67
14      45
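
The distribution rows in this example can be cross-checked with a short
Python sketch. It assumes, based on the paper excerpt quoted before the
example, that totalColoredKmerObservations is the sum of coverage x frequency
over the whole distribution. The listing here stops at coverage 14, so the
partial sum should stay below the reported 382976.

```python
# Rows of the distribution shown in the example: (coverage, frequency).
distribution = [
    (2, 62170), (3, 37829), (4, 8136), (5, 3333), (6, 1159),
    (7, 540), (8, 329), (9, 147), (10, 160), (11, 89),
    (12, 79), (13, 67), (14, 45),
]

# Number of colored k-mers covered by the rows shown.
kmers = sum(freq for _, freq in distribution)

# Assumed definition: each k-mer contributes its coverage depth.
partial_observations = sum(cov * freq for cov, freq in distribution)

print(kmers)                 # 114083
print(partial_observations)  # 306753
```

The partial sum (306753) is below the reported total (382976), which is
consistent with the remaining rows of the distribution not being shown.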


> Jim
>
> P.S. We love Quebec - spent our honeymoon there (oh so long ago) and
> camped around the peninsula.
>

Cool !

I hope I have answered as clearly as possible the behavior you are observing. 
If you need more information,
please post again on the list.

-Sébastien

>
>
> On Thu, Feb 14, 2013 at 12:00 PM, Sébastien Boisvert
> <[email protected]> wrote:
>> Hi,
>>
>> On 02/14/2013 11:52 AM, James Vincent wrote:
>>> Hi Sébastien,
>>>
>>> Thanks very much for your quick and detailed reply.
>>>
>>>    I understand the details of proportion calculations and what they
>>> are, but that does not square with the output files.
>>>
>>> The sum of proportions in the file Terms.tsv, for example, is 55. It
>>> is not slightly off from 1. In other GO output files the sum of
>>> proportions is a similarly large number, 50, 60 or more.
>>
>> The file Terms.tsv contains all levels of depth in the directed acyclic graph
>> of Gene Ontology.
>>
>> If you take a particular depth, you should see something near 100%.
>>
>> Relevant files:
>>
>> $ ls|grep GeneO
>> 0.Profile.GeneOntologyDomain=biological_process.tsv
>> 0.Profile.GeneOntologyDomain=cellular_component.tsv
>> 0.Profile.GeneOntologyDomain=molecular_function.tsv
>> _GeneOntology
>>
>> $ ls _GeneOntology/
>> biological_process.Depth=0.tsv  cellular_component.Depth=4.tsv  
>> molecular_function.Depth=1.tsv  molecular_function.Depth=8.tsv
>> biological_process.Depth=1.tsv  cellular_component.Depth=5.tsv  
>> molecular_function.Depth=2.tsv  molecular_function.Depth=9.tsv
>> biological_process.Depth=2.tsv  cellular_component.Depth=6.tsv  
>> molecular_function.Depth=3.tsv  Terms.tsv
>> cellular_component.Depth=0.tsv  cellular_component.Depth=7.tsv  
>> molecular_function.Depth=4.tsv  Terms.xml
>> cellular_component.Depth=1.tsv  cellular_component.Depth=8.tsv  
>> molecular_function.Depth=5.tsv
>> cellular_component.Depth=2.tsv  cellular_component.Depth=9.tsv  
>> molecular_function.Depth=6.tsv
>> cellular_component.Depth=3.tsv  molecular_function.Depth=0.tsv  
>> molecular_function.Depth=7.tsv
>>
>>>
>>> The obvious example is that the first few largest proportion numbers
>>> add up to more than 2. They are all fractions like 0.5, 0.6 and so on.
>>> Is there an error in my run or perhaps my interpretation?
>>>
>>
>> Do you see this behavior if you look at a given depth and not at all the 
>> depths at once ?
>>
>>> Merci,
>>> Jim
>>>
>>>
>>>
>>>>> I expected the
>>>>> sum to be 1.0.
>>>>
>>>> Sometimes, it's a little bit more than 1.000 (like 1.00562), sometimes 
>>>> it's a little bit less. This is
>>>> because the demultiplexing process is not 100% accurate, but in general it
>>>> is really good.
>>>
>>>
>>>
>>>
>>> On Thu, Feb 14, 2013 at 10:59 AM, Sébastien Boisvert
>>> <[email protected]> wrote:
>>>> Hello,
>>>>
>>>> On 02/14/2013 10:13 AM, jjv5 wrote:
>>>>> Hello,
>>>>>
>>>>> I have used ray-meta with -gene-ontology enabled after downloading GO
>>>>> data using the Main.sh script in the git repo. Everything completed
>>>>> fine and produced
>>>>> expected output.
>>>>>
>>>>> The result file Terms.tsv under BiologicalAbundances/_GeneOntology
>>>>> contains proportions for the GO terms encountered. What is this
>>>>> proportion number based on?
>>>>
>>>> For plain genomes (via the -search command), proportions are computed by
>>>> demultiplexing the signal based on uniquely colored kmers.
>>>>
>>>> For taxonomy, the provided taxonomy tree is used to classify each observed 
>>>> kmer
>>>> at the vertex in the tree where the earliest common ancestor is found.
>>>>
>>>> For gene ontology, kmer observations are gathered for each ontology term, 
>>>> and proportions
>>>> are computed for each depth in the gene ontology directed acyclic graph.
>>>>
>>>>> Proportion of what?
>>>>
>>>> Of k-mers found in the de Bruijn subgraph that was built from the sequence 
>>>> reads
>>>> provided to Ray.
>>>>
>>>> For example, if you want a number of bacterial cells, you need to further 
>>>> normalize
>>>> by genome length, and so on.
>>>>
>>>>> The sum of the
>>>>> proportion values in this file is some large integer.
>>>>
>>>> In directories in BiologicalAbundances, a file called
>>>> SequenceAbundances.xml contains numerous counts.
>>>>
>>>> These large integers are either a number of k-mers, or a number of k-mer 
>>>> observations.
>>>> A k-mer observation corresponds to a k-mer occurring 1 time.
>>>>
>>>> So for a life form X, its kmer observations are computed as follows:
>>>>
>>>> 1. Gather the k-mers that are unique (specific) to this life form X;
>>>> 2. Compute an average number of observations (depth) for these objects;
>>>> 3. For life form X, compute the number of matched k-mers in the graph,
>>>> regardless of whether they are unique (breadth);
>>>> 4. With the number of matched objects (#3.) and the average depth (#2.),
>>>> the demultiplexed number of k-mer observations is calculated.
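
The four steps above can be sketched in Python as follows. This is a minimal
illustration, not Ray's implementation: the function name
demultiplexed_observations and its parameters are hypothetical, and step 4 is
assumed to be the product of breadth and average depth, since the message
only says the number "is calculated" from the two.

```python
def demultiplexed_observations(unique_kmer_depths, matched_kmers):
    """Estimate k-mer observations for one life form (assumed formula).

    unique_kmer_depths: coverage depths of the k-mers specific to the
        life form (steps 1 and 2).
    matched_kmers: number of k-mers of the life form matched in the
        graph, unique or not (step 3, breadth).
    """
    if not unique_kmer_depths:
        # No specific k-mers, so no depth estimate is available.
        return 0.0
    average_depth = sum(unique_kmer_depths) / len(unique_kmer_depths)
    # Step 4 (assumption): breadth multiplied by average depth.
    return matched_kmers * average_depth


# Hypothetical numbers: three unique k-mers seen 10, 12 and 8 times,
# and 1000 matched k-mers overall.
estimate = demultiplexed_observations([10, 12, 8], 1000)
print(estimate)  # 10000.0
```

Under this assumption, a proportion would then be each life form's estimate
divided by a total, which would also explain why the sum can land slightly
above or below 1.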
>>>>
>>>>> I expected the
>>>>> sum to be 1.0.
>>>>
>>>> Sometimes, it's a little bit more than 1.000 (like 1.00562), sometimes 
>>>> it's a little bit less. This is
>>>> because the demultiplexing process is not 100% accurate, but in general it
>>>> is really good.
>>>>
>>>>      see http://genomebiology.com/2012/13/12/R122/abstract
>>>>
>>>>> Is there further documentation somewhere?
>>>>
>>>> The documentation lives mainly in 
>>>> https://github.com/sebhtml/ray/tree/master/Documentation
>>>>
>>>> For what you are doing, these are relevant:
>>>>
>>>> * 
>>>> https://github.com/sebhtml/ray/blob/master/Documentation/BiologicalAbundances.txt
>>>> * 
>>>> https://github.com/sebhtml/ray/blob/master/Documentation/NCBI-Taxonomy.txt
>>>> * https://github.com/sebhtml/ray/blob/master/Documentation/GeneOntology.txt
>>>> * https://github.com/sebhtml/ray/blob/master/Documentation/Taxonomy.txt
>>>>
>>>>>
>>>>> Thanks,
>>>>> Jim
>>>>>
>>>>> P.S. Thanks for making ray available. We like it a great deal.
>>>>>
>>>>
>>>> Thanks !
>>>>
>>>> It's nice to hear what our end users like (and what they don't like too !).
>>>>
>>>>
>>>> There is a ticket in progress to further increase the accuracy of Ray 
>>>> Communities (
>>>> the solution that tells you what's in your sample) using topology.
>>>>
>>>>        https://github.com/sebhtml/ray/issues/133
>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> Free Next-Gen Firewall Hardware Offer
>>>>> Buy your Sophos next-gen firewall before the end March 2013
>>>>> and get the hardware for free! Learn more.
>>>>> http://p.sf.net/sfu/sophos-d2d-feb
>>>>> _______________________________________________
>>>>> Denovoassembler-users mailing list
>>>>> [email protected]
>>>>> https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
>>>>>
>>>>
>>>>
>>
>>


