Re: [Denovoassembler-users] Question about proportions values in ray-meta profiling

Sébastien Boisvert Mon, 22 Apr 2013 13:03:31 -0700

On 22/04/13 12:57 PM, Egon Ozer wrote:
> I'm also observing that gene ontology proportions aren't adding up to 1.  
> These are on runs in Ray v2.1.0 as I have not yet had time to re-run on 
> v2.2.0.  I didn't see anything addressing this issue in the changes from 
> v2.1.0 to v2.2.0 you posted to the list, so I thought I'd check with you 
> before devoting the time and resources to re-running the data on the newest 
> version.
>
> When I sum up all the Proportions in the 
> BiologicalAbundances/_GeneOntology/Terms.tsv file, the result is 5.611734...  
> Sums from the 0.Profile.GeneOntologyDomain=biological_process.tsv, 
> ...molecular_function.tsv, and ...cellular_component.tsv files are 1.78, 
> 3.03, and 8.11, respectively.  Is this perhaps because the same kmer is 
> mapping to multiple GeneOntology terms?

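For reference, here is a minimal Python sketch of the check you describe. It assumes the file is tab-separated and starts with a '#'-prefixed header line that has a Proportion column, as in the depth-file excerpt quoted further down; the actual Terms.tsv layout may differ. The function name and the example rows are made up for illustration.

```python
import csv
import io


def sum_proportions(handle, column="Proportion"):
    """Sum one numeric column of a tab-separated file whose header starts with '#'."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    header[0] = header[0].lstrip("#")  # "#Identifier" -> "Identifier"
    index = header.index(column)
    total = 0.0
    for row in reader:
        # Skip blank lines and any further comment lines.
        if not row or row[0].startswith("#"):
            continue
        total += float(row[index])
    return total


# Made-up rows for illustration only; this is not real Ray output.
example = io.StringIO(
    "#Identifier\tName\tProportion\tObservations\tTotal\n"
    "TERM:A\tterm a\t0.75\t75\t100\n"
    "TERM:B\tterm b\t0.5\t50\t100\n"
)
print(sum_proportions(example))
```

On a real run, the same function could be given an open handle on BiologicalAbundances/_GeneOntology/Terms.tsv instead of the in-memory example.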

For greater-than-one values, there is this ticket:

     https://github.com/sebhtml/ray/issues/158

It has something to do with a recursive sum wherein a kmer contributes to more 
than one ontology term
on its path from the root to the gene ontology term (a leaf).
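
To illustrate the effect with a toy example (a sketch only, not Ray's actual code; the terms, parent links and counts below are made up): if each k-mer observation is credited to every term on the path from the root to the term it is attached to, then summing proportions over all terms counts the same observation several times.

```python
# Hypothetical ontology: child -> parent. Each term has a single parent here;
# real Gene Ontology terms can have several parents.
PARENT = {"root": None, "binding": "root", "ion binding": "binding"}

# Hypothetical k-mer observations attached directly to each term.
DIRECT = {"root": 0, "binding": 40, "ion binding": 60}


def path_to_root(term):
    """Return the term followed by all of its ancestors up to the root."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def recursive_counts(direct):
    """Credit each term's observations to the term and to every ancestor."""
    counts = {term: 0 for term in PARENT}
    for term, n in direct.items():
        for ancestor in path_to_root(term):
            counts[ancestor] += n
    return counts


total = sum(DIRECT.values())  # 100 observations in all
recursive = recursive_counts(DIRECT)

# Each observation counted once: the proportions sum to about 1.
direct_sum = sum(n / total for n in DIRECT.values())
# Each observation counted once per term on its path: the sum exceeds 1.
recursive_sum = sum(n / total for n in recursive.values())

print(recursive)
print(direct_sum, recursive_sum)
```

In this toy case the 60 observations on the deepest term are counted three times, which is the kind of inflation that would push a sum of proportions well above 1.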

>
> Just to mention, in the _GeneOntology folder, I'm noting that some of the 
> terms within the files show very large numbers of negative observations (i.e. 
> -1.623E+09 out of 1929700676 total observations).  I know you mentioned below
> that the files in the _GeneOntology folder were experimental and you were
> considering not including them in future releases, but I wanted to point this
> out.  There are no negative proportions in the Terms.tsv file.

Sounds like a bug.


Can you create a ticket describing the problem:

     https://github.com/sebhtml/ray/issues/new

>
> Thanks for the help,
> - Egon
>

Thanks.

>
> On Feb 14, 2013, at 1:56 PM, Sébastien Boisvert wrote:
>
>> [Please C.C. the mailing list]
>>
>> Hi James,
>>
>> On 02/14/2013 01:37 PM, James Vincent wrote:
>>> Sébastien,
>>>
>>> Please pardon me for being obtuse,
>>
>> We call this science ;-)
>>
>> I do have a good explanation for this -- this is a known problem with Gene 
>> Ontology, and you should
>> stick to the proportions that are not tied to a given depth, which are reported 
>> in BiologicalAbundances/_GeneOntology/Terms.xml
>> (also available as Terms.tsv).
>>
>> See below.
>>
>>> but I do not see what you describe
>>> in output files for even one level. I have attached the GeneOntology
>>> output for one level of molecular function as an example.  When I sort
>>> this file by proportion the single largest proportion looks like this:
>>>
>>> #Identifier  Name    Proportion      Observations    Total
>>> GO:0016462   pyrophosphatase activity        4.69785 352616  75059
>>
>> In the documentation, only these files are described:
>>
>> In Documentation/GeneOntology.txt
>>
>>          <RayOutput>/BiologicalAbundances/_GeneOntology/Terms.xml
>>          <RayOutput>/BiologicalAbundances/_GeneOntology/Terms.tsv
>>
>>
>> Each GO term has a depth (in fact, each term can have many paths to the 
>> root, each with a
>> possibly different depth).
>>
>> If you take a taxonomy tree, k-mers are attached to leaves (genomes). That's 
>> a nice thing
>> because k-mers can be classified at different levels without reusing 
>> biological signal.
>>
>> In Gene Ontology, k-mers are attached to any term at any level.
>>
>> There can be a given k-mer that is attached to all the terms from the root to
>> a particular term.
>>
>> In the design of Gene Ontology, each term can have an arbitrary number of 
>> parents in
>> the directed acyclic graph.
>>
>> If you use Terms.tsv or Terms.xml or 0.Profile.GeneOntologyDomain.*, you 
>> should not be getting this behavior.
>>
>> But if you use the counts in the files for specific depths, then the count 
>> is recursive. Quite often, the
>> ratio "recursive count / total" is meaningless (that's why our 
>> documentation points to Terms.xml).
>>
>>>
>>> The proportion number is 4.69, with 352,616 observations out of only
>>> 75,059 total. This is the part that is confusing.
>>>
>>
>> I see.
>>
>> For a given term, a set of k-mers is associated with it. But any child of 
>> the term can also
>> have the same k-mers attached to it.
>>
>> Therefore, the recursive counts do not make sense in some cases as the code 
>> counts the same thing over and over
>> again.
>>
>> In our paper in Genome Biology, we used Terms.xml, not the recursive counts.
>> We should probably remove the files at particular depths because sometimes 
>> the ratios are not relevant.
>>
>> The counts for each depth are more like an experimental feature.
>>
>> You should use results from Terms.xml (or Terms.tsv).
>>
>>  From our Genome Biology paper:
>>
>> "Gene ontology profiling -- The de Bruijn graph was colored with coding 
>> sequences from the EMBL nucleotide sequence database [48]
>> (EMBL CDS), which are mapped to gene ontology by transitivity using the 
>> uniprot mapping to gene ontology
>> [49]. For each ontology term, coverage depths of colored k-mers were added 
>> to obtain its total number of
>> k-mer observations."
>>
>>
>> Example of a term in the XML file:
>>
>>
>> <domain>molecular_function</domain>
>> <paths><count>1</count>
>> <path>
>> <geneOntologyTerm><identifier>GO:0003674</identifier><name>molecular_function</name></geneOntologyTerm>
>> <geneOntologyTerm><identifier>GO:0005488</identifier><name>binding</name></geneOntologyTerm>
>> </path>
>> </paths>
>> <modeKmerCoverage>2</modeKmerCoverage><meanKmerCoverage>3.34305</meanKmerCoverage>
>> <totalColoredKmerObservations>382976</totalColoredKmerObservations>
>> <proportion>0.022287</proportion>
>> <distribution>
>> #Coverage       Frequency
>> 2       62170
>> 3       37829
>> 4       8136
>> 5       3333
>> 6       1159
>> 7       540
>> 8       329
>> 9       147
>> 10      160
>> 11      89
>> 12      79
>> 13      67
>> 14      45
>>
>>
>>> Jim
>>>
>>> P.S. We love Quebec - spent our honeymoon there (oh so long ago) and
>>> camped around the peninsula.
>>>
>>
>> Cool !
>>
>> I hope I have answered as clearly as possible the behavior you are 
>> observing. If you need more information,
>> please post again on the list.
>>
>> -Sébastien
>>
>>>
>>>
>>> On Thu, Feb 14, 2013 at 12:00 PM, Sébastien Boisvert
>>> <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> On 02/14/2013 11:52 AM, James Vincent wrote:
>>>>> Hi Sébastien,
>>>>>
>>>>> Thanks very much for your quick and detailed reply.
>>>>>
>>>>>    I understand the details of proportion calculations and what they
>>>>> are, but that does not square with the output files.
>>>>>
>>>>> The sum of proportions in the file Terms.tsv, for example, is 55. It
>>>>> is not slightly off from 1. In other GO output files the sum of
>>>>> proportions is a similarly large number, 50, 60 or more.
>>>>
>>>> The file Terms.tsv contains all levels of depth in the directed acyclic 
>>>> graph
>>>> of Gene Ontology.
>>>>
>>>> If you take a particular depth, you should see something near 100%.
>>>>
>>>> Relevant files:
>>>>
>>>> $ ls|grep GeneO
>>>> 0.Profile.GeneOntologyDomain=biological_process.tsv
>>>> 0.Profile.GeneOntologyDomain=cellular_component.tsv
>>>> 0.Profile.GeneOntologyDomain=molecular_function.tsv
>>>> _GeneOntology
>>>>
>>>> $ ls _GeneOntology/
>>>> biological_process.Depth=0.tsv  cellular_component.Depth=4.tsv  
>>>> molecular_function.Depth=1.tsv  molecular_function.Depth=8.tsv
>>>> biological_process.Depth=1.tsv  cellular_component.Depth=5.tsv  
>>>> molecular_function.Depth=2.tsv  molecular_function.Depth=9.tsv
>>>> biological_process.Depth=2.tsv  cellular_component.Depth=6.tsv  
>>>> molecular_function.Depth=3.tsv  Terms.tsv
>>>> cellular_component.Depth=0.tsv  cellular_component.Depth=7.tsv  
>>>> molecular_function.Depth=4.tsv  Terms.xml
>>>> cellular_component.Depth=1.tsv  cellular_component.Depth=8.tsv  
>>>> molecular_function.Depth=5.tsv
>>>> cellular_component.Depth=2.tsv  cellular_component.Depth=9.tsv  
>>>> molecular_function.Depth=6.tsv
>>>> cellular_component.Depth=3.tsv  molecular_function.Depth=0.tsv  
>>>> molecular_function.Depth=7.tsv
>>>>
>>>>>
>>>>> The obvious example is that the first few largest proportion numbers
>>>>> add up to more than 2. They are all fractions like 0.5, 0.6 and so on.
>>>>> Is there an error in my run or perhaps my interpretation?
>>>>>
>>>>
>>>> Do you see this behavior if you look at a given depth and not at all the 
>>>> depths at once ?
>>>>
>>>>> Merci,
>>>>> Jim
>>>>>
>>>>>
>>>>>
>>>>>>> I expected the
>>>>>>> sum to be 1.0.
>>>>>>
>>>>>> Sometimes, it's a little bit more than 1.000 (like 1.00562), sometimes 
>>>>>> it's a little bit less. This is
>>>>>> because the demultiplexing process is not 100% accurate, but in general 
>>>>>> it is really good.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Feb 14, 2013 at 10:59 AM, Sébastien Boisvert
>>>>> <[email protected]> wrote:
>>>>>> Hello,
>>>>>>
>>>>>> On 02/14/2013 10:13 AM, jjv5 wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I have used ray-meta with -gene-ontology enabled after downloading GO
>>>>>>> data using the Main.sh script in the git repo. Everything completed
>>>>>>> fine and produced
>>>>>>> expected output.
>>>>>>>
>>>>>>> The result file Terms.tsv under BiologicalAbundances/_GeneOntology
>>>>>>> contains proportions for the GO terms encountered. What is this
>>>>>>> proportion number based on?
>>>>>>
>>>>>> For plain genomes (via the -search command), proportions are computed by
>>>>>> demultiplexing the signal based on uniquely colored kmers.
>>>>>>
>>>>>> For taxonomy, the provided taxonomy tree is used to classify each 
>>>>>> observed kmer
>>>>>> at the vertex in the tree where the earliest common ancestor is found.
>>>>>>
>>>>>> For gene ontology, kmer observations are gathered for each ontology 
>>>>>> term, and proportions
>>>>>> are computed for each depth in the gene ontology directed acyclic graph.
>>>>>>
>>>>>>> Proportion of what?
>>>>>>
>>>>>> Of k-mers found in the de Bruijn subgraph that was built from the 
>>>>>> sequence reads
>>>>>> provided to Ray.
>>>>>>
>>>>>> For example, if you want a number of bacterial cells, you need to 
>>>>>> further normalize
>>>>>> by genome length, and so on.
>>>>>>
>>>>>>> The sum of the
>>>>>>> proportion values in this file is some large integer.
>>>>>>
>>>>>> In directories in BiologicalAbundances, a file called 
>>>>>> SequenceAbundances.xml contains
>>>>>> numerous counts.
>>>>>>
>>>>>> These large integers are either a number of k-mers, or a number of k-mer 
>>>>>> observations.
>>>>>> A k-mer observation corresponds to a k-mer occurring 1 time.
>>>>>>
>>>>>> So for a life form X, its kmer observations are computed as follows:
>>>>>>
>>>>>> 1. Gather the k-mers that are unique (specific) to this life form X;
>>>>>> 2. Compute an average number of observations (depth) for these objects;
>>>>>> 3. For life form X, compute the number of matched k-mers in the graph, 
>>>>>> regardless of whether they are unique (breadth);
>>>>>> 4. With the number of matched objects (#3.) and average depth (#2.), the 
>>>>>> demultiplexed number of k-mer observations is calculated.
>>>>>>
>>>>>>> I expected the
>>>>>>> sum to be 1.0.
>>>>>>
>>>>>> Sometimes, it's a little bit more than 1.000 (like 1.00562), sometimes 
>>>>>> it's a little bit less. This is
>>>>>> because the demultiplexing process is not 100% accurate, but in general 
>>>>>> it is really good.
>>>>>>
>>>>>>      see http://genomebiology.com/2012/13/12/R122/abstract
>>>>>>
>>>>>>> Is there further documentation somewhere?
>>>>>>
>>>>>> The documentation lives mainly in 
>>>>>> https://github.com/sebhtml/ray/tree/master/Documentation
>>>>>>
>>>>>> For what you are doing, these are relevant:
>>>>>>
>>>>>> * 
>>>>>> https://github.com/sebhtml/ray/blob/master/Documentation/BiologicalAbundances.txt
>>>>>> * 
>>>>>> https://github.com/sebhtml/ray/blob/master/Documentation/NCBI-Taxonomy.txt
>>>>>> * 
>>>>>> https://github.com/sebhtml/ray/blob/master/Documentation/GeneOntology.txt
>>>>>> * https://github.com/sebhtml/ray/blob/master/Documentation/Taxonomy.txt
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jim
>>>>>>>
>>>>>>> P.S. Thanks for making ray available. We like it a great deal.
>>>>>>>
>>>>>>
>>>>>> Thanks !
>>>>>>
>>>>>> It's nice to hear what our end users like (and what they don't like too 
>>>>>> !).
>>>>>>
>>>>>>
>>>>>> There is a ticket in progress to further increase the accuracy of Ray 
>>>>>> Communities (
>>>>>> the solution that tells you what's in your sample) using topology.
>>>>>>
>>>>>>        https://github.com/sebhtml/ray/issues/133
>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Denovoassembler-users mailing list
>>>>>>> [email protected]
>>>>>>> https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>
>>
>


