Re: [Denovoassembler-users] Question about proportions values in ray-meta profiling

Egon Ozer Mon, 22 Apr 2013 13:25:22 -0700

Thanks for the reply.  

I see that ticket 158 also includes reference to the negative proportions, so I 
assume you don't want me to create a brand new ticket for that problem.  Please 
let me know if you want a ticket specifically for the negative proportions 
issue.
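
For anyone who wants to reproduce the check, here is a minimal sketch of how the negative values and the proportion sum can be found. It assumes the files are tab-separated with a "#Identifier  Name  Proportion  Observations  Total" header, as in the depth-file excerpt quoted further down. The function name check_profile and the sample rows are made up for illustration and are not part of Ray.

```python
import csv
import io


def check_profile(handle):
    """Return (sum of proportions, identifiers of rows with a negative value)."""
    reader = csv.reader(handle, delimiter="\t")
    # Assumed header: #Identifier, Name, Proportion, Observations, Total
    header = [name.lstrip("#") for name in next(reader)]
    prop_col = header.index("Proportion")
    obs_col = header.index("Observations")
    total = 0.0
    negatives = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        proportion = float(row[prop_col])
        observations = float(row[obs_col])  # float() also accepts "-1.623E+09"
        total += proportion
        if proportion < 0 or observations < 0:
            negatives.append(row[0])
    return total, negatives


# Made-up rows for illustration only, not real Ray output.
sample = (
    "#Identifier\tName\tProportion\tObservations\tTotal\n"
    "GO:0000001\tterm a\t0.75\t75\t100\n"
    "GO:0000002\tterm b\t0.5\t50\t100\n"
    "GO:0000003\tterm c\t-0.25\t-25\t100\n"
)
total, negatives = check_profile(io.StringIO(sample))
print(total, negatives)
```

To run it on a real file, open the .tsv file in text mode and pass the file handle to check_profile instead of the in-memory sample.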


I also see that the ticket, as in the email chain below, just addresses the 
proportion-greater-than-one issue in the "Depth=" files and suggests that 
proportions should add up to 1 in "Terms.tsv," but I am seeing the issue in the 
"Terms.tsv" and the three "Domain=" files.  Should I even be expecting the sum 
of proportions to equal 1 in any of these Gene Ontology files, or will the sum 
likely always be > 1 because kmers may be able to contribute to more than one 
ontology term?  
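
To make the question concrete, here is a toy sketch of the effect I have in mind. It is not Ray's code, only an illustration of the recursive-sum explanation in the quoted replies. The four-term graph, the term names (root, A, B, C), the counts, and the function names are all invented. Each observation is credited to its own term and to every ancestor, and each term's count is then divided by the true total.

```python
# Toy ontology (invented): term -> list of parents.
# A term may have several parents, as in a directed acyclic graph.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],
}


def ancestors(term):
    """Return the term itself plus every term reachable through parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def recursive_proportions(direct_counts):
    """Credit each observation to its term and all ancestors, then divide by the true total."""
    total = sum(direct_counts.values())
    counts = {term: 0 for term in PARENTS}
    for term, count in direct_counts.items():
        for hit in ancestors(term):
            counts[hit] += count
    return {term: count / total for term, count in counts.items()}


# 100 observations in total: 60 attached to C, 40 attached to A.
direct = {"C": 60, "A": 40}
proportions = recursive_proportions(direct)
print(proportions)
print(sum(proportions.values()))
```

In this toy case every individual proportion is at most 1, but their sum is well above 1 because the same 100 observations are counted once per term on each path to the root.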

Thanks,
- Egon


On Apr 22, 2013, at 3:03 PM, Sébastien Boisvert wrote:

> On 22/04/13 12:57 PM, Egon Ozer wrote:
>> I'm also observing that gene ontology proportions aren't adding up to 1.  
>> These are on runs in Ray v2.1.0 as I have not yet had time to re-run on 
>> v2.2.0.  I didn't see anything addressing this issue in the changes from 
>> v2.1.0 to v2.2.0 you posted to the list, so I thought I'd check with you 
>> before devoting the time and resources to re-running the data on the newest 
>> version.
>> 
>> When I sum up all the Proportions in the 
>> BiologicalAbundances/_GeneOntology/Terms.tsv file, the result is 5.611734... 
>>  Sums from the 0.Profile.GeneOntologyDomain=biological_process.tsv, 
>> ...molecular_function.tsv, and ...cellular_component.tsv files are 1.78, 
>> 3.03, and 8.11, respectively.  Is this perhaps because the same kmer is 
>> mapping to multiple GeneOntology terms?
> 
> For greater-than one values, there is this ticket:
> 
>    https://github.com/sebhtml/ray/issues/158
> 
> It has something to do with a recursive sum wherein a kmer contributes to more 
> than one ontology term
> on its path from the root to the gene ontology term (a leaf).
> 
>> 
>> Just to mention, in the _GeneOntology folder, I'm noting that some of the 
>> terms within the files show very large numbers of negative observations 
>> (i.e. -1.623E+09 out of 1929700676
>> total observations).  I know you mentioned below that the files in the 
>> _GeneOntology folder were experimental and you were considering not 
>> including them in future releases, but
>> I wanted to point this out.  There are no negative proportions in the 
>> Terms.tsv file.
> 
> Sounds like a bug.
> 
> 
> Can you create a ticket describing the problem:
> 
>    https://github.com/sebhtml/ray/issues/new
> 
>> 
>> Thanks for the help,
>> - Egon
>> 
> 
> Thanks.
> 
>> 
>> On Feb 14, 2013, at 1:56 PM, Sébastien Boisvert wrote:
>> 
>>> [Please C.C. the mailing list]
>>> 
>>> Hi James,
>>> 
>>> On 02/14/2013 01:37 PM, James Vincent wrote:
>>>> Sébastien,
>>>> 
>>>> Please pardon me for being obtuse,
>>> 
>>> We call this science ;-)
>>> 
>>> I do have a good explanation for this -- this is a known problem with Gene 
>>> Ontology and you should
>>> stick to the proportions that are not tied to a given depth, which are 
>>> reported in BiologicalAbundances/_GeneOntology/Terms.xml
>>> (also available as Terms.tsv).
>>> 
>>> See below.
>>> 
>>>> but I do not see what you describe
>>>> in output files for even one level. I have attached the GeneOntology
>>>> output for one level of molecular function as an example.  When I sort
>>>> this file by proportion the single largest proportion looks like this:
>>>> 
>>>> #Identifier  Name    Proportion      Observations    Total
>>>> GO:0016462   pyrophosphatase activity        4.69785 352616  75059
>>> 
>>> In the documentation, only these files are described:
>>> 
>>> In Documentation/GeneOntology.txt
>>> 
>>>         <RayOutput>/BiologicalAbundances/_GeneOntology/Terms.xml
>>>         <RayOutput>/BiologicalAbundances/_GeneOntology/Terms.tsv
>>> 
>>> 
>>> Each GO term has a depth (in fact, each term can have many paths to the 
>>> root, each with a
>>> possibly different depth).
>>> 
>>> If you take a taxonomy tree, k-mers are attached to leaves (genomes). 
>>> That's a nice thing
>>> because k-mers can be classified at different levels without reusing 
>>> biological signal.
>>> 
>>> In Gene Ontology, k-mers are attached to any term at any level.
>>> 
>>> There can be a given k-mer that is attached to all the terms from the root 
>>> to
>>> a particular term.
>>> 
>>> In the design of Gene Ontology, each term can have an arbitrary number of 
>>> parents in
>>> the directed acyclic graph.
>>> 
>>> If you use Terms.tsv or Terms.xml or 0.Profile.GeneOntologyDomain.*, you 
>>> should not be getting this behavior.
>>> 
>>> But if you use the counts in the files for specific depths, then the count 
>>> is recursive. Very often, the
>>> ratio "recursive count / total" is meaningless (that's why our 
>>> documentation points to Terms.xml).
>>> 
>>>> 
>>>> The proportion number is 4.69, with 352,616 observations out of only
>>>> 75,059 total. This is the part that is confusing.
>>>> 
>>> 
>>> I see.
>>> 
>>> For a given term, a set of k-mers is associated with it. But any child 
>>> of the term can also
>>> have the same k-mers attached to it.
>>> 
>>> Therefore, the recursive counts do not make sense in some cases as the code 
>>> counts the same thing over and over
>>> again.
>>> 
>>> In our paper in Genome Biology, we used Terms.xml, not the recursive counts.
>>> We should probably remove the files at particular depths because sometimes 
>>> the ratios are not relevant.
>>> 
>>> The counts for each depth are more like an experimental feature.
>>> 
>>> You should use results from Terms.xml (or Terms.tsv).
>>> 
>>> From our Genome Biology paper:
>>> 
>>> "Gene ontology profiling -- The de Bruijn graph was colored with coding 
>>> sequences from the EMBL nucleotide sequence database [48]
>>> (EMBL CDS), which are mapped to gene ontology by transitivity using the 
>>> uniprot mapping to gene ontology
>>> [49]. For each ontology term, coverage depths of colored k-mers were added 
>>> to obtain its total number of
>>> k-mer observations."
>>> 
>>> 
>>> Example of a term in the XML file:
>>> 
>>> 
>>> <domain>molecular_function</domain>
>>> <paths><count>1</count>
>>> <path>
>>> <geneOntologyTerm><identifier>GO:0003674</identifier><name>molecular_function</name></geneOntologyTerm>
>>> <geneOntologyTerm><identifier>GO:0005488</identifier><name>binding</name></geneOntologyTerm>
>>> </path>
>>> </paths>
>>> <modeKmerCoverage>2</modeKmerCoverage><meanKmerCoverage>3.34305</meanKmerCoverage>
>>> <totalColoredKmerObservations>382976</totalColoredKmerObservations>
>>> <proportion>0.022287</proportion>
>>> <distribution>
>>> #Coverage       Frequency
>>> 2       62170
>>> 3       37829
>>> 4       8136
>>> 5       3333
>>> 6       1159
>>> 7       540
>>> 8       329
>>> 9       147
>>> 10      160
>>> 11      89
>>> 12      79
>>> 13      67
>>> 14      45
>>> 
>>> 
>>>> Jim
>>>> 
>>>> P.S. We love Quebec - spent our honeymoon there (oh so long ago) and
>>>> camped around the peninsula.
>>>> 
>>> 
>>> Cool !
>>> 
>>> I hope I have answered as clearly as possible the behavior you are 
>>> observing. If you need more information,
>>> please post again on the list.
>>> 
>>> -Sébastien
>>> 
>>>> 
>>>> 
>>>> On Thu, Feb 14, 2013 at 12:00 PM, Sébastien Boisvert
>>>> <[email protected]> wrote:
>>>>> Hi,
>>>>> 
>>>>> On 02/14/2013 11:52 AM, James Vincent wrote:
>>>>>> Hi Sébastien,
>>>>>> 
>>>>>> Thanks very much for your quick and detailed reply.
>>>>>> 
>>>>>>   I understand the details of proportion calculations and what they
>>>>>> are, but that does not square with the output files.
>>>>>> 
>>>>>> The sum of proportions in the file Terms.tsv, for example, is 55. It
>>>>>> is not slightly off from 1. In other GO output files the sum of
>>>>>> proportions is a similarly large number, 50, 60 or more.
>>>>> 
>>>>> The file Terms.tsv contains all levels of depth in the directed acyclic 
>>>>> graph
>>>>> of Gene Ontology.
>>>>> 
>>>>> If you take a particular depth, you should see something near 100%.
>>>>> 
>>>>> Relevant files:
>>>>> 
>>>>> $ ls|grep GeneO
>>>>> 0.Profile.GeneOntologyDomain=biological_process.tsv
>>>>> 0.Profile.GeneOntologyDomain=cellular_component.tsv
>>>>> 0.Profile.GeneOntologyDomain=molecular_function.tsv
>>>>> _GeneOntology
>>>>> 
>>>>> $ ls _GeneOntology/
>>>>> biological_process.Depth=0.tsv  cellular_component.Depth=4.tsv  molecular_function.Depth=1.tsv  molecular_function.Depth=8.tsv
>>>>> biological_process.Depth=1.tsv  cellular_component.Depth=5.tsv  molecular_function.Depth=2.tsv  molecular_function.Depth=9.tsv
>>>>> biological_process.Depth=2.tsv  cellular_component.Depth=6.tsv  molecular_function.Depth=3.tsv  Terms.tsv
>>>>> cellular_component.Depth=0.tsv  cellular_component.Depth=7.tsv  molecular_function.Depth=4.tsv  Terms.xml
>>>>> cellular_component.Depth=1.tsv  cellular_component.Depth=8.tsv  molecular_function.Depth=5.tsv
>>>>> cellular_component.Depth=2.tsv  cellular_component.Depth=9.tsv  molecular_function.Depth=6.tsv
>>>>> cellular_component.Depth=3.tsv  molecular_function.Depth=0.tsv  molecular_function.Depth=7.tsv
>>>>> 
>>>>>> 
>>>>>> The obvious example is that the first few largest proportion numbers
>>>>>> add up to more than 2. They are all fractions like 0.5, 0.6 and so on.
>>>>>> Is there an error in my run or perhaps my interpretation?
>>>>>> 
>>>>> 
>>>>> Do you see this behavior if you look at a given depth and not at all the 
>>>>> depths at once ?
>>>>> 
>>>>>> Merci,
>>>>>> Jim
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>>> I expected the
>>>>>>>> sum to be 1.0.
>>>>>>> 
>>>>>>> Sometimes, it's a little bit more than 1.000 (like 1.00562), sometimes 
>>>>>>> it's a little bit less. This is
>>>>>>> because the demultiplexing process is not 100% accurate, but in general 
>>>>>>> it is really good.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Thu, Feb 14, 2013 at 10:59 AM, Sébastien Boisvert
>>>>>> <[email protected]> wrote:
>>>>>>> Hello,
>>>>>>> 
>>>>>>> On 02/14/2013 10:13 AM, jjv5 wrote:
>>>>>>>> Hello,
>>>>>>>> 
>>>>>>>> I have used ray-meta with -gene-ontology enabled after downloading GO
>>>>>>>> data using the Main.sh script in the git repo. Everything completed
>>>>>>>> fine and produced
>>>>>>>> expected output.
>>>>>>>> 
>>>>>>>> The result file Terms.tsv under BiologicalAbundances/_GeneOntology
>>>>>>>> contains proportions for the GO terms encountered. What is this
>>>>>>>> proportion number based on?
>>>>>>> 
>>>>>>> For plain genomes (via the -search command), proportions are computed by
>>>>>>> demultiplexing the signal based on uniquely colored kmers.
>>>>>>> 
>>>>>>> For taxonomy, the provided taxonomy tree is used to classify each 
>>>>>>> observed kmer
>>>>>>> at the vertex in the tree where the earliest common ancestor is found.
>>>>>>> 
>>>>>>> For gene ontology, kmer observations are gathered for each ontology 
>>>>>>> term, and proportions
>>>>>>> are computed for each depth in the gene ontology directed acyclic graph.
>>>>>>> 
>>>>>>>> Proportion of what?
>>>>>>> 
>>>>>>> Of k-mers found in the de Bruijn subgraph that was built from the 
>>>>>>> sequence reads
>>>>>>> provided to Ray.
>>>>>>> 
>>>>>>> For example, if you want a number of bacterial cells, you need to 
>>>>>>> further normalize
>>>>>>> by genome length, and so on.
>>>>>>> 
>>>>>>>> The sum of the
>>>>>>>> proportion values in this file is some large integer.
>>>>>>> 
>>>>>>> In directories in BiologicalAbundances, a file called 
>>>>>>> SequenceAbundances.xml contains
>>>>>>> numerous counts.
>>>>>>> 
>>>>>>> These large integers are either a number of k-mers, or a number of 
>>>>>>> k-mer observations.
>>>>>>> A k-mer observation corresponds to a k-mer occurring 1 time.
>>>>>>> 
>>>>>>> So for a life form X, its kmer observations are computed as follows:
>>>>>>> 
>>>>>>> 1. Gather the k-mers that are unique (specific) to this life form X;
>>>>>>> 2. Compute an average number of observations (depth) for these objects;
>>>>>>> 3. For life form X, compute the number of matched k-mers in the graph, 
>>>>>>> regardless of whether they are unique (breadth);
>>>>>>> 4. With the number of matched objects (#3.) and average depth (#2.), the 
>>>>>>> demultiplexed number of k-mer observations is calculated.
>>>>>>> 
>>>>>>>> I expected the
>>>>>>>> sum to be 1.0.
>>>>>>> 
>>>>>>> Sometimes, it's a little bit more than 1.000 (like 1.00562), sometimes 
>>>>>>> it's a little bit less. This is
>>>>>>> because the demultiplexing process is not 100% accurate, but in general 
>>>>>>> it is really good.
>>>>>>> 
>>>>>>>     see http://genomebiology.com/2012/13/12/R122/abstract
>>>>>>> 
>>>>>>>> Is there further documentation somewhere?
>>>>>>> 
>>>>>>> The documentation lives mainly in 
>>>>>>> https://github.com/sebhtml/ray/tree/master/Documentation
>>>>>>> 
>>>>>>> For what you are doing, these are relevant:
>>>>>>> 
>>>>>>> * 
>>>>>>> https://github.com/sebhtml/ray/blob/master/Documentation/BiologicalAbundances.txt
>>>>>>> * 
>>>>>>> https://github.com/sebhtml/ray/blob/master/Documentation/NCBI-Taxonomy.txt
>>>>>>> * 
>>>>>>> https://github.com/sebhtml/ray/blob/master/Documentation/GeneOntology.txt
>>>>>>> * https://github.com/sebhtml/ray/blob/master/Documentation/Taxonomy.txt
>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Jim
>>>>>>>> 
>>>>>>>> P.S. Thanks for making ray available. We like it a great deal.
>>>>>>>> 
>>>>>>> 
>>>>>>> Thanks !
>>>>>>> 
>>>>>>> It's nice to hear what our end users like (and what they don't like too 
>>>>>>> !).
>>>>>>> 
>>>>>>> 
>>>>>>> There is a ticket in progress to further increase the accuracy of Ray 
>>>>>>> Communities (
>>>>>>> the solution that tells you what's in your sample) using topology.
>>>>>>> 
>>>>>>>       https://github.com/sebhtml/ray/issues/133
>>>>>>> 
>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>> Free Next-Gen Firewall Hardware Offer
>>>>>>>> Buy your Sophos next-gen firewall before the end March 2013
>>>>>>>> and get the hardware for free! Learn more.
>>>>>>>> http://p.sf.net/sfu/sophos-d2d-feb
>>>>>>>> _______________________________________________
>>>>>>>> Denovoassembler-users mailing list
>>>>>>>> [email protected]
>>>>>>>> https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>> 
> 


