I'm also observing that gene ontology proportions aren't adding up to 1.  These 
are on runs in Ray v2.1.0 as I have not yet had time to re-run on v2.2.0.  I 
didn't see anything addressing this issue in the changes from v2.1.0 to v2.2.0 
you posted to the list, so I thought I'd check with you before devoting the 
time and resources to re-running the data on the newest version.

When I sum up all the Proportions in the 
BiologicalAbundances/_GeneOntology/Terms.tsv file, the result is 5.611734...  
Sums from the 0.Profile.GeneOntologyDomain=biological_process.tsv, 
...molecular_function.tsv, and ...cellular_component.tsv files are 1.78, 3.03, 
and 8.11, respectively.  Is this perhaps because the same kmer is mapping to 
multiple GeneOntology terms?

Just to mention, in the _GeneOntology folder, I'm noting that some of the terms 
within the files show very large numbers of negative observations (e.g. 
-1.623E+09 out of 1929700676 total observations).  I know you mentioned below 
that the files in the _GeneOntology folder were experimental and you were 
considering not including them in future releases, but I wanted to point this 
out.  There are no negative proportions in the Terms.tsv file.
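
For reference, a minimal sketch of this kind of check. It assumes the file is tab-separated with the header "#Identifier Name Proportion Observations Total" seen in the per-depth output quoted below; the function name and the tiny example table are made up for illustration.

```python
import csv
import io


def summarize_go_table(handle):
    """Return (sum of the Proportion column, rows with negative Observations)."""
    reader = csv.reader(handle, delimiter="\t")
    # The header line starts with '#', so strip it before looking up columns.
    header = [name.lstrip("#") for name in next(reader)]
    prop_col = header.index("Proportion")
    obs_col = header.index("Observations")

    total = 0.0
    negatives = []
    for row in reader:
        if not row:
            continue
        total += float(row[prop_col])
        # float() also accepts values written like -1.623E+09.
        if float(row[obs_col]) < 0:
            negatives.append(row)
    return total, negatives


# Made-up table, only to exercise the function (not real Ray output).
example = (
    "#Identifier\tName\tProportion\tObservations\tTotal\n"
    "TERM_A\tterm a\t0.75\t300\t400\n"
    "TERM_B\tterm b\t0.50\t-1.623E+09\t400\n"
)
proportion_sum, negative_rows = summarize_go_table(io.StringIO(example))
print(proportion_sum, [row[0] for row in negative_rows])
```

On a real run, the handle would come from opening the Terms.tsv file instead of the in-memory example.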

Thanks for the help,
- Egon


On Feb 14, 2013, at 1:56 PM, Sébastien Boisvert wrote:

> [Please C.C. the mailing list]
> 
> Hi James,
> 
> On 02/14/2013 01:37 PM, James Vincent wrote:
>> Sébastien,
>> 
>> Please pardon me for being obtuse,
> 
> We call this science ;-)
> 
> I do have a good explanation for this: it is a known problem with Gene 
> Ontology, and you should
> stick to the proportions that are not tied to a given depth, which are reported 
> in BiologicalAbundances/_GeneOntology/Terms.xml
> (also available as Terms.tsv).
> 
> See below.
> 
>> but I do not see what you describe
>> in output files for even one level. I have attached the GeneOntology
>> output for one level of molecular function as an example.  When I sort
>> this file by proportion the single largest proportion looks like this:
>> 
>> #Identifier  Name    Proportion      Observations    Total
>> GO:0016462   pyrophosphatase activity        4.69785 352616  75059
> 
> In the documentation, only these files are described:
> 
> In Documentation/GeneOntology.txt
> 
>         <RayOutput>/BiologicalAbundances/_GeneOntology/Terms.xml
>         <RayOutput>/BiologicalAbundances/_GeneOntology/Terms.tsv
> 
> 
> Each GO term has a depth (in fact, each term can have many paths to the root, 
> each with a
> possibly different depth).
> 
> If you take a taxonomy tree, k-mers are attached to leaves (genomes). That's 
> a nice thing
> because k-mers can be classified at different levels without reusing 
> biological signal.
> 
> In Gene Ontology, k-mers are attached to any term at any level.
> 
> There can be a given k-mer that is attached to all the terms from the root to
> a particular term.
> 
> In the design of Gene Ontology, each term can have an arbitrary number of 
> parents in
> the directed acyclic graph.
> 
> If you use Terms.tsv or Terms.xml or 0.Profile.GeneOntologyDomain.*, you 
> should not be getting this behavior.
> 
> But if you use the counts in the files for specific depths, then the count is 
> recursive. Very often, the
> ratio "recursive count / total" is meaningless (that's why our 
> documentation points to Terms.xml).
> 
>> 
>> The proportion number is 4.69, with 352,616 observations out of only
>> 75,059 total. This is the part that is confusing.
>> 
> 
> I see.
> 
> For a given term, a set of k-mers is associated with it. But any child of 
> the term can also
> have the same k-mers attached to it.
> 
> Therefore, the recursive counts do not make sense in some cases as the code 
> counts the same thing over and over
> again.
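
To illustrate why a recursive ratio can exceed 1, here is a minimal sketch on a made-up ontology. The term names, k-mer labels, and counts are hypothetical, and how Ray computes the per-depth totals is not shown in this thread. The sketch simply assumes that the total counts each k-mer once, while a term's recursive count adds its own observations to those of all its descendants.

```python
# Hypothetical toy ontology: each term maps to its child terms.
children = {
    "root": ["binding", "catalysis"],
    "binding": ["ion binding"],
    "catalysis": [],
    "ion binding": [],
}

# k-mer observations attached directly to each term. The same k-mer can be
# attached to a term and to its ancestors, as described above.
direct = {
    "root": {"kmer1": 3, "kmer2": 2},
    "binding": {"kmer1": 3},
    "ion binding": {"kmer1": 3},
    "catalysis": {"kmer2": 2},
}


def recursive_count(term):
    """Observations of the term plus those of every descendant."""
    own = sum(direct.get(term, {}).values())
    return own + sum(recursive_count(child) for child in children[term])


# Total observations when every distinct k-mer is counted exactly once.
distinct = {}
for kmers in direct.values():
    distinct.update(kmers)
total = sum(distinct.values())

ratio = recursive_count("root") / total
print(recursive_count("root"), total, ratio)

# The same arithmetic on the figures quoted above for GO:0016462.
print(round(352616 / 75059, 5))
```

Here kmer1 is counted three times along the root, binding, ion binding path, so the recursive count for the root is larger than the total.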
> 
> In our paper in Genome Biology, we used Terms.xml, not the recursive counts.
> We should probably remove the files at particular depths because sometimes 
> the ratios are not relevant.
> 
> The counts for each depth are more like an experimental feature.
> 
> You should use results from Terms.xml (or Terms.tsv).
> 
> From our Genome Biology paper:
> 
> "Gene ontology profiling -- The de Bruijn graph was colored with coding 
> sequences from the EMBL nucleotide sequence database [48]
> (EMBL CDS), which are mapped to gene ontology by transitivity using the 
> uniprot mapping to gene ontology
> [49]. For each ontology term, coverage depths of colored k-mers were added to 
> obtain its total number of
> k-mer observations."
> 
> 
> Example of a term in the XML file:
> 
> 
> <domain>molecular_function</domain>
> <paths><count>1</count>
> <path>
> <geneOntologyTerm><identifier>GO:0003674</identifier><name>molecular_function</name></geneOntologyTerm>
> <geneOntologyTerm><identifier>GO:0005488</identifier><name>binding</name></geneOntologyTerm>
> </path>
> </paths>
> <modeKmerCoverage>2</modeKmerCoverage><meanKmerCoverage>3.34305</meanKmerCoverage>
> <totalColoredKmerObservations>382976</totalColoredKmerObservations>
> <proportion>0.022287</proportion>
> <distribution>
> #Coverage       Frequency
> 2       62170
> 3       37829
> 4       8136
> 5       3333
> 6       1159
> 7       540
> 8       329
> 9       147
> 10      160
> 11      89
> 12      79
> 13      67
> 14      45
> 
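
A minimal sketch of how the distribution relates to the observation count, assuming (from the paper excerpt above, not from the Ray source) that a term's total is the sum of coverage times frequency over its distribution. The rows are the ones shown in the excerpt, which is cut off at coverage 14, so the partial sum stays below the reported totalColoredKmerObservations of 382976.

```python
# (coverage, frequency) rows copied from the XML excerpt above.
distribution = [
    (2, 62170), (3, 37829), (4, 8136), (5, 3333), (6, 1159),
    (7, 540), (8, 329), (9, 147), (10, 160), (11, 89),
    (12, 79), (13, 67), (14, 45),
]

# Each k-mer seen `coverage` times contributes `coverage` observations.
partial_observations = sum(cov * freq for cov, freq in distribution)
kmers_listed = sum(freq for _, freq in distribution)

reported_total = 382976  # <totalColoredKmerObservations> in the excerpt
print(partial_observations, kmers_listed, reported_total - partial_observations)
```

The remainder would have to come from the higher-coverage rows that the excerpt does not show.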
> 
>> Jim
>> 
>> P.S. We love Quebec - spent our honeymoon there (oh so long ago) and
>> camped around the peninsula.
>> 
> 
> Cool !
> 
> I hope I have explained the behavior you are observing as clearly as possible. 
> If you need more information,
> please post again on the list.
> 
> -Sébastien
> 
>> 
>> 
>> On Thu, Feb 14, 2013 at 12:00 PM, Sébastien Boisvert
>> <[email protected]> wrote:
>>> Hi,
>>> 
>>> On 02/14/2013 11:52 AM, James Vincent wrote:
>>>> Hi Sébastien,
>>>> 
>>>> Thanks very much for your quick and detailed reply.
>>>> 
>>>>   I understand the details of proportion calculations and what they
>>>> are, but that does not square with the output files.
>>>> 
>>>> The sum of proportions in the file Terms.tsv, for example, is 55. It
>>>> is not slightly off from 1. In other GO output files the sum of
>>>> proportions is a similarly large number, 50, 60 or more.
>>> 
>>> The file Terms.tsv contains all levels of depth in the directed acyclic 
>>> graph
>>> of Gene Ontology.
>>> 
>>> If you take a particular depth, you should see something near 100%.
>>> 
>>> Relevant files:
>>> 
>>> $ ls|grep GeneO
>>> 0.Profile.GeneOntologyDomain=biological_process.tsv
>>> 0.Profile.GeneOntologyDomain=cellular_component.tsv
>>> 0.Profile.GeneOntologyDomain=molecular_function.tsv
>>> _GeneOntology
>>> 
>>> $ ls _GeneOntology/
>>> biological_process.Depth=0.tsv  cellular_component.Depth=4.tsv  molecular_function.Depth=1.tsv  molecular_function.Depth=8.tsv
>>> biological_process.Depth=1.tsv  cellular_component.Depth=5.tsv  molecular_function.Depth=2.tsv  molecular_function.Depth=9.tsv
>>> biological_process.Depth=2.tsv  cellular_component.Depth=6.tsv  molecular_function.Depth=3.tsv  Terms.tsv
>>> cellular_component.Depth=0.tsv  cellular_component.Depth=7.tsv  molecular_function.Depth=4.tsv  Terms.xml
>>> cellular_component.Depth=1.tsv  cellular_component.Depth=8.tsv  molecular_function.Depth=5.tsv
>>> cellular_component.Depth=2.tsv  cellular_component.Depth=9.tsv  molecular_function.Depth=6.tsv
>>> cellular_component.Depth=3.tsv  molecular_function.Depth=0.tsv  molecular_function.Depth=7.tsv
>>> 
>>>> 
>>>> The obvious example is that the first few largest proportion numbers
>>>> add up to more than 2. They are all fractions like 0.5, 0.6 and so on.
>>>> Is there an error in my run or perhaps my interpretation?
>>>> 
>>> 
>>> Do you see this behavior if you look at a given depth and not at all the 
>>> depths at once ?
>>> 
>>>> Merci,
>>>> Jim
>>>> 
>>>> 
>>>> 
>>>> On Thu, Feb 14, 2013 at 10:59 AM, Sébastien Boisvert
>>>> <[email protected]> wrote:
>>>>> Hello,
>>>>> 
>>>>> On 02/14/2013 10:13 AM, jjv5 wrote:
>>>>>> Hello,
>>>>>> 
>>>>>> I have used ray-meta with -gene-ontology enabled after downloading GO
>>>>>> data using the Main.sh script in the git repo. Everything completed
>>>>>> fine and produced
>>>>>> expected output.
>>>>>> 
>>>>>> The result file Terms.tsv under BiologicalAbundances/_GeneOntology
>>>>>> contains proportions for the GO terms encountered. What is this
>>>>>> proportion number based on?
>>>>> 
>>>>> For plain genomes (via the -search command), proportions are computed by
>>>>> demultiplexing the signal based on uniquely colored kmers.
>>>>> 
>>>>> For taxonomy, the provided taxonomy tree is used to classify each 
>>>>> observed kmer
>>>>> at the vertex in the tree where the earliest common ancestor is found.
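
A minimal sketch of that classification step, assuming it amounts to finding the deepest vertex shared by all genomes a k-mer matches. The tiny tree and the function names are hypothetical and only serve as an illustration; they are not taken from Ray.

```python
# Hypothetical, heavily simplified taxonomy: child -> parent (None at the root).
parent = {
    "Escherichia coli": "Enterobacteriaceae",
    "Salmonella enterica": "Enterobacteriaceae",
    "Enterobacteriaceae": "Bacteria",
    "Bacillus subtilis": "Bacteria",
    "Bacteria": None,
}


def path_to_root(node):
    """List of vertices from the node up to the root, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def deepest_common_ancestor(leaves):
    """Deepest vertex that lies on the root path of every given leaf."""
    paths = [path_to_root(leaf) for leaf in leaves]
    shared = set(paths[0]).intersection(*paths[1:])
    # Walking upward from the first leaf, the first shared vertex is the deepest.
    for node in paths[0]:
        if node in shared:
            return node
    return None


print(deepest_common_ancestor(["Escherichia coli", "Salmonella enterica"]))
print(deepest_common_ancestor(["Escherichia coli", "Bacillus subtilis"]))
```

A k-mer found in a single genome stays at that leaf, while one shared by distant genomes moves up toward the root.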
>>>>> 
>>>>> For gene ontology, kmer observations are gathered for each ontology term, 
>>>>> and proportions
>>>>> are computed for each depth in the gene ontology directed acyclic graph.
>>>>> 
>>>>>> Proportion of what?
>>>>> 
>>>>> Of k-mers found in the de Bruijn subgraph that was built from the 
>>>>> sequence reads
>>>>> provided to Ray.
>>>>> 
>>>>> For example, if you want a number of bacterial cells, you need to further 
>>>>> normalize
>>>>> by genome length, and so on.
>>>>> 
>>>>>> The sum of the
>>>>>> proportion values in this file is some large integer.
>>>>> 
>>>>> In directories in BiologicalAbundances, a file called 
>>>>> SequenceAbundances.xml contains
>>>>> numerous counts.
>>>>> 
>>>>> These large integers are either a number of k-mers, or a number of k-mer 
>>>>> observations.
>>>>> A k-mer observation corresponds to a k-mer occurring 1 time.
>>>>> 
>>>>> So for a life form X, its kmer observations are computed as follows:
>>>>> 
>>>>> 1. Gather the k-mers that are unique (specific) to this life form X;
>>>>> 2. Compute an average number of observations (depth) for these objects;
>>>>> 3. For life form X, compute the number of matched k-mers in the graph, 
>>>>> regardless of whether they are unique (breadth);
>>>>> 4. With the number of matched objects (#3.) and average depth (#2.), the 
>>>>> demultiplexed number of k-mer observations is calculated.
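
The four steps above can be sketched as follows. This assumes the final step is a plain product of breadth and average depth, which the message does not state explicitly; the function name and the numbers are made up.

```python
from statistics import mean


def demultiplexed_observations(unique_kmer_depths, matched_kmers):
    """Estimate the k-mer observations of one life form.

    unique_kmer_depths: observed depth of each k-mer unique to the life form
                        (steps 1 and 2).
    matched_kmers:      number of its k-mers found in the graph, unique or not
                        (step 3).
    """
    average_depth = mean(unique_kmer_depths)
    # Assumption: step 4 multiplies breadth by average depth.
    return matched_kmers * average_depth


# Made-up example: three unique k-mers seen 10, 12 and 8 times,
# and 1000 matched k-mers in total.
observations = demultiplexed_observations([10, 12, 8], matched_kmers=1000)
print(observations)
```

Dividing such a value by the sum over all life forms would give a proportion, which is why small deviations from 1.0 can appear when the depth estimates are imperfect.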
>>>>> 
>>>>>> I expected the
>>>>>> sum to be 1.0.
>>>>> 
>>>>> Sometimes, it's a little bit more than 1.000 (like 1.00562), sometimes 
>>>>> it's a little bit less. This is
>>>>> because the demultiplexing process is not 100% accurate, but in general 
>>>>> it is really good.
>>>>> 
>>>>>     see http://genomebiology.com/2012/13/12/R122/abstract
>>>>> 
>>>>>> Is there further documentation somewhere?
>>>>> 
>>>>> The documentation lives mainly in 
>>>>> https://github.com/sebhtml/ray/tree/master/Documentation
>>>>> 
>>>>> For what you are doing, these are relevant:
>>>>> 
>>>>> * 
>>>>> https://github.com/sebhtml/ray/blob/master/Documentation/BiologicalAbundances.txt
>>>>> * 
>>>>> https://github.com/sebhtml/ray/blob/master/Documentation/NCBI-Taxonomy.txt
>>>>> * 
>>>>> https://github.com/sebhtml/ray/blob/master/Documentation/GeneOntology.txt
>>>>> * https://github.com/sebhtml/ray/blob/master/Documentation/Taxonomy.txt
>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> Jim
>>>>>> 
>>>>>> P.S. Thanks for making ray available. We like it a great deal.
>>>>>> 
>>>>> 
>>>>> Thanks !
>>>>> 
>>>>> It's nice to hear what our end users like (and what they don't like too 
>>>>> !).
>>>>> 
>>>>> 
>>>>> There is a ticket in progress to further increase the accuracy of Ray 
>>>>> Communities (
>>>>> the solution that tells you what's in your sample) using topology.
>>>>> 
>>>>>       https://github.com/sebhtml/ray/issues/133
>>>>> 
>>>>>> ------------------------------------------------------------------------------
>>>>>> Free Next-Gen Firewall Hardware Offer
>>>>>> Buy your Sophos next-gen firewall before the end March 2013
>>>>>> and get the hardware for free! Learn more.
>>>>>> http://p.sf.net/sfu/sophos-d2d-feb
>>>>>> _______________________________________________
>>>>>> Denovoassembler-users mailing list
>>>>>> [email protected]
>>>>>> https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
> 
> 

