I'm also observing that gene ontology proportions aren't adding up to 1. These are on runs in Ray v2.1.0 as I have not yet had time to re-run on v2.2.0. I didn't see anything addressing this issue in the changes from v2.1.0 to v2.2.0 you posted to the list, so I thought I'd check with you before devoting the time and resources to re-running the data on the newest version.
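For reference, column sums like the ones reported in this message can be computed with a short script. This is a minimal sketch, assuming the file is tab-separated and that its first line is a '#'-prefixed header containing a Proportion column (as in the depth-file header quoted further down in this thread). The function name sum_column is my own and is not part of Ray.

```python
import csv
import math


def sum_column(lines, column="Proportion"):
    """Sum one numeric column of a tab-separated table.

    `lines` is any iterable of text lines (an open file handle works).
    The first line is assumed to be a header, optionally starting with '#'.
    """
    reader = csv.reader(lines, delimiter="\t")
    # Strip the leading '#' so "#Identifier" becomes "Identifier".
    header = [name.lstrip("#").strip() for name in next(reader)]
    index = header.index(column)
    # fsum limits floating-point rounding error over many small values.
    return math.fsum(float(row[index]) for row in reader if len(row) > index)
```

Calling sum_column on an open handle for Terms.tsv, or for one of the 0.Profile.GeneOntologyDomain=*.tsv files, gives one sum per file.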
When I sum up all the Proportions in the BiologicalAbundances/_GeneOntology/Terms.tsv file, the result is 5.611734... Sums from the 0.Profile.GeneOntologyDomain=biological_process.tsv, ...molecular_function.tsv, and ...cellular_component.tsv files are 1.78, 3.03, and 8.11, respectively. Is this perhaps because the same kmer is mapping to multiple GeneOntology terms?

Just to mention, in the _GeneOntology folder, I'm noting that some of the terms within the files show very large numbers of negative observations (e.g. -1.623E+09 out of 1929700676 total observations). I know you mentioned below that the files in the _GeneOntology folder were experimental and you were considering not including them in future releases, but I wanted to point this out. There are no negative proportions in the Terms.tsv file.

Thanks for the help,
- Egon

On Feb 14, 2013, at 1:56 PM, Sébastien Boisvert wrote:

> [Please C.C. the mailing list]
>
> Hi James,
>
> On 02/14/2013 01:37 PM, James Vincent wrote:
>> Sébastien,
>>
>> Please pardon me for being obtuse,
>
> We call this science ;-)
>
> I do have a good explanation for this -- this is a known problem with
> Gene Ontology and you should stick to proportions that are not seated
> at a given depth, which are reported in
> BiologicalAbundances/_GeneOntology/Terms.xml
> (also available as Terms.tsv).
>
> See below.
>
>> but I do not see what you describe
>> in output files for even one level. I have attached the GeneOntology
>> output for one level of molecular function as an example.
>> When I sort this file by proportion the single largest proportion
>> looks like this:
>>
>> #Identifier  Name                      Proportion  Observations  Total
>> GO:0016462   pyrophosphatase activity  4.69785     352616        75059
>
> In the documentation, only these files are described:
>
> In Documentation/GeneOntology.txt
>
> <RayOutput>/BiologicalAbundances/_GeneOntology/Terms.xml
> <RayOutput>/BiologicalAbundances/_GeneOntology/Terms.tsv
>
> Each GO term has a depth (in fact, each term can have many paths to
> the root, each with a possibly different depth).
>
> If you take a taxonomy tree, k-mers are attached to leaves (genomes).
> That's a nice thing because k-mers can be classified at different
> levels without reusing biological signal.
>
> In Gene Ontology, k-mers are attached to any term at any level.
>
> There can be a given k-mer that is attached to all the terms from the
> root to a particular term.
>
> In the design of Gene Ontology, each term can have an arbitrary number
> of parents in the directed acyclic graph.
>
> If you use Terms.tsv or Terms.xml or 0.Profile.GeneOntologyDomain.*,
> you should not be getting this behavior.
>
> But if you use the counts in the files for specific depths, then the
> count is recursive. Really often, the ratio "recursive count / total"
> is devoid of sense (that's why our documentation points to Terms.xml).
>
>>
>> The proportion number is 4.69, with 352,616 observations out of only
>> 75,059 total. This is the part that is confusing.
>>
>
> I see.
>
> For a given term, a set of k-mers is associated with it. But any
> children of the term can also have the same k-mers attached to them.
>
> Therefore, the recursive counts do not make sense in some cases, as
> the code counts the same thing over and over again.
>
> In our paper in Genome Biology, we used Terms.xml, not the recursive
> counts. We should probably remove the files at particular depths
> because sometimes the ratios are not relevant.
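To make the double counting concrete, here is a toy sketch in Python. It is not Ray's code. The three-term path, the observation counts, and the helper recursive_count are all hypothetical, and it assumes that a recursive count is a term's own k-mer observations plus those of all its descendants.

```python
# Hypothetical toy ontology: each term maps to its list of children.
children = {
    "molecular_function": ["binding"],
    "binding": ["hypothetical_child"],
    "hypothetical_child": [],
}

# Hypothetical case: the same 100 k-mer observations are attached
# to every term on the path from the root down to the child.
direct = {
    "molecular_function": 100,
    "binding": 100,
    "hypothetical_child": 100,
}

total = 100  # distinct k-mer observations in this toy sample


def recursive_count(term):
    """Own observations plus the recursive counts of all children."""
    return direct[term] + sum(recursive_count(c) for c in children[term])


for term in children:
    # The ratio exceeds 1 wherever descendants carry the same k-mers.
    print(term, recursive_count(term) / total)
```

In this toy case the root gets a ratio of 3.0, "binding" 2.0, and the child 1.0, even though only 100 distinct observations exist. That is the same kind of ratio above 1 as the 4.69785 reported for GO:0016462 in the depth file.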
>
> The counts for each depth are more like an experimental feature.
>
> You should use results from Terms.xml (or Terms.tsv).
>
> From our Genome Biology paper:
>
> "Gene ontology profiling -- The de Bruijn graph was colored with
> coding sequences from the EMBL nucleotide sequence database [48]
> (EMBL CDS), which are mapped to gene ontology by transitivity using
> the uniprot mapping to gene ontology [49]. For each ontology term,
> coverage depths of colored k-mers were added to obtain its total
> number of k-mer observations."
>
>
> Example of a term in the XML file:
>
>
> <domain>molecular_function</domain>
> <paths><count>1</count>
> <path>
> <geneOntologyTerm><identifier>GO:0003674</identifier><name>molecular_function</name></geneOntologyTerm>
> <geneOntologyTerm><identifier>GO:0005488</identifier><name>binding</name></geneOntologyTerm>
> </path>
> </paths>
> <modeKmerCoverage>2</modeKmerCoverage><meanKmerCoverage>3.34305</meanKmerCoverage>
> <totalColoredKmerObservations>382976</totalColoredKmerObservations>
> <proportion>0.022287</proportion>
> <distribution>
> #Coverage Frequency
> 2 62170
> 3 37829
> 4 8136
> 5 3333
> 6 1159
> 7 540
> 8 329
> 9 147
> 10 160
> 11 89
> 12 79
> 13 67
> 14 45
>
>
>> Jim
>>
>> P.S. We love Quebec - spent our honeymoon there (oh so long ago) and
>> camped around the peninsula.
>>
>
> Cool !
>
> I hope I have answered as clearly as possible the behavior you are
> observing. If you need more information, please post again on the list.
>
> -Sébastien
>
>>
>>
>> On Thu, Feb 14, 2013 at 12:00 PM, Sébastien Boisvert
>> <[email protected]> wrote:
>>> Hi,
>>>
>>> On 02/14/2013 11:52 AM, James Vincent wrote:
>>>> Hi Sébastien,
>>>>
>>>> Thanks very much for your quick and detailed reply.
>>>>
>>>> I understand the details of proportion calculations and what they
>>>> are, but that does not square with the output files.
>>>>
>>>> The sum of proportions in the file Terms.tsv, for example, is 55. It
>>>> is not slightly off from 1.
>>>> In other GO output files the sum of
>>>> proportions is a similarly large number, 50, 60 or more.
>>>
>>> The file Terms.tsv contains all levels of depth in the directed
>>> acyclic graph of Gene Ontology.
>>>
>>> If you take a particular depth, you should see something near 100%.
>>>
>>> Relevant files:
>>>
>>> $ ls|grep GeneO
>>> 0.Profile.GeneOntologyDomain=biological_process.tsv
>>> 0.Profile.GeneOntologyDomain=cellular_component.tsv
>>> 0.Profile.GeneOntologyDomain=molecular_function.tsv
>>> _GeneOntology
>>>
>>> $ ls _GeneOntology/
>>> biological_process.Depth=0.tsv  cellular_component.Depth=4.tsv
>>> molecular_function.Depth=1.tsv  molecular_function.Depth=8.tsv
>>> biological_process.Depth=1.tsv  cellular_component.Depth=5.tsv
>>> molecular_function.Depth=2.tsv  molecular_function.Depth=9.tsv
>>> biological_process.Depth=2.tsv  cellular_component.Depth=6.tsv
>>> molecular_function.Depth=3.tsv  Terms.tsv
>>> cellular_component.Depth=0.tsv  cellular_component.Depth=7.tsv
>>> molecular_function.Depth=4.tsv  Terms.xml
>>> cellular_component.Depth=1.tsv  cellular_component.Depth=8.tsv
>>> molecular_function.Depth=5.tsv
>>> cellular_component.Depth=2.tsv  cellular_component.Depth=9.tsv
>>> molecular_function.Depth=6.tsv
>>> cellular_component.Depth=3.tsv  molecular_function.Depth=0.tsv
>>> molecular_function.Depth=7.tsv
>>>
>>>>
>>>> The obvious example is that the first few largest proportion numbers
>>>> add up to more than 2. They are all fractions like 0.5, 0.6 and so on.
>>>> Is there an error in my run or perhaps my interpretation?
>>>>
>>>
>>> Do you see this behavior if you look at a given depth and not at all
>>> the depths at once ?
>>>
>>>> Merci,
>>>> Jim
>>>>
>>>>
>>>>
>>>>>> I expected the
>>>>>> sum to be 1.0.
>>>>>
>>>>> Sometimes, it's a little bit more than 1.000 (like 1.00562),
>>>>> sometimes it's a little bit less. This is because the
>>>>> demultiplexing process is not 100% accurate, but in general
>>>>> it is really good.
>>>>
>>>>
>>>> On Thu, Feb 14, 2013 at 10:59 AM, Sébastien Boisvert
>>>> <[email protected]> wrote:
>>>>> Hello,
>>>>>
>>>>> On 02/14/2013 10:13 AM, jjv5 wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I have used ray-meta with -gene-ontology enabled after downloading GO
>>>>>> data using the Main.sh script in the git repo. Everything completed
>>>>>> fine and produced expected output.
>>>>>>
>>>>>> The result file Terms.tsv under BiologicalAbundances/_GeneOntology
>>>>>> contains proportions for the GO terms encountered. What is this
>>>>>> proportion number based on?
>>>>>
>>>>> For plain genomes (via the -search command), proportions are
>>>>> computed by demultiplexing the signal based on uniquely colored
>>>>> kmers.
>>>>>
>>>>> For taxonomy, the provided taxonomy tree is used to classify each
>>>>> observed kmer at the vertex in the tree where the earliest common
>>>>> ancestor is found.
>>>>>
>>>>> For gene ontology, kmer observations are gathered for each ontology
>>>>> term, and proportions are computed for each depth in the gene
>>>>> ontology directed acyclic graph.
>>>>>
>>>>>> Proportion of what?
>>>>>
>>>>> Of k-mers found in the de Bruijn subgraph that was built from the
>>>>> sequence reads provided to Ray.
>>>>>
>>>>> For example, if you want a number of bacterial cells, you need to
>>>>> further normalize by genome length, and so on.
>>>>>
>>>>>> The sum of the
>>>>>> proportion values in this file is some large integer.
>>>>>
>>>>> In directories in BiologicalAbundances, a file called
>>>>> SequenceAbundances.xml contains numerous counts.
>>>>>
>>>>> These large integers are either a number of k-mers, or a number of
>>>>> k-mer observations.
>>>>> A k-mer observation corresponds to a k-mer occurring 1 time.
>>>>>
>>>>> So for a life form X, its kmer observations are computed as follows:
>>>>>
>>>>> 1. Gather the k-mers that are unique (specific) to this life form X;
>>>>> 2. Compute an average number of observations (depth) for these
>>>>>    objects;
>>>>> 3. For life form X, compute the number of matched k-mers in the
>>>>>    graph, regardless of whether they are unique (breadth);
>>>>> 4. With the number of matched objects (#3.) and average depth (#2.),
>>>>>    the demultiplexed number of k-mer observations is calculated.
>>>>>
>>>>>> I expected the
>>>>>> sum to be 1.0.
>>>>>
>>>>> Sometimes, it's a little bit more than 1.000 (like 1.00562),
>>>>> sometimes it's a little bit less. This is because the
>>>>> demultiplexing process is not 100% accurate, but in general
>>>>> it is really good.
>>>>>
>>>>> see http://genomebiology.com/2012/13/12/R122/abstract
>>>>>
>>>>>> Is there further documentation somewhere?
>>>>>
>>>>> The documentation lives mainly in
>>>>> https://github.com/sebhtml/ray/tree/master/Documentation
>>>>>
>>>>> For what you are doing, these are relevant:
>>>>>
>>>>> * https://github.com/sebhtml/ray/blob/master/Documentation/BiologicalAbundances.txt
>>>>> * https://github.com/sebhtml/ray/blob/master/Documentation/NCBI-Taxonomy.txt
>>>>> * https://github.com/sebhtml/ray/blob/master/Documentation/GeneOntology.txt
>>>>> * https://github.com/sebhtml/ray/blob/master/Documentation/Taxonomy.txt
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Jim
>>>>>>
>>>>>> P.S. Thanks for making ray available. We like it a great deal.
>>>>>>
>>>>>
>>>>> Thanks !
>>>>>
>>>>> It's nice to hear what our end users like (and what they don't like
>>>>> too !).
>>>>>
>>>>>
>>>>> There is a ticket in progress to further increase the accuracy of
>>>>> Ray Communities (the solution that tells you what's in your sample)
>>>>> using topology.
>>>>>
>>>>> https://github.com/sebhtml/ray/issues/133
>>>>>
>>>>>> ------------------------------------------------------------------------------
>>>>>> Free Next-Gen Firewall Hardware Offer
>>>>>> Buy your Sophos next-gen firewall before the end March 2013
>>>>>> and get the hardware for free! Learn more.
>>>>>> http://p.sf.net/sfu/sophos-d2d-feb
>>>>>> _______________________________________________
>>>>>> Denovoassembler-users mailing list
>>>>>> [email protected]
>>>>>> https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
