Thanks for the reply. I see that ticket 158 also includes a reference to the negative proportions, so I assume you don't want me to create a brand new ticket for that problem. Please let me know if you want a ticket specifically for the negative proportions issue.
I also see that the ticket, as in the email chain below, just addresses the proportion-greater-than-one issue in the "Depth=" files and suggests that proportions should add up to 1 in "Terms.tsv," but I am seeing the issue in "Terms.tsv" and the three "Domain=" files. Should I even be expecting the sum of proportions to equal 1 in any of these Gene Ontology files, or will the sum likely always be > 1 because kmers may be able to contribute to more than one ontology term?

Thanks,
- Egon

On Apr 22, 2013, at 3:03 PM, Sébastien Boisvert wrote:

> On 22/04/13 12:57 PM, Egon Ozer wrote:
>> I'm also observing that gene ontology proportions aren't adding up to 1.
>> These are on runs in Ray v2.1.0 as I have not yet had time to re-run on
>> v2.2.0. I didn't see anything addressing this issue in the changes from
>> v2.1.0 to v2.2.0 you posted to the list, so I thought I'd check with you
>> before devoting the time and resources to re-running the data on the newest
>> version.
>>
>> When I sum up all the Proportions in the
>> BiologicalAbundances/_GeneOntology/Terms.tsv file, the result is 5.611734...
>> Sums from the 0.Profile.GeneOntologyDomain=biological_process.tsv,
>> ...molecular_function.tsv, and ...cellular_component.tsv files are 1.78,
>> 3.03, and 8.11, respectively. Is this perhaps because the same kmer is
>> mapping to multiple GeneOntology terms?
>
> For greater-than-one values, there is this ticket:
>
> https://github.com/sebhtml/ray/issues/158
>
> It has something to do with a recursive sum wherein a kmer contributes to more
> than one ontology term on its path from the root to the gene ontology term
> (a leaf).
>
>>
>> Just to mention, in the _GeneOntology folder, I'm noting that some of the
>> terms within the files show very large numbers of negative observations
>> (e.g. -1.623E+09 out of 1929700676 total observations).
>> I know you mentioned below that the files in the
>> _GeneOntology folder were experimental and you were considering not
>> including them in future releases, but
>> I wanted to point this out. There are no negative proportions in the
>> Terms.tsv file.
>
> Sounds like a bug.
>
>
> Can you create a ticket describing the problem:
>
> https://github.com/sebhtml/ray/issues/new
>
>>
>> Thanks for the help,
>> - Egon
>>
>
> Thanks.
>
>>
>> On Feb 14, 2013, at 1:56 PM, Sébastien Boisvert wrote:
>>
>>> [Please C.C. the mailing list]
>>>
>>> Hi James,
>>>
>>> On 02/14/2013 01:37 PM, James Vincent wrote:
>>>> Sébastien,
>>>>
>>>> Please pardon me for being obtuse,
>>>
>>> We call this science ;-)
>>>
>>> I do have a good explanation for this -- this is a known problem with Gene
>>> Ontology and you should stick to proportions that are not tied to a given
>>> depth, which are reported in BiologicalAbundances/_GeneOntology/Terms.xml
>>> (also available as Terms.tsv).
>>>
>>> See below.
>>>
>>>> but I do not see what you describe
>>>> in output files for even one level. I have attached the GeneOntology
>>>> output for one level of molecular function as an example. When I sort
>>>> this file by proportion the single largest proportion looks like this:
>>>>
>>>> #Identifier  Name                      Proportion  Observations  Total
>>>> GO:0016462   pyrophosphatase activity  4.69785     352616        75059
>>>
>>> In the documentation, only these files are described:
>>>
>>> In Documentation/GeneOntology.txt
>>>
>>> <RayOutput>/BiologicalAbundances/_GeneOntology/Terms.xml
>>> <RayOutput>/BiologicalAbundances/_GeneOntology/Terms.tsv
>>>
>>>
>>> Each GO term has a depth (in fact, each term can have many paths to the
>>> root, each with a possibly different depth).
>>>
>>> If you take a taxonomy tree, k-mers are attached to leaves (genomes).
>>> That's a nice thing because k-mers can be classified at different levels
>>> without reusing biological signal.
>>>
>>> In Gene Ontology, k-mers are attached to any term at any level.
>>>
>>> There can be a given k-mer that is attached to all the terms from the root
>>> to a particular term.
>>>
>>> In the design of Gene Ontology, each term can have an arbitrary number of
>>> parents in the directed acyclic graph.
>>>
>>> If you use Terms.tsv or Terms.xml or 0.Profile.GeneOntologyDomain.*, you
>>> should not be getting this behavior.
>>>
>>> But if you use the counts in the files for specific depths, then the count
>>> is recursive. Very often, the ratio "recursive count / total" is
>>> meaningless (that's why our documentation points to Terms.xml).
>>>
>>>>
>>>> The proportion number is 4.69, with 352,616 observations out of only
>>>> 75,059 total. This is the part that is confusing.
>>>>
>>>
>>> I see.
>>>
>>> For a given term, a set of k-mers is associated with it. But any child
>>> of the term can also have the same k-mers attached to it.
>>>
>>> Therefore, the recursive counts do not make sense in some cases as the code
>>> counts the same thing over and over again.
>>>
>>> In our paper in Genome Biology, we used Terms.xml, not the recursive counts.
>>> We should probably remove the files at particular depths because sometimes
>>> the ratios are not relevant.
>>>
>>> The counts for each depth are more like an experimental feature.
>>>
>>> You should use results from Terms.xml (or Terms.tsv).
>>>
>>> From our Genome Biology paper:
>>>
>>> "Gene ontology profiling -- The de Bruijn graph was colored with coding
>>> sequences from the EMBL nucleotide sequence database [48]
>>> (EMBL CDS), which are mapped to gene ontology by transitivity using the
>>> uniprot mapping to gene ontology
>>> [49]. For each ontology term, coverage depths of colored k-mers were added
>>> to obtain its total number of k-mer observations."
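To make the double counting concrete, here is a minimal sketch with made-up numbers. The toy graph, the term names, and the counts are all hypothetical; nothing here is taken from Ray's code or output. It only illustrates the general effect described above: once a term's observations are also added to every ancestor, the proportions no longer sum to 1. The last ratio additionally assumes that the "total" for a depth only includes observations attached directly at that depth, which is a guess about the denominator and is not stated anywhere in this thread.

```python
# Toy example only: term names, graph shape, and counts are hypothetical.

# Each term lists its parents; a term may have several (directed acyclic graph).
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalytic_activity": ["root"],
    "atp_binding": ["binding", "catalytic_activity"],
}

# k-mer observations attached directly to each term.
DIRECT = {"root": 0, "binding": 10, "catalytic_activity": 20, "atp_binding": 70}


def ancestors(term):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def recursive_counts():
    """Add each term's direct observations to all of its ancestors as well."""
    totals = dict(DIRECT)
    for term, observations in DIRECT.items():
        for ancestor in ancestors(term):
            totals[ancestor] += observations
    return totals


total = sum(DIRECT.values())
recursive = recursive_counts()

# Proportions from direct counts versus proportions from recursive counts.
direct_sum = sum(n / total for n in DIRECT.values())
recursive_sum = sum(n / total for n in recursive.values())

# Assumed denominator: observations attached directly at depth 1 only.
depth1_total = DIRECT["binding"] + DIRECT["catalytic_activity"]
binding_ratio = recursive["binding"] / depth1_total

print(f"sum of direct proportions:    {direct_sum:.2f}")
print(f"sum of recursive proportions: {recursive_sum:.2f}")
print(f"binding, recursive / depth-1 total: {binding_ratio:.2f}")
```

With these toy numbers the direct proportions sum to 1.00, the recursive ones to 3.40, and the single "binding" ratio comes out at 2.67, which has the same flavor as the 4.69 reported for GO:0016462.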
>>>
>>>
>>> Example of a term in the XML file:
>>>
>>>
>>> <domain>molecular_function</domain>
>>> <paths><count>1</count>
>>> <path>
>>> <geneOntologyTerm><identifier>GO:0003674</identifier><name>molecular_function</name></geneOntologyTerm>
>>> <geneOntologyTerm><identifier>GO:0005488</identifier><name>binding</name></geneOntologyTerm>
>>> </path>
>>> </paths>
>>> <modeKmerCoverage>2</modeKmerCoverage><meanKmerCoverage>3.34305</meanKmerCoverage>
>>> <totalColoredKmerObservations>382976</totalColoredKmerObservations>
>>> <proportion>0.022287</proportion>
>>> <distribution>
>>> #Coverage Frequency
>>> 2 62170
>>> 3 37829
>>> 4 8136
>>> 5 3333
>>> 6 1159
>>> 7 540
>>> 8 329
>>> 9 147
>>> 10 160
>>> 11 89
>>> 12 79
>>> 13 67
>>> 14 45
>>>
>>>
>>>> Jim
>>>>
>>>> P.S. We love Quebec - spent our honeymoon there (oh so long ago) and
>>>> camped around the peninsula.
>>>>
>>>
>>> Cool !
>>>
>>> I hope I have explained as clearly as possible the behavior you are
>>> observing. If you need more information, please post again on the list.
>>>
>>> -Sébastien
>>>
>>>>
>>>>
>>>> On Thu, Feb 14, 2013 at 12:00 PM, Sébastien Boisvert
>>>> <[email protected]> wrote:
>>>>> Hi,
>>>>>
>>>>> On 02/14/2013 11:52 AM, James Vincent wrote:
>>>>>> Hi Sébastien,
>>>>>>
>>>>>> Thanks very much for your quick and detailed reply.
>>>>>>
>>>>>> I understand the details of proportion calculations and what they
>>>>>> are, but that does not square with the output files.
>>>>>>
>>>>>> The sum of proportions in the file Terms.tsv, for example, is 55. It
>>>>>> is not slightly off from 1. In other GO output files the sum of
>>>>>> proportions is a similarly large number, 50, 60 or more.
>>>>>
>>>>> The file Terms.tsv contains all levels of depth in the directed acyclic
>>>>> graph of Gene Ontology.
>>>>>
>>>>> If you take a particular depth, you should see something near 100%.
>>>>>
>>>>> Relevant files:
>>>>>
>>>>> $ ls|grep GeneO
>>>>> 0.Profile.GeneOntologyDomain=biological_process.tsv
>>>>> 0.Profile.GeneOntologyDomain=cellular_component.tsv
>>>>> 0.Profile.GeneOntologyDomain=molecular_function.tsv
>>>>> _GeneOntology
>>>>>
>>>>> $ ls _GeneOntology/
>>>>> biological_process.Depth=0.tsv cellular_component.Depth=4.tsv
>>>>> molecular_function.Depth=1.tsv molecular_function.Depth=8.tsv
>>>>> biological_process.Depth=1.tsv cellular_component.Depth=5.tsv
>>>>> molecular_function.Depth=2.tsv molecular_function.Depth=9.tsv
>>>>> biological_process.Depth=2.tsv cellular_component.Depth=6.tsv
>>>>> molecular_function.Depth=3.tsv Terms.tsv
>>>>> cellular_component.Depth=0.tsv cellular_component.Depth=7.tsv
>>>>> molecular_function.Depth=4.tsv Terms.xml
>>>>> cellular_component.Depth=1.tsv cellular_component.Depth=8.tsv
>>>>> molecular_function.Depth=5.tsv
>>>>> cellular_component.Depth=2.tsv cellular_component.Depth=9.tsv
>>>>> molecular_function.Depth=6.tsv
>>>>> cellular_component.Depth=3.tsv molecular_function.Depth=0.tsv
>>>>> molecular_function.Depth=7.tsv
>>>>>
>>>>>>
>>>>>> The obvious example is that the first few largest proportion numbers
>>>>>> add up to more than 2. They are all fractions like 0.5, 0.6 and so on.
>>>>>> Is there an error in my run or perhaps my interpretation?
>>>>>>
>>>>>
>>>>> Do you see this behavior if you look at a given depth and not at all the
>>>>> depths at once ?
>>>>>
>>>>>> Merci,
>>>>>> Jim
>>>>>>
>>>>>>
>>>>>>
>>>>>>>> I expected the
>>>>>>>> sum to be 1.0.
>>>>>>>
>>>>>>> Sometimes, it's a little bit more than 1.000 (like 1.00562), sometimes
>>>>>>> it's a little bit less. This is
>>>>>>> because the demultiplexing process is not 100% accurate, but in general
>>>>>>> it is really good.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 14, 2013 at 10:59 AM, Sébastien Boisvert
>>>>>> <[email protected]> wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> On 02/14/2013 10:13 AM, jjv5 wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I have used ray-meta with -gene-ontology enabled after downloading GO
>>>>>>>> data using the Main.sh script in the git repo. Everything completed
>>>>>>>> fine and produced expected output.
>>>>>>>>
>>>>>>>> The result file Terms.tsv under BiologicalAbundances/_GeneOntology
>>>>>>>> contains proportions for the GO terms encountered. What is this
>>>>>>>> proportion number based on?
>>>>>>>
>>>>>>> For plain genomes (via the -search command), proportions are computed by
>>>>>>> demultiplexing the signal based on uniquely colored kmers.
>>>>>>>
>>>>>>> For taxonomy, the provided taxonomy tree is used to classify each
>>>>>>> observed kmer
>>>>>>> at the vertex in the tree where the earliest common ancestor is found.
>>>>>>>
>>>>>>> For gene ontology, kmer observations are gathered for each ontology
>>>>>>> term, and proportions
>>>>>>> are computed for each depth in the gene ontology directed acyclic graph.
>>>>>>>
>>>>>>>> Proportion of what?
>>>>>>>
>>>>>>> Of k-mers found in the de Bruijn subgraph that was built from the
>>>>>>> sequence reads provided to Ray.
>>>>>>>
>>>>>>> For example, if you want a number of bacterial cells, you need to
>>>>>>> further normalize by genome length, and so on.
>>>>>>>
>>>>>>>> The sum of the
>>>>>>>> proportion values in this file is some large integer.
>>>>>>>
>>>>>>> In directories in BiologicalAbundances, a file called
>>>>>>> SequenceAbundances.xml contains numerous counts.
>>>>>>>
>>>>>>> These large integers are either a number of k-mers, or a number of
>>>>>>> k-mer observations.
>>>>>>> A k-mer observation corresponds to a k-mer occurring one time.
>>>>>>>
>>>>>>> So for a life form X, its kmer observations are computed as follows:
>>>>>>>
>>>>>>> 1. Gather the k-mers that are unique (specific) to this life form X;
>>>>>>> 2. Compute an average number of observations (depth) for these objects;
>>>>>>> 3. For life form X, compute the number of matched k-mers in the graph,
>>>>>>> regardless of whether they are unique (breadth);
>>>>>>> 4. With the number of matched objects (#3.) and average depth (#2.), the
>>>>>>> demultiplexed number of k-mer observations is calculated.
>>>>>>>
>>>>>>>> I expected the
>>>>>>>> sum to be 1.0.
>>>>>>>
>>>>>>> Sometimes, it's a little bit more than 1.000 (like 1.00562), sometimes
>>>>>>> it's a little bit less. This is
>>>>>>> because the demultiplexing process is not 100% accurate, but in general
>>>>>>> it is really good.
>>>>>>>
>>>>>>> See http://genomebiology.com/2012/13/12/R122/abstract
>>>>>>>
>>>>>>>> Is there further documentation somewhere?
>>>>>>>
>>>>>>> The documentation lives mainly in
>>>>>>> https://github.com/sebhtml/ray/tree/master/Documentation
>>>>>>>
>>>>>>> For what you are doing, these are relevant:
>>>>>>>
>>>>>>> * https://github.com/sebhtml/ray/blob/master/Documentation/BiologicalAbundances.txt
>>>>>>> * https://github.com/sebhtml/ray/blob/master/Documentation/NCBI-Taxonomy.txt
>>>>>>> * https://github.com/sebhtml/ray/blob/master/Documentation/GeneOntology.txt
>>>>>>> * https://github.com/sebhtml/ray/blob/master/Documentation/Taxonomy.txt
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Jim
>>>>>>>>
>>>>>>>> P.S. Thanks for making ray available. We like it a great deal.
>>>>>>>>
>>>>>>>
>>>>>>> Thanks !
>>>>>>>
>>>>>>> It's nice to hear what our end users like (and what they don't like too !).
>>>>>>>
>>>>>>>
>>>>>>> There is a ticket in progress to further increase the accuracy of Ray
>>>>>>> Communities (the solution that tells you what's in your sample) using
>>>>>>> topology.
>>>>>>>
>>>>>>> https://github.com/sebhtml/ray/issues/133
>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Denovoassembler-users mailing list
>>>>>>>> [email protected]
>>>>>>>> https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
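For anyone who wants to repeat the check described in this thread (summing the Proportion column of a Gene Ontology output file), the following is a minimal sketch. It assumes the file is tab-separated with a header line starting with "#" and a column named "Proportion", as in the per-depth excerpt quoted above; whether Terms.tsv and the "Domain=" files use exactly the same header is an assumption. The function name sum_proportions is made up for this example.

```python
import csv
import io


def sum_proportions(handle, column="Proportion"):
    """Sum one numeric column of a tab-separated file with a '#' header line."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    header[0] = header[0].lstrip("#")  # "#Identifier" -> "Identifier"
    index = header.index(column)
    total = 0.0
    for row in reader:
        # Skip blank lines and any further comment lines.
        if not row or row[0].startswith("#"):
            continue
        total += float(row[index])
    return total


# The single data row quoted earlier in the thread, used as sample input.
SAMPLE = (
    "#Identifier\tName\tProportion\tObservations\tTotal\n"
    "GO:0016462\tpyrophosphatase activity\t4.69785\t352616\t75059\n"
)

print(sum_proportions(io.StringIO(SAMPLE)))
```

On real output, the same function can be called on an open file, for example open("BiologicalAbundances/_GeneOntology/Terms.tsv", newline=""), and repeated for each of the three 0.Profile.GeneOntologyDomain=*.tsv files to compare the sums.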
