On 22/04/13 04:25 PM, Egon Ozer wrote:
> Thanks for the reply.
>
> I see that ticket 158 also includes reference to the negative proportions, so
> I assume you don't want me to create a brand new ticket for that problem.
> Please let me know if you want a ticket specifically for the negative
> proportions issue.
>

OK. You can add your information to that ticket, then.

> I also see that the ticket, as in the email chain below, just addresses the
> proportion-greater-than-one issue in the "Depth=" files and suggests
> that proportions should add up to 1 in "Terms.tsv," but I am seeing the
> issue in the "Terms.tsv" and the three "Domain=" files.
> Should I even be expecting the sum of proportions to equal 1 in any of these
> Gene Ontology files, or will the sum likely always be greater than 1
> because kmers may be able to contribute to more than one ontology term?

The problem seems to stem from the fact that a kmer can contribute to more
than one GO term (as you already said). Because of that, gene ontology
proportions are not designed to sum to 1. On the other hand, for a given
GO term, its proportion is presumably comparable across samples.
Anyway, that's what I think about that.

So the two issues are:

1. The recursive proportion for a GO term (that is, the sum of its own
   proportion and of those of all its children, grandchildren, and so on)
   can be greater than 1 because, on the path from the root to a leaf,
   the same kmer can contribute to more than one GO term.
   In the taxonomy case, kmers are allocated to the leaves of the
   taxonomic tree, so the problem does not arise for taxonomy.

2. The sum of non-recursive proportions can be larger than 1. This is
   mostly due to the same fact that causes issue #1 above: a kmer can be
   allocated to more than one GO term. The consequence is that GO
   proportions are not summable.

A fix would be to put kmers only on GO terms that are leaves.
But that may reduce what you get out of the data.

>
> Thanks,
> - Egon
>
>
> On Apr 22, 2013, at 3:03 PM, Sébastien Boisvert wrote:
>
>> On 22/04/13 12:57 PM, Egon Ozer wrote:
>>> I'm also observing that gene ontology proportions aren't adding up to 1.
>>> These are on runs in Ray v2.1.0 as I have not yet had time to re-run on
>>> v2.2.0.
>>> I didn't see anything addressing this issue in the changes from
>>> v2.1.0 to v2.2.0 you posted to the list, so I thought I'd check with you
>>> before devoting the time and resources to re-running the data on the newest
>>> version.
>>>
>>> When I sum up all the Proportions in the
>>> BiologicalAbundances/_GeneOntology/Terms.tsv file, the result is
>>> 5.611734... Sums from the
>>> 0.Profile.GeneOntologyDomain=biological_process.tsv,
>>> ...molecular_function.tsv, and ...cellular_component.tsv files are 1.78,
>>> 3.03, and 8.11, respectively. Is this perhaps because the same kmer is
>>> mapping to multiple GeneOntology terms?
>>
>> For greater-than-one values, there is this ticket:
>>
>> https://github.com/sebhtml/ray/issues/158
>>
>> It has something to do with a recursive sum wherein a kmer contributes to
>> more than one ontology term
>> on its path from the root to the gene ontology term (a leaf).
>>
>>>
>>> Just to mention, in the _GeneOntology folder, I'm noting that some of the
>>> terms within the files show very large numbers of negative observations
>>> (i.e. -1.623E+09 out of 1929700676
>>> total observations). I know you mentioned below that the files in the
>>> _GeneOntology folder were experimental and you were considering not
>>> including them in future releases, but
>>> I wanted to point this out. There are no negative proportions in the
>>> Terms.tsv file.
>>
>> Sounds like a bug.
>>
>>
>> Can you create a ticket describing the problem:
>>
>> https://github.com/sebhtml/ray/issues/new
>>
>>>
>>> Thanks for the help,
>>> - Egon
>>>
>>
>> Thanks.
>>
>>>
>>> On Feb 14, 2013, at 1:56 PM, Sébastien Boisvert wrote:
>>>
>>>> [Please C.C. the mailing list]
>>>>
>>>> Hi James,
>>>>
>>>> On 02/14/2013 01:37 PM, James Vincent wrote:
>>>>> Sébastien,
>>>>>
>>>>> Please pardon me for being obtuse,
>>>>
>>>> We call this science ;-)
>>>>
>>>> I do have a good explanation for this -- this is a known problem with Gene
>>>> Ontology and you should
>>>> stick to proportions that are not seated at a given depth, which are
>>>> reported in BiologicalAbundances/_GeneOntology/Terms.xml
>>>> (also available as Terms.tsv).
>>>>
>>>> See below.
>>>>
>>>>> but I do not see what you describe
>>>>> in output files for even one level. I have attached the GeneOntology
>>>>> output for one level of molecular function as an example. When I sort
>>>>> this file by proportion the single largest proportion looks like this:
>>>>>
>>>>> #Identifier Name Proportion Observations Total
>>>>> GO:0016462 pyrophosphatase activity 4.69785 352616 75059
>>>>
>>>> In the documentation, only these files are described:
>>>>
>>>> In Documentation/GeneOntology.txt
>>>>
>>>> <RayOutput>/BiologicalAbundances/_GeneOntology/Terms.xml
>>>> <RayOutput>/BiologicalAbundances/_GeneOntology/Terms.tsv
>>>>
>>>>
>>>> Each GO term has a depth (in fact, each term can have many paths to the
>>>> root, each with a
>>>> possibly different depth).
>>>>
>>>> If you take a taxonomy tree, k-mers are attached to leaves (genomes).
>>>> That's a nice thing
>>>> because k-mers can be classified at different levels without reusing
>>>> biological signal.
>>>>
>>>> In Gene Ontology, k-mers are attached to any term at any level.
>>>>
>>>> There can be a given k-mer that is attached to all the terms from the root
>>>> to
>>>> a particular term.
>>>>
>>>> In the design of Gene Ontology, each term can have an arbitrary number of
>>>> parents in
>>>> the directed acyclic graph.
>>>>
>>>> If you use Terms.tsv or Terms.xml or 0.Profile.GeneOntologyDomain.*, you
>>>> should not be getting this behavior.
>>>>
>>>> But if you use the counts in the files for specific depths, then the count
>>>> is recursive. Really often, the
>>>> ratio "recursive count / total" is devoid of sense (that's why our
>>>> documentation points to Terms.xml).
>>>>
>>>>>
>>>>> The proportion number is 4.69, with 352,616 observations out of only
>>>>> 75,059 total. This is the part that is confusing.
>>>>>
>>>>
>>>> I see.
>>>>
>>>> For a given term, a set of k-mers is associated with it. But any child
>>>> of the term can also
>>>> have the same k-mers attached to it.
>>>>
>>>> Therefore, the recursive counts do not make sense in some cases as the
>>>> code counts the same thing over and over
>>>> again.
>>>>
>>>> In our paper in Genome Biology, we used Terms.xml, not the recursive
>>>> counts.
>>>> We should probably remove the files at particular depths because sometimes
>>>> the ratios are not relevant.
>>>>
>>>> The counts for each depth are more like an experimental feature.
>>>>
>>>> You should use results from Terms.xml (or Terms.tsv).
>>>>
>>>> From our Genome Biology paper:
>>>>
>>>> "Gene ontology profiling -- The de Bruijn graph was colored with coding
>>>> sequences from the EMBL nucleotide sequence database [48]
>>>> (EMBL CDS), which are mapped to gene ontology by transitivity using the
>>>> uniprot mapping to gene ontology
>>>> [49]. For each ontology term, coverage depths of colored k-mers were added
>>>> to obtain its total number of
>>>> k-mer observations."
>>>>
>>>>
>>>> Example of a term in the XML file:
>>>>
>>>>
>>>> <domain>molecular_function</domain>
>>>> <paths><count>1</count>
>>>> <path>
>>>> <geneOntologyTerm><identifier>GO:0003674</identifier><name>molecular_function</name></geneOntologyTerm>
>>>> <geneOntologyTerm><identifier>GO:0005488</identifier><name>binding</name></geneOntologyTerm>
>>>> </path>
>>>> </paths>
>>>> <modeKmerCoverage>2</modeKmerCoverage><meanKmerCoverage>3.34305</meanKmerCoverage>
>>>> <totalColoredKmerObservations>382976</totalColoredKmerObservations>
>>>> <proportion>0.022287</proportion>
>>>> <distribution>
>>>> #Coverage Frequency
>>>> 2 62170
>>>> 3 37829
>>>> 4 8136
>>>> 5 3333
>>>> 6 1159
>>>> 7 540
>>>> 8 329
>>>> 9 147
>>>> 10 160
>>>> 11 89
>>>> 12 79
>>>> 13 67
>>>> 14 45
>>>>
>>>>
>>>>> Jim
>>>>>
>>>>> P.S. We love Quebec - spent our honeymoon there (oh so long ago) and
>>>>> camped around the peninsula.
>>>>>
>>>>
>>>> Cool!
>>>>
>>>> I hope I have explained as clearly as possible the behavior you are
>>>> observing. If you need more information,
>>>> please post again on the list.
>>>>
>>>> -Sébastien
>>>>
>>>>>
>>>>>
>>>>> On Thu, Feb 14, 2013 at 12:00 PM, Sébastien Boisvert
>>>>> <[email protected]> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 02/14/2013 11:52 AM, James Vincent wrote:
>>>>>>> Hi Sébastien,
>>>>>>>
>>>>>>> Thanks very much for your quick and detailed reply.
>>>>>>>
>>>>>>> I understand the details of proportion calculations and what they
>>>>>>> are, but that does not square with the output files.
>>>>>>>
>>>>>>> The sum of proportions in the file Terms.tsv, for example, is 55. It
>>>>>>> is not slightly off from 1. In other GO output files the sum of
>>>>>>> proportions is a similarly large number, 50, 60 or more.
>>>>>>
>>>>>> The file Terms.tsv contains all levels of depth in the directed acyclic
>>>>>> graph
>>>>>> of Gene Ontology.
>>>>>>
>>>>>> If you take a particular depth, you should see something near 100%.
>>>>>>
>>>>>> Relevant files:
>>>>>>
>>>>>> $ ls|grep GeneO
>>>>>> 0.Profile.GeneOntologyDomain=biological_process.tsv
>>>>>> 0.Profile.GeneOntologyDomain=cellular_component.tsv
>>>>>> 0.Profile.GeneOntologyDomain=molecular_function.tsv
>>>>>> _GeneOntology
>>>>>>
>>>>>> $ ls _GeneOntology/
>>>>>> biological_process.Depth=0.tsv cellular_component.Depth=4.tsv
>>>>>> molecular_function.Depth=1.tsv molecular_function.Depth=8.tsv
>>>>>> biological_process.Depth=1.tsv cellular_component.Depth=5.tsv
>>>>>> molecular_function.Depth=2.tsv molecular_function.Depth=9.tsv
>>>>>> biological_process.Depth=2.tsv cellular_component.Depth=6.tsv
>>>>>> molecular_function.Depth=3.tsv Terms.tsv
>>>>>> cellular_component.Depth=0.tsv cellular_component.Depth=7.tsv
>>>>>> molecular_function.Depth=4.tsv Terms.xml
>>>>>> cellular_component.Depth=1.tsv cellular_component.Depth=8.tsv
>>>>>> molecular_function.Depth=5.tsv
>>>>>> cellular_component.Depth=2.tsv cellular_component.Depth=9.tsv
>>>>>> molecular_function.Depth=6.tsv
>>>>>> cellular_component.Depth=3.tsv molecular_function.Depth=0.tsv
>>>>>> molecular_function.Depth=7.tsv
>>>>>>
>>>>>>>
>>>>>>> The obvious example is that the first few largest proportion numbers
>>>>>>> add up to more than 2. They are all fractions like 0.5, 0.6 and so on.
>>>>>>> Is there an error in my run or perhaps my interpretation?
>>>>>>>
>>>>>>
>>>>>> Do you see this behavior if you look at a given depth and not at all the
>>>>>> depths at once?
>>>>>>
>>>>>>> Merci,
>>>>>>> Jim
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>> I expected the
>>>>>>>>> sum to be 1.0.
>>>>>>>>
>>>>>>>> Sometimes, it's a little bit more than 1.000 (like 1.00562), sometimes
>>>>>>>> it's a little bit less. This is
>>>>>>>> because the demultiplexing process is not 100% accurate, but in
>>>>>>>> general it is really good.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Feb 14, 2013 at 10:59 AM, Sébastien Boisvert
>>>>>>> <[email protected]> wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> On 02/14/2013 10:13 AM, jjv5 wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I have used ray-meta with -gene-ontology enabled after downloading GO
>>>>>>>>> data using the Main.sh script in the git repo. Everything completed
>>>>>>>>> fine and produced
>>>>>>>>> expected output.
>>>>>>>>>
>>>>>>>>> The result file Terms.tsv under BiologicalAbundances/_GeneOntology
>>>>>>>>> contains proportions for the GO terms encountered. What is this
>>>>>>>>> proportion number based on?
>>>>>>>>
>>>>>>>> For plain genomes (via the -search command), proportions are computed by
>>>>>>>> demultiplexing the signal based on uniquely colored kmers.
>>>>>>>>
>>>>>>>> For taxonomy, the provided taxonomy tree is used to classify each
>>>>>>>> observed kmer
>>>>>>>> at the vertex in the tree where the earliest common ancestor is found.
>>>>>>>>
>>>>>>>> For gene ontology, kmer observations are gathered for each ontology
>>>>>>>> term, and proportions
>>>>>>>> are computed for each depth in the gene ontology directed acyclic
>>>>>>>> graph.
>>>>>>>>
>>>>>>>>> Proportion of what?
>>>>>>>>
>>>>>>>> Of k-mers found in the de Bruijn subgraph that was built from the
>>>>>>>> sequence reads
>>>>>>>> provided to Ray.
>>>>>>>>
>>>>>>>> For example, if you want a number of bacterial cells, you need to
>>>>>>>> further normalize
>>>>>>>> by genome length, and so on.
>>>>>>>>
>>>>>>>>> The sum of the
>>>>>>>>> proportion values in this file is some large integer.
>>>>>>>>
>>>>>>>> In directories in BiologicalAbundances, a file called
>>>>>>>> SequenceAbundances.xml contains
>>>>>>>> numerous counts.
>>>>>>>>
>>>>>>>> These large integers are either a number of k-mers, or a number of
>>>>>>>> k-mer observations.
>>>>>>>> A k-mer observation corresponds to a k-mer occurring 1 time.
>>>>>>>>
>>>>>>>> So for a life form X, its kmer observations are computed as follows:
>>>>>>>>
>>>>>>>> 1. Gather the k-mers that are unique (specific) to this life form X;
>>>>>>>> 2. Compute an average number of observations (depth) for these objects;
>>>>>>>> 3. For life form X, compute the number of matched k-mers in the graph,
>>>>>>>> regardless of whether they are unique (breadth);
>>>>>>>> 4. With the number of matched objects (#3.) and average depth (#2.), the
>>>>>>>> demultiplexed number of k-mer observations is calculated.
>>>>>>>>
>>>>>>>>> I expected the
>>>>>>>>> sum to be 1.0.
>>>>>>>>
>>>>>>>> Sometimes, it's a little bit more than 1.000 (like 1.00562), sometimes
>>>>>>>> it's a little bit less. This is
>>>>>>>> because the demultiplexing process is not 100% accurate, but in
>>>>>>>> general it is really good.
>>>>>>>>
>>>>>>>> See http://genomebiology.com/2012/13/12/R122/abstract
>>>>>>>>
>>>>>>>>> Is there further documentation somewhere?
>>>>>>>>
>>>>>>>> The documentation lives mainly in
>>>>>>>> https://github.com/sebhtml/ray/tree/master/Documentation
>>>>>>>>
>>>>>>>> For what you are doing, these are relevant:
>>>>>>>>
>>>>>>>> *
>>>>>>>> https://github.com/sebhtml/ray/blob/master/Documentation/BiologicalAbundances.txt
>>>>>>>> *
>>>>>>>> https://github.com/sebhtml/ray/blob/master/Documentation/NCBI-Taxonomy.txt
>>>>>>>> *
>>>>>>>> https://github.com/sebhtml/ray/blob/master/Documentation/GeneOntology.txt
>>>>>>>> * https://github.com/sebhtml/ray/blob/master/Documentation/Taxonomy.txt
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Jim
>>>>>>>>>
>>>>>>>>> P.S. Thanks for making ray available. We like it a great deal.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> It's nice to hear what our end users like (and what they don't like
>>>>>>>> too!).
>>>>>>>>
>>>>>>>>
>>>>>>>> There is a ticket in progress to further increase the accuracy of Ray
>>>>>>>> Communities (
>>>>>>>> the solution that tells you what's in your sample) using topology.
>>>>>>>>
>>>>>>>> https://github.com/sebhtml/ray/issues/133
>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Denovoassembler-users mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
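
A toy illustration of the point discussed in this thread, for anyone who wants
to see why Gene Ontology proportions can sum to more than 1. This is only a
minimal sketch in Python with made-up k-mers and coverage depths; it is not
Ray's code, and the names KMERS and term_proportions are hypothetical. It
assumes that a term's observations are the summed coverage depths of the k-mers
attached to it, and that a proportion is that count divided by the total number
of k-mer observations.

```python
from collections import defaultdict

# Hypothetical toy data: k-mer -> (coverage depth, GO terms it is attached to).
# A k-mer may be attached to several terms, including a term and its parent.
KMERS = {
    "AAAC": (2, {"molecular_function", "binding"}),
    "AACG": (3, {"binding", "term_x"}),
    "ACGT": (5, {"term_x"}),
}


def term_proportions(kmers):
    """Return term -> (k-mer observations for the term / total observations)."""
    # Each k-mer's depth is counted once in the total ...
    total = sum(depth for depth, _ in kmers.values())
    observations = defaultdict(int)
    # ... but once per attached term in the per-term counts.
    for depth, terms in kmers.values():
        for term in terms:
            observations[term] += depth
    return {term: count / total for term, count in observations.items()}


proportions = term_proportions(KMERS)
proportion_sum = sum(proportions.values())

# Ratio quoted in the thread for GO:0016462 in one depth file:
# 352616 observations out of a total of 75059.
recursive_ratio = 352616 / 75059

for term in sorted(proportions):
    print(term, round(proportions[term], 3))
print("sum of proportions:", round(proportion_sum, 3))
print("recursive ratio:", round(recursive_ratio, 5))
```

With these made-up numbers the total is 10 observations, yet the per-term
counts are 2, 5 and 8, so the proportions sum to about 1.5. Each proportion on
its own stays at or below 1; only the sum across terms is inflated. Attaching
every k-mer to exactly one term would bring the sum back to 1, at the cost of
discarding the other annotations.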
