Re: [Denovoassembler-users] Question GO terms

Sébastien Boisvert Wed, 18 Dec 2013 14:01:43 -0800

On 18/12/13 04:11 PM, JC Grenier wrote:
> Thanks for your answer Sébastien.


Hi Jean-Christophe,

>
> I used the Terms.tsv to do this.

Your file name is Matrix_allSamples.microbiome.cellComp.depth4-2.txt, so I
thought you used the depth files.

> I extracted all my information from the individual sample results and
> rebuilt a file afterwards.
> However, the pattern that I'm talking about is that the categories seem to
> behave the same way within each sample. Don't you find it strange
> that all the categories in the middle of the graph seem to be exactly the
> same, even the outlier points? It's as if there were some kind of bias in the
> enrichments.
>
> In the example that I sent you, I can clearly see that all those points at 
> 0.12 for example are from the same sample.
> Here are some of the data points :
>
> For Sample EUR
> Golgi_apparatus : 0.122954
> endoplasmic_reticulum : 0.121175
> tight_junction : 0.12086
>
> For Sample RAN
> Golgi_apparatus : 0.185641
> endoplasmic_reticulum : 0.18288
> tight_junction :  0.182007
>

I see the pattern now.

In the EMBL_CDS sequence data, a given sequence can have many GO terms. So
maybe what is going on here is that you are seeing GO terms that correlate
simply because their supporting data come from the k-mers of the same
EMBL_CDS sequences.
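
To make that hypothesis concrete, here is a minimal sketch with made-up
numbers. The sequence names, k-mer counts, and the helper term_proportions
are all hypothetical and are not taken from Ray or from your data. It only
shows that GO terms annotated on the same sequences end up with nearly
identical proportions within a sample.

```python
from collections import defaultdict

# Hypothetical annotations: each sequence carries several GO terms.
ANNOTATIONS = {
    "seqA": ["Golgi_apparatus", "endoplasmic_reticulum", "tight_junction"],
    "seqB": ["Golgi_apparatus", "endoplasmic_reticulum"],
    "seqC": ["ribosome"],
}

def term_proportions(kmer_observations, annotations=ANNOTATIONS):
    """Credit each sequence's k-mer observations to all of its GO terms."""
    total = sum(kmer_observations.values())
    per_term = defaultdict(int)
    for sequence, count in kmer_observations.items():
        for term in annotations[sequence]:
            per_term[term] += count
    return {term: count / total for term, count in per_term.items()}

# Made-up k-mer observation counts for two samples.
sample_1 = {"seqA": 1200, "seqB": 10, "seqC": 8790}
sample_2 = {"seqA": 1800, "seqB": 30, "seqC": 8170}

for label, sample in (("sample_1", sample_1), ("sample_2", sample_2)):
    props = term_proportions(sample)
    print(label, {term: round(value, 4) for term, value in props.items()})
```

In this toy setup, the three terms that share seqA move together from one
sample to the other, because almost all of their support comes from that one
sequence.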

>
> What does this proportion column mean anyway?

The proportions are computed in terms of k-mer observations divided by the
total number of k-mer observations.

It is the quotient of <totalColoredKmerObservations> (inside
<geneOntologyTerm> in Terms.xml) by <totalColoredKmerObservations> (at the
top of Terms.xml).
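
As a rough sketch, that quotient could be recomputed as shown below. Only the
two element names <geneOntologyTerm> and <totalColoredKmerObservations> come
from the description above. The <root> wrapper, the <name> element, the
numbers, and the function name are assumptions made for illustration, so the
paths would need adjusting to the real layout of Terms.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for Terms.xml (layout is an assumption).
SAMPLE_XML = """<root>
  <totalColoredKmerObservations>10000</totalColoredKmerObservations>
  <geneOntologyTerm>
    <name>Golgi_apparatus</name>
    <totalColoredKmerObservations>1210</totalColoredKmerObservations>
  </geneOntologyTerm>
  <geneOntologyTerm>
    <name>tight_junction</name>
    <totalColoredKmerObservations>1200</totalColoredKmerObservations>
  </geneOntologyTerm>
</root>"""

def proportions_from_terms_xml(xml_text):
    """Divide each term's k-mer observations by the file-level total."""
    root = ET.fromstring(xml_text)
    # findtext() only looks at direct children, so this is the top-level total.
    total = int(root.findtext("totalColoredKmerObservations"))
    proportions = {}
    for term in root.iter("geneOntologyTerm"):
        name = term.findtext("name")  # assumed element name
        count = int(term.findtext("totalColoredKmerObservations"))
        proportions[name] = count / total
    return proportions

print(proportions_from_terms_xml(SAMPLE_XML))
```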


I don't think it is necessary to normalize here since they are
relative numbers.


> For the RAN sample, this column sums up to 116.054,

The sum will go beyond 100% because a given k-mer can contribute to many GO
terms. Furthermore, we don't go for a lowest common ancestor because
(1) annotated GO terms are not always leaves and (2) a GO term can have many
parents.
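
A small numeric sketch of the first point, with invented counts and term
identifiers (nothing here comes from Ray's code or output): when each k-mer
observation is credited to every GO term it supports, the per-term
proportions add up to the average number of terms per observation, not to 1.

```python
def proportion_sum(observations):
    """observations: list of (kmer_observation_count, [go_terms]) pairs.

    Returns the sum over all terms of (term count / total observations).
    """
    total = sum(count for count, _ in observations)
    per_term = {}
    for count, terms in observations:
        for term in terms:
            # The same observations are credited to every annotated term.
            per_term[term] = per_term.get(term, 0) + count
    return sum(per_term.values()) / total

# Hypothetical sample: 100 observations spread over three groups of k-mers.
toy = [
    (50, ["GO:A", "GO:B"]),          # counted twice
    (30, ["GO:A"]),                  # counted once
    (20, ["GO:C", "GO:D", "GO:E"]),  # counted three times
]

print(proportion_sum(toy))  # well above 1 for this toy input
```

Under the same toy model, the total therefore differs between samples
whenever the mix of annotated sequences differs, so two samples need not sum
to the same value.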

> while for Sample EUR, it sums up to 79.25 for all the terms. Am I mixing
> too many things together by making a graph from this?

I think it is a good idea to compare things visually.

>
> Thanks a lot.
>
>
> 2013/12/18 Sébastien Boisvert <[email protected]>
>
>     On 29/11/13 04:31 PM, JC Grenier wrote:
>
>         Hi Sébastien,
>
>
>     Hi,
>
>
>
>         I'm communicating directly with you because I have a figure to show
>         you concerning some analysis that I'm doing with GO terms. I looked
>         into the different depths just to see if I could see some difference
>         between the two groups that I'm working on, and found some weird stuff.
>
>
>     You should use Terms.tsv or Terms.xml because the files for specific
>     depths are not accurate, mainly for 2 reasons:
>
>     1. EMBL_CDS sequences are annotated with any GO term, not just with
>        leaves in the GO classification.
>     2. The GO classification is a directed acyclic graph, so a GO term can
>        have many parents.
>
>     The depth files use recursive counts, whereas Terms.xml/Terms.tsv don't.
>
>     In our paper, we used the most abundant terms from Terms.xml regardless
>     of depth.
>        ( http://genomebiology.com/2012/13/12/R122/figure/F5 )
>
>     This was discussed before on https://github.com/sebhtml/ray/issues/158
>     and on
>     http://permalink.gmane.org/gmane.science.biology.ray-genome-assembler/406
>
>
>     P.S. the files for depths will likely be removed at some point (see
>     #158).
>
>
>         I took my results from the file Terms.tsv.
>
>         Here's one example. Can you tell me why this pattern exists?
>
>
>     I don't understand what you expect me to see here.
>
>     Perhaps it would be wise to draw 1 colored line per sample.
>
>
>         Are all the aligned dots (first of all, why are they aligned?) from
>         the same sample? Is this a coverage or number-of-reads artifact?
>
>
>     The proportions are computed in terms of k-mer observations divided by
>     the total number of k-mer observations. So I don't think it is necessary
>     to normalize here since they are relative numbers.
>
>
>         Thanks a lot for your help
>
>         --
>         Jean-Christophe Grenier, M.Sc.
>
>         -----------------------------------------
>         Bio-informaticien
>         Laboratoire de Philip Awadalla
>         Laboratoire de Luis Barreiro
>         CHU Sainte-Justine
>         3175, Côte Sainte-Catherine, local B-607
>         Tél : 514-345-4931 poste 5199
>         -----------------------------------------
>
>
>
>
>


