Re: [Denovoassembler-users] Biological Abundance tests

Sébastien Boisvert Fri, 15 Nov 2013 09:26:07 -0800

On 12/11/13 04:18 PM, JC Grenier wrote:
> Hello again,

Hi,


>
> I'm now working on the biological abundance analysis part and got some 
> questions about the outputs that I'm getting. I'm using the
> NCBI-Finished-Bacterial-Genomes definitions (NCBI-Taxonomy) that you are
> suggesting in your manual.
>
> My question is how do you make these final result files :
>
> OUTPUT/BiologicalAbundances/0.Profile.NCBI-Finished-Bacterial-Genomes.tsv

 From the paper ( http://genomebiology.com/2012/13/12/R122#sec5 ):

Demultiplexing signals from similar bacterial strains

Biological abundances were estimated using the product of the number of k-mers 
matched in the distributed de Bruijn graph by the mode coverage of k-mers that 
were uniquely colored. This number is called the number of k-mer observations. 
The total number of k-mer observations is the sum of coverage depth values of 
all colored k-mers. A proportion is calculated by dividing the number of k-mer 
observations by the total.


These proportions are k-mer proportions, not cell proportions.

> and
> OUTPUT/BiologicalAbundances/0.Profile.TaxonomyRank=species.tsv

 From the paper ( http://genomebiology.com/2012/13/12/R122#sec5 ):

Taxonomic profiling

All bacterial genomes available in GenBank [47] were utilized for coloring the 
distributed de Bruijn graphs (Table S4 in Additional file 1). Each k-mer was 
assigned to a taxon in the taxonomic tree. When a k-mer has more than one taxon 
color, the coverage depth was assigned to the nearest common ancestor.

Same here, these proportions are k-mer proportions, not cell proportions.
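
To make the nearest-common-ancestor rule concrete, here is a small Python sketch. It is not Ray's code: the toy taxonomy, the coverage depths and the function names are all made up for illustration, and turning the per-taxon sums into proportions is my reading of the paper text quoted above.

```python
from collections import defaultdict

# Toy taxonomy (child -> parent); a hypothetical subset, for illustration only.
PARENT = {
    "Streptococcus pneumoniae": "Streptococcus",
    "Streptococcus mitis": "Streptococcus",
    "Streptococcus": "Bacteria",
    "Escherichia coli": "Escherichia",
    "Escherichia": "Bacteria",
    "Bacteria": None,
}

def path_to_root(taxon):
    """Return the list of taxa from `taxon` up to the root, inclusive."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path

def nearest_common_ancestor(taxa):
    """Return the deepest taxon that is an ancestor of (or equal to) every input."""
    taxa = list(taxa)
    common = path_to_root(taxa[0])  # ordered from leaf to root
    for taxon in taxa[1:]:
        ancestors = set(path_to_root(taxon))
        common = [node for node in common if node in ancestors]
    return common[0]

def profile(kmers):
    """Sum coverage depth per assigned taxon, then divide by the total."""
    observations = defaultdict(int)
    for colors, depth in kmers:
        observations[nearest_common_ancestor(colors)] += depth
    total = sum(observations.values())
    return {taxon: count / total for taxon, count in observations.items()}

# Each k-mer is (set of taxon colors, coverage depth); the values are invented.
kmers = [
    ({"Streptococcus pneumoniae"}, 17),
    ({"Streptococcus pneumoniae", "Streptococcus mitis"}, 20),
    ({"Streptococcus mitis", "Escherichia coli"}, 3),
]
print(profile(kmers))
```

A k-mer shared by two Streptococcus species is counted at the genus, and one shared across genera moves further up the tree, so nothing in this calculation looks at contigs.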


>
> When I choose my 2 different "minimumContigLengths" parameters of 100 and
> 500, the 0.Profile.TaxonomyRank=species.tsv files are the same in both
> analyses
> but the other files, 0.Profile.NCBI-Finished-Bacterial-Genomes.tsv, aren't...

Example of how these are calculated:

If I take sample SRS015799 from the Human Microbiome Project (buccal mucosa),
I can see some S. pneumoniae ATCC 700669 in Bacterial-Genomes.tsv:

$ grep 700669 SRS015799-Ray-HMP.20/Assembly/BiologicalAbundances/0.Profile.Bacteria-Genomes.tsv
Streptococcus_pneumoniae_ATCC_700669_uid59287   0.0339856


The XML file
SRS015799-Ray-HMP.20/Assembly/BiologicalAbundances/Bacteria-Genomes/SequenceAbundances.xml
contains more information (best viewed in a web browser):

<entry><file>Streptococcus_pneumoniae_ATCC_700669_uid59287</file>
<sequence>0</sequence><name>gi|221230948|ref|NC_011900.1| Streptococcus pneumoniae ATCC 7006</name>
<kmerLength>31</kmerLength><lengthInKmers>2221285</lengthInKmers>
<raw><kmerMatches>856331</kmerMatches><proportion>0.385512</proportion><modeKmerCoverage>2</modeKmerCoverage></raw>
<uniquelyColored><kmerMatches>10422</kmerMatches><proportion>0.00469188</proportion><modeKmerCoverage>17</modeKmerCoverage></uniquelyColored>
<assembled><kmerMatches>574340</kmerMatches><proportion>0.258562</proportion><modeKmerCoverage>19</modeKmerCoverage></assembled>
<uniquelyColoredAndAssembled><kmerMatches>6801</kmerMatches><proportion>0.00306174</proportion><modeKmerCoverage>17</modeKmerCoverage></uniquelyColoredAndAssembled>
<qualityControl><correlationColoredVsRaw>0.87672</correlationColoredVsRaw><correlationAssembledVsRaw>0.56542</correlationAssembledVsRaw><correlationAssembledVsColored>0.84475</correlationAssembledVsColored><hasPeak>1</hasPeak><hasHighFrequency>0</hasHighFrequency></qualityControl>
<demultiplexedKmerObservations>14557627</demultiplexedKmerObservations>
<proportion>0.0339856</proportion>
</entry>
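
The fields of such an entry can be read with Python's standard XML parser. This is only a sketch: ENTRY_XML is an abbreviated copy of the entry above (I do not reproduce the rest of the file), and the helper name is mine. That the raw <kmerMatches> times the uniquely colored <modeKmerCoverage> gives <demultiplexedKmerObservations> is something read off these numbers and the paper text, not off the Ray source code.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the <entry> shown above (only the fields used here).
ENTRY_XML = """
<entry><file>Streptococcus_pneumoniae_ATCC_700669_uid59287</file>
<raw><kmerMatches>856331</kmerMatches><modeKmerCoverage>2</modeKmerCoverage></raw>
<uniquelyColored><kmerMatches>10422</kmerMatches><modeKmerCoverage>17</modeKmerCoverage></uniquelyColored>
<demultiplexedKmerObservations>14557627</demultiplexedKmerObservations>
<proportion>0.0339856</proportion>
</entry>
"""

def read_entry(xml_text):
    """Extract the numbers needed to re-derive the k-mer observations."""
    entry = ET.fromstring(xml_text)
    return {
        "file": entry.findtext("file"),
        "matched_kmers": int(entry.findtext("raw/kmerMatches")),
        "unique_mode": int(entry.findtext("uniquelyColored/modeKmerCoverage")),
        "demultiplexed": int(entry.findtext("demultiplexedKmerObservations")),
        "proportion": float(entry.findtext("proportion")),
    }

values = read_entry(ENTRY_XML)

# Matched k-mers multiplied by the mode coverage of uniquely colored k-mers.
product = values["matched_kmers"] * values["unique_mode"]
print(product, values["demultiplexed"])
```

For this entry the product and the reported <demultiplexedKmerObservations> are the same number.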

At the top of this XML file, there are these numbers:

<totalAssembledKmerObservations>737864381</totalAssembledKmerObservations>
<totalAssembledKmers>30964275</totalAssembledKmers>
<totalColoredKmerObservations>428347476</totalColoredKmerObservations>
<totalColoredKmers>19288935</totalColoredKmers>
<totalColoredKmerObservation_EMBL_CDS>185161995</totalColoredKmerObservation_EMBL_CDS>
<totalAssembledColoredKmerObservations>365720420</totalAssembledColoredKmerObservations>
<totalAssembledColoredKmers>11583088</totalAssembledColoredKmers>

The proportion 3.39% (from the .tsv file) is the result of the division of 
<demultiplexedKmerObservations> (14557627) by
<totalColoredKmerObservations> (which is 428347476).

14557627.0/428347476 => 0.0339856 (rounded)
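
A quick way to check which total reproduces the value in the .tsv file is to try the division in Python with the numbers quoted above. The variable names are mine; nothing here comes from the Ray code itself.

```python
demultiplexed = 14557627              # <demultiplexedKmerObservations>
total_colored = 428347476             # <totalColoredKmerObservations>
total_assembled_colored = 365720420   # <totalAssembledColoredKmerObservations>
reported = 0.0339856                  # value in the .tsv file

ratio_colored = demultiplexed / total_colored
ratio_assembled_colored = demultiplexed / total_assembled_colored

# Compare each candidate ratio with the reported proportion.
for label, ratio in [
    ("totalColoredKmerObservations", ratio_colored),
    ("totalAssembledColoredKmerObservations", ratio_assembled_colored),
]:
    matches = abs(ratio - reported) < 5e-7
    print(f"{label}: {ratio:.7f} matches reported value: {matches}")
```

With these inputs only the division by <totalColoredKmerObservations> agrees with the 0.0339856 in the .tsv file.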


By changing the minimum contig length to 500, you are not assembling a sizable
part of the graph because this parameter is also used as a filter for the
minimum seed length.


In the next release (2.3.1), the parameter -minimum-contig-length will no
longer affect seed selection, and an additional parameter called
-minimum-seed-length will be added.

>
> What information is combined in order to form that particular file and why
> are my two other files exactly the same?
>

The taxonomy algorithm does not depend on the number of assembled k-mers
because it only uses colored k-mers in the leaves of the taxonomy tree
(where the Last Common Ancestor or LCA is used).

The abundance estimation for genome sequences (via -search) needs the total
number of assembled k-mers because of the demultiplexing process (see the
paper link above).


> Thanks!

Thanks for this very good question.

I hope I answered appropriately.

>
> --
> Jean-Christophe Grenier, M.Sc.
>


       seb

> -----------------------------------------
> /Bio-informaticien/
> /Laboratoire de Philip Awadalla/
> /Laboratoire de Luis Barreiro/
> /CHU Sainte-Justine/
> //3175, Côte Sainte-Catherine, local B-607
> ///Tél : 514-345-4931 poste 5199/
> -----------------------------------------


