Re: [Denovoassembler-users] Biological Abundance tests

Sébastien Boisvert Fri, 06 Dec 2013 07:29:11 -0800

On 26/11/13 04:18 AM, Francesco Strozzi wrote:
> Hi all,
> I would like to raise a question about the Biological Abundances
> estimation.
> How can we calculate the proportion of information that is not classified by
> Ray Meta, i.e. the part of the assembled information that is not colored?
> Is it possible to derive that information from the ratio between
> totalAssembledKmerObservations and totalAssembledColoredKmerObservations
> in that XML file?


Yes, that sounds like a good idea.
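
A minimal sketch of that calculation, using the two totals quoted further down in this thread. It assumes the uncolored share is defined as 1 - colored/assembled, counted in k-mer observations; the <root> wrapper and the helper names are made up for the example and are not part of Ray's output.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated stand-in for the header of SequenceAbundances.xml.
# The <root> wrapper is an assumption; the two values are the ones quoted in this thread.
SAMPLE = """<root>
<totalAssembledKmerObservations>737864381</totalAssembledKmerObservations>
<totalAssembledColoredKmerObservations>365720420</totalAssembledColoredKmerObservations>
</root>"""


def first_int(root, tag):
    """Return the integer text of the first element named `tag`, at any depth."""
    element = next(root.iter(tag))
    return int(element.text)


def uncolored_fraction(root):
    """Share of assembled k-mer observations that carry no color (assumed definition)."""
    assembled = first_int(root, "totalAssembledKmerObservations")
    colored = first_int(root, "totalAssembledColoredKmerObservations")
    return 1.0 - colored / assembled


root = ET.fromstring(SAMPLE)
fraction = uncolored_fraction(root)
print(f"uncolored share of assembled k-mer observations: {fraction:.4f}")
```

For a real run, ET.parse() on the SequenceAbundances.xml file should work the same way, since the lookup does not depend on how deeply the tags are nested.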

> Does this information also hold for the other taxonomic levels (i.e. genus
> etc.), or is it mainly related to the species-level classification?
>

The number in the tag <totalAssembledColoredKmerObservations> is the
number of k-mer observations (coverage depth summed over k-mers) for k-mers
that are both assembled and colored (by sequences provided with -search).


> Thanks and regards
> Francesco
>
>
> On Fri, Nov 15, 2013 at 5:24 PM, Sébastien Boisvert
> <sebastien.boisver...@ulaval.ca> wrote:
>
>     On 12/11/13 04:18 PM, JC Grenier wrote:
>      > Hello again,
>
>     Hi,
>
>      >
>      > I'm now working on the biological abundance analysis part and have some
>      > questions about the outputs that I'm getting. I'm using the
>      > NCBI-Finished-Bacterial-Genomes definitions (NCBI-Taxonomy) that you
>      > suggest in your manual.
>      >
>      > My question is how these final result files are produced:
>      >
>      > OUTPUT/BiologicalAbundances/0.Profile.NCBI-Finished-Bacterial-Genomes.tsv
>
>       From the paper ( http://genomebiology.com/2012/13/12/R122#sec5 ):
>
>     Demultiplexing signals from similar bacterial strains
>
>     Biological abundances were estimated using the product of the number of
>     k-mers matched in the distributed de Bruijn graph by the mode coverage of
>     k-mers that were uniquely colored. This number is called the number of k-mer
>     observations. The total number of k-mer observations is the sum of coverage
>     depth values of all colored k-mers. A proportion is calculated by dividing
>     the number of k-mer observations by the total.
>
>
>     These proportions are k-mer proportions, not cell proportions.
>
>      > and
>      > OUTPUT/BiologicalAbundances/0.Profile.TaxonomyRank=species.tsv
>
>       From the paper ( http://genomebiology.com/2012/13/12/R122#sec5 ):
>
>     Taxonomic profiling
>
>     All bacterial genomes available in GenBank [47] were utilized for
>     coloring the distributed de Bruijn graphs (Table S4 in Additional file 1).
>     Each k-mer was assigned to a taxon in the taxonomic tree. When a k-mer has
>     more than one taxon color, the coverage depth was assigned to the nearest
>     common ancestor.
>
>     Same here, these proportions are k-mer proportions, not cell proportions.
>
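
To make the "nearest common ancestor" rule concrete, here is a toy sketch. The taxonomy, the k-mers and the function names are all invented for illustration; this is not Ray's implementation, only the rule as described in the paper excerpt above.

```python
# Hypothetical toy taxonomy: child -> parent. Names are illustrative only.
PARENT = {
    "Streptococcus pneumoniae": "Streptococcus",
    "Streptococcus mitis": "Streptococcus",
    "Streptococcus": "Streptococcaceae",
    "Lactococcus lactis": "Lactococcus",
    "Lactococcus": "Streptococcaceae",
    "Streptococcaceae": "root",
}


def lineage(taxon):
    """Path from a taxon up to the root, inclusive."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def nearest_common_ancestor(taxa):
    """Deepest taxon shared by the lineages of all given taxa."""
    taxa = list(taxa)
    common = lineage(taxa[0])
    for taxon in taxa[1:]:
        other = set(lineage(taxon))
        common = [t for t in common if t in other]
    return common[0]


def assign_depths(kmers):
    """Sum coverage depth per taxon; kmers is a list of (taxon colors, depth)."""
    totals = {}
    for colors, depth in kmers:
        target = nearest_common_ancestor(colors)
        totals[target] = totals.get(target, 0) + depth
    return totals


kmers = [
    ({"Streptococcus pneumoniae"}, 17),                         # one color: stays at the species
    ({"Streptococcus pneumoniae", "Streptococcus mitis"}, 20),  # two species, same genus
    ({"Streptococcus mitis", "Lactococcus lactis"}, 5),         # different genera, same family
]
print(assign_depths(kmers))
```

A k-mer shared by two species of the same genus contributes its depth to the genus, so it never inflates either species.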
>
>      >
>      > When I choose my two different "minimumContigLengths" parameters of 100
>      > and 500, the 0.Profile.TaxonomyRank=species.tsv files are the same in both
>      > analyses, but the other files, 0.Profile.NCBI-Finished-Bacterial-Genomes.tsv,
>      > aren't...
>
>     Example of how these are calculated:
>
>     If I take sample SRS015799 from the Human Microbiome Project (buccal mucosa),
>     I can see some S. pneumoniae ATCC 700669 in 0.Profile.Bacteria-Genomes.tsv:
>
>     $ grep 700669 SRS015799-Ray-HMP.20/Assembly/BiologicalAbundances/0.Profile.Bacteria-Genomes.tsv
>     Streptococcus_pneumoniae_ATCC_700669_uid59287   0.0339856
>
>
>     In the XML file
>     SRS015799-Ray-HMP.20/Assembly/BiologicalAbundances/Bacteria-Genomes/SequenceAbundances.xml,
>     there is more information (best viewed in a web browser):
>
>     <entry><file>Streptococcus_pneumoniae_ATCC_700669_uid59287</file>
>     <sequence>0</sequence><name>gi|221230948|ref|NC_011900.1| Streptococcus pneumoniae ATCC 7006</name>
>     <kmerLength>31</kmerLength><lengthInKmers>2221285</lengthInKmers>
>     <raw><kmerMatches>856331</kmerMatches><proportion>0.385512</proportion><modeKmerCoverage>2</modeKmerCoverage></raw>
>     <uniquelyColored><kmerMatches>10422</kmerMatches><proportion>0.00469188</proportion><modeKmerCoverage>17</modeKmerCoverage></uniquelyColored>
>     <assembled><kmerMatches>574340</kmerMatches><proportion>0.258562</proportion><modeKmerCoverage>19</modeKmerCoverage></assembled>
>     <uniquelyColoredAndAssembled><kmerMatches>6801</kmerMatches><proportion>0.00306174</proportion><modeKmerCoverage>17</modeKmerCoverage></uniquelyColoredAndAssembled>
>     <qualityControl><correlationColoredVsRaw>0.87672</correlationColoredVsRaw><correlationAssembledVsRaw>0.56542</correlationAssembledVsRaw><correlationAssembledVsColored>0.84475</correlationAssembledVsColored><hasPeak>1</hasPeak><hasHighFrequency>0</hasHighFrequency></qualityControl>
>     <demultiplexedKmerObservations>14557627</demultiplexedKmerObservations>
>     <proportion>0.0339856</proportion>
>     </entry>
>
>     At the top of this XML file, there are these numbers:
>
>     <totalAssembledKmerObservations>737864381</totalAssembledKmerObservations>
>     <totalAssembledKmers>30964275</totalAssembledKmers>
>     <totalColoredKmerObservations>428347476</totalColoredKmerObservations>
>     <totalColoredKmers>19288935</totalColoredKmers>
>     <totalColoredKmerObservation_EMBL_CDS>185161995</totalColoredKmerObservation_EMBL_CDS>
>     <totalAssembledColoredKmerObservations>365720420</totalAssembledColoredKmerObservations>
>     <totalAssembledColoredKmers>11583088</totalAssembledColoredKmers>
>
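>     The proportion 3.39% (from the .tsv file) is the result of the division
>     of <demultiplexedKmerObservations> (14557627) by
>     <totalColoredKmerObservations> (which is 428347476).
>
>     14557627.0/428347476 => 0.0339856 (rounded)
>
>     Dividing by <totalAssembledColoredKmerObservations> (365720420) instead
>     gives 0.03980534365568102, which does not match the .tsv value.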
>
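
The arithmetic can be re-checked with the values quoted from the XML. The sketch below is only a consistency check on these numbers, not a description of Ray's source code, and the variable names are mine. It assumes that the "number of k-mers matched" in the paper's wording is the raw match count, and it tries both colored totals as the denominator so the quotients can be compared with the 0.0339856 reported in the .tsv file.

```python
# Values copied from the XML excerpts quoted above.
raw_kmer_matches = 856331             # <raw><kmerMatches>
uniquely_colored_mode = 17            # <uniquelyColored><modeKmerCoverage>
demultiplexed = 14557627              # <demultiplexedKmerObservations>
total_colored = 428347476             # <totalColoredKmerObservations>
total_assembled_colored = 365720420   # <totalAssembledColoredKmerObservations>
reported = 0.0339856                  # <proportion>, also the .tsv value

# Assumption: k-mer observations = raw matches x mode coverage of uniquely colored k-mers.
product = raw_kmer_matches * uniquely_colored_mode
print("product equals demultiplexedKmerObservations:", product == demultiplexed)

# Try each candidate total as the denominator of the proportion.
candidates = [
    ("totalColoredKmerObservations", total_colored),
    ("totalAssembledColoredKmerObservations", total_assembled_colored),
]
for label, total in candidates:
    print(f"{label}: {demultiplexed / total:.7f} (reported: {reported})")
```

With these numbers the product of the raw match count and the mode coverage of 17 equals <demultiplexedKmerObservations> exactly. Both <uniquelyColored> and <uniquelyColoredAndAssembled> report a mode of 17 here, so this entry alone cannot tell which of the two is used.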
>
>     By changing the minimum contig length to 500, you are not assembling a
>     sizable part of the graph, because this parameter is also used as a
>     filter for the minimum seed length.
>
>
>     In the next release, 2.3.1, the parameter -minimum-contig-length won't
>     affect seed selection, and an additional parameter called
>     -minimum-seed-length will be added.
>
>      >
>      > What information is combined to form that particular file, and
>      > why are my two other files exactly the same?
>      >
>
>     The taxonomy algorithm has no dependency on the number of assembled kmers
>     because it only uses colored kmers in the leaves
>     of the taxonomy tree (where the Last Common Ancestor or LCA is used).
>
>     The abundance estimation for genome sequences (via -search) needs the
>     total number of assembled kmers because
>     of the demultiplexing process (see paper link above).
>
>
>      > Thanks!
>
>     Thanks for this very good question.
>
>     I hope I answered appropriately.
>
>      >
>      > --
>      > Jean-Christophe Grenier, M.Sc.
>      >
>
>
>             seb
>
>      > -----------------------------------------
>      > Bio-informaticien
>      > Laboratoire de Philip Awadalla
>      > Laboratoire de Luis Barreiro
>      > CHU Sainte-Justine
>      > 3175, Côte Sainte-Catherine, local B-607
>      > Tél : 514-345-4931 poste 5199
>      > -----------------------------------------
>
>
>     
>     _______________________________________________
>     Denovoassembler-users mailing list
>     Denovoassembler-users@lists.sourceforge.net
>     https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
>
>
>
>
> --
>
> Francesco Strozzi

