Re: [Denovoassembler-users] Biological Abundance tests

Francesco Strozzi Tue, 26 Nov 2013 01:18:49 -0800

Hi all,
I would like to raise a question about this point of the Biological
Abundances estimation.
How can we calculate the proportion of information that is not classified
by Ray Meta, i.e. the part of the assembled information that is not
colored? Is it possible to derive it from the ratio between
totalAssembledKmerObservations and totalAssembledColoredKmerObservations
in that XML file?
Does this also hold for the other taxonomic levels (e.g. genus), or does
it mainly relate to the species-level classification?
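
One possible reading of this question, not confirmed anywhere in this thread, can be sketched in Python. It assumes that the uncolored share of the assembled information is one minus the ratio of totalAssembledColoredKmerObservations to totalAssembledKmerObservations, and it uses the two totals from the SequenceAbundances.xml excerpt quoted further down. The function name is hypothetical.

```python
# Sketch only: assumes "uncolored" means assembled k-mer observations that
# carry no color. This interpretation is an assumption, not a definition
# taken from the Ray Meta documentation.

def uncolored_fraction(total_assembled_obs: int,
                       total_assembled_colored_obs: int) -> float:
    """Fraction of assembled k-mer observations that are not colored."""
    if total_assembled_obs <= 0:
        raise ValueError("total_assembled_obs must be positive")
    return 1.0 - total_assembled_colored_obs / total_assembled_obs


# Totals copied from the XML excerpt quoted in the message below.
total_assembled_obs = 737864381          # totalAssembledKmerObservations
total_assembled_colored_obs = 365720420  # totalAssembledColoredKmerObservations

fraction = uncolored_fraction(total_assembled_obs, total_assembled_colored_obs)
print(f"uncolored fraction of assembled k-mer observations: {fraction:.4f}")
```

Under that assumption, roughly half of the assembled k-mer observations in this sample would be uncolored. Whether the same ratio is meaningful at other taxonomic ranks is exactly the open question.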


Thanks and regards
Francesco


On Fri, Nov 15, 2013 at 5:24 PM, Sébastien Boisvert <
sebastien.boisver...@ulaval.ca> wrote:

> On 12/11/13 04:18 PM, JC Grenier wrote:
> > Hello again,
>
> Hi,
>
> >
> > I'm now working on the biological abundance analysis part and have some
> > questions about the outputs that I'm getting. I'm using the
> > NCBI-Finished-Bacterial-Genomes definitions (NCBI-Taxonomy) that you
> > suggest in your manual.
> >
> > My question is how do you make these final result files :
> >
> > OUTPUT/BiologicalAbundances/0.Profile.NCBI-Finished-Bacterial-Genomes.tsv
>
> From the paper ( http://genomebiology.com/2012/13/12/R122#sec5 ):
>
> Demultiplexing signals from similar bacterial strains
>
> Biological abundances were estimated using the product of the number of
> k-mers matched in the distributed de Bruijn graph by the mode coverage of
> k-mers that were uniquely colored. This number is called the number of
> k-mer observations. The total number of k-mer observations is the sum of
> coverage depth values of all colored k-mers. A proportion is calculated by
> dividing the number of k-mer observations by the total.
>
>
> These proportions are k-mer proportions, not cell proportions.
>
> > and
> > OUTPUT/BiologicalAbundances/0.Profile.TaxonomyRank=species.tsv
>
> From the paper ( http://genomebiology.com/2012/13/12/R122#sec5 ):
>
> Taxonomic profiling
>
> All bacterial genomes available in GenBank [47] were utilized for coloring
> the distributed de Bruijn graphs (Table S4 in Additional file 1). Each
> k-mer was assigned to a taxon in the taxonomic tree. When a k-mer has more
> than one taxon color, the coverage depth was assigned to the nearest common
> ancestor.
>
> Same here, these proportions are k-mer proportions, not cell proportions.
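
The taxonomic assignment described in the paper excerpt can be illustrated with a toy Python sketch. The taxonomy tree, the k-mer colors, and the coverage depths below are made up for illustration, the function names are hypothetical, and the final normalization by the grand total is an assumption carried over from the abundance description. This is not Ray Meta's implementation.

```python
from collections import defaultdict

# Toy taxonomy as child -> parent links (illustrative only, not read from
# Ray Meta's taxonomy files).
PARENT = {
    "Streptococcus pneumoniae": "Streptococcus",
    "Streptococcus mitis": "Streptococcus",
    "Streptococcus": "Streptococcaceae",
    "Lactococcus lactis": "Lactococcus",
    "Lactococcus": "Streptococcaceae",
    "Streptococcaceae": None,  # root of the toy tree
}


def lineage(taxon):
    """Return the path from a taxon up to the root, taxon first."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def nearest_common_ancestor(taxa):
    """Deepest taxon shared by the lineages of all given taxa."""
    taxa = list(taxa)
    common = set(lineage(taxa[0]))
    for taxon in taxa[1:]:
        common &= set(lineage(taxon))
    # Walking up one lineage, the first shared node is the deepest one.
    for node in lineage(taxa[0]):
        if node in common:
            return node
    raise ValueError("taxa share no ancestor")


def profile(kmers):
    """Sum coverage depth per assigned taxon and normalize to proportions.

    kmers is a list of (set_of_taxon_colors, coverage_depth) pairs.
    """
    totals = defaultdict(int)
    for colors, depth in kmers:
        totals[nearest_common_ancestor(colors)] += depth
    grand_total = sum(totals.values())
    return {taxon: obs / grand_total for taxon, obs in totals.items()}


# Hypothetical colored k-mers: (taxon colors, coverage depth).
kmers = [
    ({"Streptococcus pneumoniae"}, 17),
    ({"Streptococcus pneumoniae", "Streptococcus mitis"}, 20),
    ({"Streptococcus mitis", "Lactococcus lactis"}, 3),
]
proportions = profile(kmers)
for taxon, share in sorted(proportions.items()):
    print(f"{taxon}\t{share:.3f}")
```

A k-mer shared by two species of one genus contributes its depth to the genus, not to either species, which is why these are k-mer proportions and not cell proportions.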
>
>
> >
> > When I choose my two different "minimumContigLengths" parameters of 100
> > and 500, the 0.Profile.TaxonomyRank=species.tsv files are the same in
> > both analyses, but the other files,
> > 0.Profile.NCBI-Finished-Bacterial-Genomes.tsv, aren't...
>
> Example of how these are calculated:
>
> If I take sample SRS015799 from the Human Microbiome Project (buccal
> mucosa),
> I can see some S. pneumoniae ATCC 700669 in Bacterial-Genomes.tsv:
>
> $ grep 700669 SRS015799-Ray-HMP.20/Assembly/BiologicalAbundances/0.Profile.Bacteria-Genomes.tsv
> Streptococcus_pneumoniae_ATCC_700669_uid59287   0.0339856
>
>
> In the XML file
> SRS015799-Ray-HMP.20/Assembly/BiologicalAbundances/Bacteria-Genomes/SequenceAbundances.xml,
> there is more information (best viewed in a web browser):
>
> <entry><file>Streptococcus_pneumoniae_ATCC_700669_uid59287</file>
> <sequence>0</sequence><name>gi|221230948|ref|NC_011900.1| Streptococcus pneumoniae ATCC 7006</name>
> <kmerLength>31</kmerLength><lengthInKmers>2221285</lengthInKmers>
>
> <raw><kmerMatches>856331</kmerMatches><proportion>0.385512</proportion><modeKmerCoverage>2</modeKmerCoverage></raw>
>
> <uniquelyColored><kmerMatches>10422</kmerMatches><proportion>0.00469188</proportion><modeKmerCoverage>17</modeKmerCoverage></uniquelyColored>
>
> <assembled><kmerMatches>574340</kmerMatches><proportion>0.258562</proportion><modeKmerCoverage>19</modeKmerCoverage></assembled>
>
> <uniquelyColoredAndAssembled><kmerMatches>6801</kmerMatches><proportion>0.00306174</proportion><modeKmerCoverage>17</modeKmerCoverage></uniquelyColoredAndAssembled>
>
> <qualityControl><correlationColoredVsRaw>0.87672</correlationColoredVsRaw><correlationAssembledVsRaw>0.56542</correlationAssembledVsRaw><correlationAssembledVsColored>0.84475</correlationAssembledVsColored><hasPeak>1</hasPeak><hasHighFrequency>0</hasHighFrequency></qualityControl>
> <demultiplexedKmerObservations>14557627</demultiplexedKmerObservations>
> <proportion>0.0339856</proportion>
> </entry>
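
The fields of such an entry can be read with Python's standard xml.etree.ElementTree module. The sketch below embeds a trimmed copy of the entry quoted above and recomputes each per-category proportion as kmerMatches divided by lengthInKmers. That relation is inferred from the numbers in this entry and is not stated in the thread, so treat it as an assumption.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <entry> quoted above; only the fields used here are kept.
ENTRY_XML = """
<entry>
  <file>Streptococcus_pneumoniae_ATCC_700669_uid59287</file>
  <kmerLength>31</kmerLength>
  <lengthInKmers>2221285</lengthInKmers>
  <raw><kmerMatches>856331</kmerMatches><proportion>0.385512</proportion><modeKmerCoverage>2</modeKmerCoverage></raw>
  <uniquelyColored><kmerMatches>10422</kmerMatches><proportion>0.00469188</proportion><modeKmerCoverage>17</modeKmerCoverage></uniquelyColored>
  <assembled><kmerMatches>574340</kmerMatches><proportion>0.258562</proportion><modeKmerCoverage>19</modeKmerCoverage></assembled>
  <uniquelyColoredAndAssembled><kmerMatches>6801</kmerMatches><proportion>0.00306174</proportion><modeKmerCoverage>17</modeKmerCoverage></uniquelyColoredAndAssembled>
</entry>
"""

CATEGORIES = ("raw", "uniquelyColored", "assembled", "uniquelyColoredAndAssembled")

entry = ET.fromstring(ENTRY_XML)
length_in_kmers = int(entry.findtext("lengthInKmers"))

reported = {}
recomputed = {}
for category in CATEGORIES:
    node = entry.find(category)
    matches = int(node.findtext("kmerMatches"))
    reported[category] = float(node.findtext("proportion"))
    # Assumption: proportion = matched k-mers / reference length in k-mers.
    recomputed[category] = matches / length_in_kmers
    print(f"{category}: reported={reported[category]} "
          f"recomputed={recomputed[category]:.6f}")
```

For this entry the recomputed values agree with the reported ones to the printed precision, which suggests the per-category proportion is the fraction of the reference sequence's k-mers found in the graph.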
>
> At the top of this XML file, there are these numbers:
>
> <totalAssembledKmerObservations>737864381</totalAssembledKmerObservations>
> <totalAssembledKmers>30964275</totalAssembledKmers>
> <totalColoredKmerObservations>428347476</totalColoredKmerObservations>
> <totalColoredKmers>19288935</totalColoredKmers>
>
> <totalColoredKmerObservation_EMBL_CDS>185161995</totalColoredKmerObservation_EMBL_CDS>
>
> <totalAssembledColoredKmerObservations>365720420</totalAssembledColoredKmerObservations>
> <totalAssembledColoredKmers>11583088</totalAssembledColoredKmers>
>
> The proportion 3.39% (from the .tsv file) is the result of the division of
> <demultiplexedKmerObservations> (14557627) by
> <totalColoredKmerObservations> (which is 428347476), the sum of coverage
> depth values of all colored k-mers.
>
> 14557627.0/428347476 => 0.0339856
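
The arithmetic can be checked with a short Python script that uses only numbers quoted above. The variable names are illustrative. Pairing the raw kmerMatches with the uniquelyColored modeKmerCoverage follows the wording of the paper excerpt, and for this entry it reproduces demultiplexedKmerObservations exactly. The script prints the ratio against each colored total so that it can be compared with the 0.0339856 reported in the .tsv file.

```python
# Values copied from the <entry> for S. pneumoniae ATCC 700669.
raw_kmer_matches = 856331            # <raw><kmerMatches>
uniquely_colored_mode_coverage = 17  # <uniquelyColored><modeKmerCoverage>
reported_demultiplexed_obs = 14557627
reported_proportion = 0.0339856

# "Number of k-mer observations" as described in the paper excerpt:
# matched k-mers times the mode coverage of uniquely colored k-mers.
demultiplexed_obs = raw_kmer_matches * uniquely_colored_mode_coverage

# Candidate denominators from the top of the XML file.
totals = {
    "totalColoredKmerObservations": 428347476,
    "totalAssembledColoredKmerObservations": 365720420,
}

ratios = {name: demultiplexed_obs / total for name, total in totals.items()}

print(f"demultiplexed k-mer observations: {demultiplexed_obs}")
for name, ratio in ratios.items():
    matches = abs(ratio - reported_proportion) < 1e-6
    print(f"{name}: {ratio:.7f} (matches .tsv value: {matches})")
```

With these numbers, only the division by totalColoredKmerObservations reproduces 0.0339856. The division by totalAssembledColoredKmerObservations gives about 0.0398.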
>
>
> By changing the minimum contig length to 500, you are not assembling a
> sizable part of the graph
> because this parameter is also used as a filter for the minimum seed
> length.
>
>
> In the next release, 2.3.1, the parameter -minimum-contig-length will no
> longer affect seed selection, and an additional parameter called
> -minimum-seed-length will be added.
>
> >
> > What information is combined to form that particular file, and why are
> > my two other files exactly the same?
> >
>
> The taxonomy algorithm has no dependency on the number of assembled kmers
> because it only uses colored kmers in the leaves
> of the taxonomy tree (where the Last Common Ancestor or LCA is used).
>
> The abundance estimation for genome sequences (via -search) needs the
> total number of assembled kmers because
> of the demultiplexing process (see paper link above).
>
>
> > Thanks!
>
> Thanks for this very good question.
>
> I hope I answered appropriately.
>
> >
> > --
> > Jean-Christophe Grenier, M.Sc.
> >
>
>
>        seb
>
> > -----------------------------------------
> > Bioinformatician
> > Philip Awadalla Laboratory
> > Luis Barreiro Laboratory
> > CHU Sainte-Justine
> > 3175, Côte Sainte-Catherine, local B-607
> > Tel.: 514-345-4931 ext. 5199
> > -----------------------------------------
>
>
>
> _______________________________________________
> Denovoassembler-users mailing list
> Denovoassembler-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
>



-- 

Francesco Strozzi

