On 26/11/13 04:18 AM, Francesco Strozzi wrote:
> Hi all,
> I would like to raise a question on this point of the Biological Abundances
> estimation.
> How can we calculate the proportion of information that is not classified by
> Ray Meta, i.e. the part of the assembled information that is not colored?
> Is it possible to derive that information from the ratio between
> totalAssembledKmerObservations and totalAssembledColoredKmerObservations
> in that XML file?

Yes, that sounds like a good idea.

> Does this information also hold for the other taxonomic levels (i.e. genus
> etc.) or is it mainly related to the species-level classification?

The number in the tag <totalAssembledColoredKmerObservations> is really the
number of kmer observations for kmers that are both assembled and colored
(by sequences provided by -search).

> Thanks and regards
> Francesco
>
> On Fri, Nov 15, 2013 at 5:24 PM, Sébastien Boisvert
> <sebastien.boisver...@ulaval.ca> wrote:
>
> > On 12/11/13 04:18 PM, JC Grenier wrote:
> > > Hello again,
> >
> > Hi,
> >
> > > I'm now working on the biological abundance analysis part and got some
> > > questions about the outputs that I'm getting. I'm using the
> > > NCBI-Finished-Bacterial-Genomes definitions (NCBI-Taxonomy) that you
> > > are suggesting in your manual.
> > >
> > > My question is how do you make these final result files:
> > >
> > > OUTPUT/BiologicalAbundances/0.Profile.NCBI-Finished-Bacterial-Genomes.tsv
> >
> > From the paper ( http://genomebiology.com/2012/13/12/R122#sec5 ):
> >
> > Demultiplexing signals from similar bacterial strains
> >
> > Biological abundances were estimated using the product of the number of
> > k-mers matched in the distributed de Bruijn graph by the mode coverage of
> > k-mers that were uniquely colored. This number is called the number of k-mer
> > observations. The total number of k-mer observations is the sum of coverage
> > depth values of all colored k-mers. A proportion is calculated by dividing
> > the number of k-mer observations by the total.
> >
> > These proportions are k-mer proportions, not cell proportions.
> >
> > > and
> > >
> > > OUTPUT/BiologicalAbundances/0.Profile.TaxonomyRank=species.tsv
> >
> > From the paper ( http://genomebiology.com/2012/13/12/R122#sec5 ):
> >
> > Taxonomic profiling
> >
> > All bacterial genomes available in GenBank [47] were utilized for
> > coloring the distributed de Bruijn graphs (Table S4 in Additional file 1).
> > Each k-mer was assigned to a taxon in the taxonomic tree.
> > When a k-mer has more than one taxon color, the coverage depth was
> > assigned to the nearest common ancestor.
> >
> > Same here, these proportions are k-mer proportions, not cell proportions.
> >
> > > When I choose my 2 different "minimumContigLengths" parameters of 100
> > > and 500, the 0.Profile.TaxonomyRank=species.tsv files are the same in both
> > > analyses, but the other files, 0.Profile.NCBI-Finished-Bacterial-Genomes.tsv,
> > > aren't...
> >
> > Example of how these are calculated:
> >
> > If I take sample SRS015799 from the Human Microbiome Project (buccal
> > mucosa), I can see some S. pneumoniae ATCC 700669 in Bacterial-Genomes.tsv:
> >
> > $ grep 700669 SRS015799-Ray-HMP.20/Assembly/BiologicalAbundances/0.Profile.Bacteria-Genomes.tsv
> > Streptococcus_pneumoniae_ATCC_700669_uid59287 0.0339856
> >
> > In the XML file
> > SRS015799-Ray-HMP.20/Assembly/BiologicalAbundances/Bacteria-Genomes/SequenceAbundances.xml,
> > there is more information (best viewed in a web browser):
> >
> > <entry><file>Streptococcus_pneumoniae_ATCC_700669_uid59287</file>
> > <sequence>0</sequence><name>gi|221230948|ref|NC_011900.1| Streptococcus pneumoniae ATCC 7006</name>
> > <kmerLength>31</kmerLength><lengthInKmers>2221285</lengthInKmers>
> > <raw><kmerMatches>856331</kmerMatches><proportion>0.385512</proportion><modeKmerCoverage>2</modeKmerCoverage></raw>
> > <uniquelyColored><kmerMatches>10422</kmerMatches><proportion>0.00469188</proportion><modeKmerCoverage>17</modeKmerCoverage></uniquelyColored>
> > <assembled><kmerMatches>574340</kmerMatches><proportion>0.258562</proportion><modeKmerCoverage>19</modeKmerCoverage></assembled>
> > <uniquelyColoredAndAssembled><kmerMatches>6801</kmerMatches><proportion>0.00306174</proportion><modeKmerCoverage>17</modeKmerCoverage></uniquelyColoredAndAssembled>
> > <qualityControl><correlationColoredVsRaw>0.87672</correlationColoredVsRaw><correlationAssembledVsRaw>0.56542</correlationAssembledVsRaw><correlationAssembledVsColored>0.84475</correlationAssembledVsColored><hasPeak>1</hasPeak><hasHighFrequency>0</hasHighFrequency></qualityControl>
> > <demultiplexedKmerObservations>14557627</demultiplexedKmerObservations>
> > <proportion>0.0339856</proportion>
> > </entry>
> >
> > At the top of this XML file, there are these numbers:
> >
> > <totalAssembledKmerObservations>737864381</totalAssembledKmerObservations>
> > <totalAssembledKmers>30964275</totalAssembledKmers>
> > <totalColoredKmerObservations>428347476</totalColoredKmerObservations>
> > <totalColoredKmers>19288935</totalColoredKmers>
> > <totalColoredKmerObservation_EMBL_CDS>185161995</totalColoredKmerObservation_EMBL_CDS>
> > <totalAssembledColoredKmerObservations>365720420</totalAssembledColoredKmerObservations>
> > <totalAssembledColoredKmers>11583088</totalAssembledColoredKmers>
> >
> > The proportion 3.39% (from the .tsv file) is the result of the division
> > of <demultiplexedKmerObservations> (14557627) by
> > <totalAssembledColoredKmerObservations> (which is 365720420).
> >
> > 14557627.0/365720420 => 0.03980534365568102
> >
> > By changing the minimum contig length to 500, you are not assembling a
> > sizable part of the graph, because this parameter is also used as a filter
> > for the minimum seed length.
> >
> > In the next release, 2.3.1, the parameter -minimum-contig-length won't
> > affect the seed selection, and an additional parameter called
> > -minimum-seed-length will be added.
> >
> > > What information is combined in order to form that particular file and
> > > why are my two other files exactly the same?
> >
> > The taxonomy algorithm has no dependency on the number of assembled kmers
> > because it only uses colored kmers in the leaves
> > of the taxonomy tree (where the Last Common Ancestor or LCA is used).
> >
> > The abundance estimation for genome sequences (via -search) needs the
> > total number of assembled kmers because
> > of the demultiplexing process (see paper link above).
> >
> > > Thanks!
> >
> > Thanks for this very good question.
> >
> > I hope I answered appropriately.
> >
> > seb
> >
> > > --
> > > Jean-Christophe Grenier, M.Sc.
> > > Bioinformatician
> > > Laboratoire de Philip Awadalla
> > > Laboratoire de Luis Barreiro
> > > CHU Sainte-Justine
> > > 3175, Côte Sainte-Catherine, local B-607
> > > Tel: 514-345-4931 ext. 5199
>
> --
> Francesco Strozzi

_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
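
The arithmetic discussed in this thread can be checked with a short script. What follows is a minimal Python sketch, not part of Ray: it hard-codes the values quoted in the thread from SequenceAbundances.xml, and the helper names (`uncolored_fraction`, `proportion`) are made up for illustration. It assumes that the unclassified share asked about in the thread can be taken as 1 minus the ratio of totalAssembledColoredKmerObservations to totalAssembledKmerObservations, which is a reading of the exchange rather than a documented Ray formula.

```python
# Totals copied from the top of SequenceAbundances.xml as quoted in the thread.
totals = {
    "totalAssembledKmerObservations": 737864381,
    "totalColoredKmerObservations": 428347476,
    "totalAssembledColoredKmerObservations": 365720420,
}

# Values from the S. pneumoniae ATCC 700669 <entry> quoted in the thread.
raw_kmer_matches = 856331                   # <raw><kmerMatches>
uniquely_colored_mode_coverage = 17         # <uniquelyColored><modeKmerCoverage>
demultiplexed_kmer_observations = 14557627  # <demultiplexedKmerObservations>
reported_proportion = 0.0339856             # <proportion>, same as the .tsv file


def uncolored_fraction(t):
    """Share of assembled k-mer observations that carry no color (assumed definition)."""
    colored = t["totalAssembledColoredKmerObservations"]
    return 1.0 - colored / t["totalAssembledKmerObservations"]


def proportion(observations, total):
    """Plain ratio of a sequence's k-mer observations to a chosen total."""
    return observations / total


if __name__ == "__main__":
    # Product described in the paper excerpt: matched k-mers times the mode
    # coverage of uniquely colored k-mers.
    product = raw_kmer_matches * uniquely_colored_mode_coverage
    print("k-mer matches x mode coverage:", product)
    print("equals demultiplexedKmerObservations:",
          product == demultiplexed_kmer_observations)

    # Try both candidate denominators against the reported proportion.
    for tag in ("totalColoredKmerObservations",
                "totalAssembledColoredKmerObservations"):
        p = proportion(demultiplexed_kmer_observations, totals[tag])
        print(tag, "->", p, "| reported:", reported_proportion)

    print("uncolored share of assembled observations:",
          uncolored_fraction(totals))
```

With the figures quoted in the thread, the product of the raw kmerMatches and the uniquely colored mode coverage equals demultiplexedKmerObservations, and roughly half of the assembled k-mer observations turn out to be uncolored. The value 0.0339856 from the .tsv file is matched when dividing by totalColoredKmerObservations, whereas dividing by totalAssembledColoredKmerObservations gives the 0.0398 shown in the message. This is only an observation about these particular numbers; it has not been checked against the Ray source code.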