On 12/11/13 04:18 PM, JC Grenier wrote:
> Hello again,

Hi,

> I'm now working on the biological abundance analysis part and have some
> questions about the outputs that I'm getting. I'm using the
> NCBI-Finished-Bacterial-Genomes definitions (NCBI-Taxonomy) that you
> suggest in your manual.
>
> My question is how do you make these final result files:
>
> OUTPUT/BiologicalAbundances/0.Profile.NCBI-Finished-Bacterial-Genomes.tsv

From the paper ( http://genomebiology.com/2012/13/12/R122#sec5 ):

    Demultiplexing signals from similar bacterial strains

    Biological abundances were estimated using the product of the number
    of k-mers matched in the distributed de Bruijn graph by the mode
    coverage of k-mers that were uniquely colored. This number is called
    the number of k-mer observations. The total number of k-mer
    observations is the sum of coverage depth values of all colored
    k-mers. A proportion is calculated by dividing the number of k-mer
    observations by the total.

These proportions are k-mer proportions, not cell proportions.

> and
> OUTPUT/BiologicalAbundances/0.Profile.TaxonomyRank=species.tsv

From the paper ( http://genomebiology.com/2012/13/12/R122#sec5 ):

    Taxonomic profiling

    All bacterial genomes available in GenBank [47] were utilized for
    coloring the distributed de Bruijn graphs (Table S4 in Additional
    file 1). Each k-mer was assigned to a taxon in the taxonomic tree.
    When a k-mer has more than one taxon color, the coverage depth was
    assigned to the nearest common ancestor.

Same here, these proportions are k-mer proportions, not cell proportions.

> When I choose my two different "minimumContigLengths" parameters of 100 and
> 500, the 0.Profile.TaxonomyRank=species.tsv files are the same in both
> analyses but the other files,
> 0.Profile.NCBI-Finished-Bacterial-Genomes.tsv, aren't...

Example of how these are calculated:

If I take sample SRS015799 from the Human Microbiome Project (buccal
mucosa), I can see some S.
pneumoniae ATCC 700669 in 0.Profile.Bacteria-Genomes.tsv:

    $ grep 700669 SRS015799-Ray-HMP.20/Assembly/BiologicalAbundances/0.Profile.Bacteria-Genomes.tsv
    Streptococcus_pneumoniae_ATCC_700669_uid59287 0.0339856

In the XML file
SRS015799-Ray-HMP.20/Assembly/BiologicalAbundances/Bacteria-Genomes/SequenceAbundances.xml,
there is more information (best viewed in a web browser):

    <entry><file>Streptococcus_pneumoniae_ATCC_700669_uid59287</file>
    <sequence>0</sequence><name>gi|221230948|ref|NC_011900.1| Streptococcus pneumoniae ATCC 7006</name>
    <kmerLength>31</kmerLength><lengthInKmers>2221285</lengthInKmers>
    <raw><kmerMatches>856331</kmerMatches><proportion>0.385512</proportion><modeKmerCoverage>2</modeKmerCoverage></raw>
    <uniquelyColored><kmerMatches>10422</kmerMatches><proportion>0.00469188</proportion><modeKmerCoverage>17</modeKmerCoverage></uniquelyColored>
    <assembled><kmerMatches>574340</kmerMatches><proportion>0.258562</proportion><modeKmerCoverage>19</modeKmerCoverage></assembled>
    <uniquelyColoredAndAssembled><kmerMatches>6801</kmerMatches><proportion>0.00306174</proportion><modeKmerCoverage>17</modeKmerCoverage></uniquelyColoredAndAssembled>
    <qualityControl><correlationColoredVsRaw>0.87672</correlationColoredVsRaw><correlationAssembledVsRaw>0.56542</correlationAssembledVsRaw><correlationAssembledVsColored>0.84475</correlationAssembledVsColored><hasPeak>1</hasPeak><hasHighFrequency>0</hasHighFrequency></qualityControl>
    <demultiplexedKmerObservations>14557627</demultiplexedKmerObservations>
    <proportion>0.0339856</proportion>
    </entry>

At the top of this XML file, there are these numbers:

    <totalAssembledKmerObservations>737864381</totalAssembledKmerObservations>
    <totalAssembledKmers>30964275</totalAssembledKmers>
    <totalColoredKmerObservations>428347476</totalColoredKmerObservations>
    <totalColoredKmers>19288935</totalColoredKmers>
    <totalColoredKmerObservation_EMBL_CDS>185161995</totalColoredKmerObservation_EMBL_CDS>
    <totalAssembledColoredKmerObservations>365720420</totalAssembledColoredKmerObservations>
    <totalAssembledColoredKmers>11583088</totalAssembledColoredKmers>

The proportion 3.39% (from the .tsv file) is the result of the division
of <demultiplexedKmerObservations> (14557627) by a total number of
colored k-mer observations. Dividing by
<totalAssembledColoredKmerObservations> (365720420) gives 3.98%, so with
the numbers above the value in the file corresponds instead to
<totalColoredKmerObservations> (428347476):

    14557627.0/428347476 => 0.0339856 (rounded)
    14557627.0/365720420 => 0.03980534365568102

By changing the minimum contig length to 500, you are not assembling a
sizable part of the graph, because this parameter is also used as a
filter for the minimum seed length.

In the next release, 2.3.1, the parameter -minimum-contig-length will no
longer affect seed selection, and an additional parameter called
-minimum-seed-length will be added.

> What information is combined in order to form that particular file and why
> are my two other files exactly the same?

The taxonomy algorithm has no dependency on the number of assembled
k-mers because it only uses colored k-mers in the leaves of the taxonomy
tree (where the Last Common Ancestor, or LCA, is used).

The abundance estimation for genome sequences (via -search) needs the
total number of assembled k-mers because of the demultiplexing process
(see the paper link above).

> Thanks!

Thanks for this very good question. I hope I answered it appropriately.

seb

> --
> Jean-Christophe Grenier, M.Sc.
> -----------------------------------------
> /Bio-informaticien/
> /Laboratoire de Philip Awadalla/
> /Laboratoire de Luis Barreiro/
> /CHU Sainte-Justine/
> /3175, Côte Sainte-Catherine, local B-607/
> /Tél : 514-345-4931 poste 5199/
> -----------------------------------------

_______________________________________________
Denovoassembler-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
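
P.S. The arithmetic in the S. pneumoniae example can be checked with a few lines of Python. This is only a sketch built from the numbers quoted in this message, not Ray's actual code. Which XML fields Ray combines internally is an assumption here: the pairing of <raw><kmerMatches> with the <uniquelyColored> mode coverage is chosen because it matches the paper's description and reproduces <demultiplexedKmerObservations> exactly. The function and variable names are made up for illustration.

```python
# Values copied from the SequenceAbundances.xml excerpt quoted in this message.
KMER_MATCHES_RAW = 856331            # <raw><kmerMatches>
MODE_COVERAGE_UNIQUE = 17            # <uniquelyColored><modeKmerCoverage>
DEMULTIPLEXED_REPORTED = 14557627    # <demultiplexedKmerObservations>
TOTAL_COLORED = 428347476            # <totalColoredKmerObservations>
TOTAL_ASSEMBLED_COLORED = 365720420  # <totalAssembledColoredKmerObservations>
PROPORTION_REPORTED = 0.0339856      # <proportion>, same value as in the .tsv


def kmer_observations(kmer_matches: int, mode_coverage: int) -> int:
    """Matched k-mers times the mode coverage of uniquely colored k-mers."""
    return kmer_matches * mode_coverage


def kmer_proportion(observations: int, total_observations: int) -> float:
    """A k-mer proportion (not a cell proportion)."""
    return observations / total_observations


# Assumption: raw matches x uniquely-colored mode coverage (see note above).
demultiplexed = kmer_observations(KMER_MATCHES_RAW, MODE_COVERAGE_UNIQUE)
over_colored = kmer_proportion(demultiplexed, TOTAL_COLORED)
over_assembled_colored = kmer_proportion(demultiplexed, TOTAL_ASSEMBLED_COLORED)

print("demultiplexed k-mer observations:", demultiplexed)
print("matches the XML value:", demultiplexed == DEMULTIPLEXED_REPORTED)
print("divided by totalColoredKmerObservations: %.7f" % over_colored)
print("divided by totalAssembledColoredKmerObservations: %.7f"
      % over_assembled_colored)
```

With these inputs, the product equals the reported <demultiplexedKmerObservations>, and only the division by <totalColoredKmerObservations> lands on the 0.0339856 found in the .tsv file; the division by <totalAssembledColoredKmerObservations> gives about 0.0398.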
