On 18.09.2012 11:22, Torsten Seemann wrote:
>> Does diginorm works with read paths in the de Bruijn stored in its Bloom
>> filter ?
>> If not, not only diginorm does not use pairs, but it does not use read
>> threading neither.
> It does nothing that complicated. Just looks at kmer frequencies and throws
> out reads that have k-mers we've already seen enough times. But it's
> one-pass, fast, memory efficient algorithm. Not good for GENOME de novo
> assembly at the moment, but very useful for de novo RNA-Seq as it flattens
> the coverages of transcripts.

Whether or not diginorm works well for RNA-Seq is debatable. I've included an email from Brian Haas (the lead Trinity developer) on the subject. He seems to think it would be better suited to de novo *genome* assembly (which has approximately uniform coverage) than to de novo *transcriptome* assembly (which has a different coverage for each transcript / isoform):

-------- Original Message --------
Subject: [Trinityrnaseq-developers] new diginorm-like implementation in Trinity
Date: Tue, 28 Aug 2012 09:56:47 -0400
From: Brian Haas <bh...@broadinstitute.org>
To: trinityrnaseq-develop...@lists.sourceforge.net <trinityrnaseq-develop...@lists.sourceforge.net>

Hi all,

My initial attempts to use diginorm with Trinity didn't fare so well - putting it mildly (but in diginorm's defense, I may not have run it correctly or optimally).

The diginorm formula for normalization:

    for read in dataset:
        if estimated_coverage(read) < C:
            accept(read)
        else:
            discard(read)

might theoretically and practically work fine for genomic sequencing data where the expectation is uniform coverage, but in the case of rna-seq data, where coverage is log-normally distributed, the above formula would be expected to deplete most all 'good' reads for moderately to highly expressed transcripts, and instead enrich for error-containing reads for those transcripts - given that error-containing reads are more likely to fall below the set maximum coverage threshold.

By changing the above formula to:

    for read in dataset:
        if estimated_coverage(read) < C:
            accept(read)
        else:
            accept(read) with probability (C/estimated_coverage(read))

we simply downsample the highly expressed transcripts to a maximum coverage of C.

I've built this into Trinity over the last few days and, from what I can tell so far, it works AWESOME!!! And, what I mean by this is that with a small fraction of total reads, I'm obtaining near 100% transcript reconstruction as with the entire read data set.

I'll be hammering out some more of these details over the next few days, but wanted to share my thoughts on it now, and want to think about including some aspect of this in the protocol paper, assuming it continues to pan out well in my evaluation under way.

In case anyone's curious, I've been keeping Titus apprised of these activities.

best,

-b

--
Brian J. Haas
Manager, Genome Annotation and Analysis, Research and Development
The Broad Institute
http://broad.mit.edu/~bhaas

_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users