Re: [Denovoassembler-users] Digital normalisation prior to a Ray run

David Eccles (gringer) Mon, 17 Sep 2012 16:47:20 -0700

On 18.09.2012 11:22, Torsten Seemann wrote:
>>  Does diginorm work with read paths in the de Bruijn graph stored in its
>> Bloom filter?
>> If not, then not only does diginorm not use pairs, it does not use read
>> threading either.
> It does nothing that complicated. It just looks at k-mer frequencies and
> throws out reads that have k-mers we've already seen enough times. But it's
> a one-pass, fast, memory-efficient algorithm. Not good for GENOME de novo
> assembly at the moment, but very useful for de novo RNA-Seq as it flattens
> the coverages of transcripts.


Whether or not diginorm works well for RNA-Seq is debatable. I've
included an email from Brian Haas (the lead Trinity developer) on the
subject. He seems to think it would be better for de-novo *genome*
assembly (which has approximately uniform coverage) instead of de-novo
*transcriptome* assembly (which has different coverages for each
transcript / isoform):

-------- Original Message --------
Subject: [Trinityrnaseq-developers] new diginorm-like implementation in
Trinity
Date: Tue, 28 Aug 2012 09:56:47 -0400
From: Brian Haas <bh...@broadinstitute.org>
To: trinityrnaseq-develop...@lists.sourceforge.net
<trinityrnaseq-develop...@lists.sourceforge.net>

Hi all,

My initial attempts to use diginorm with Trinity didn't fare so well -
putting it mildly (but in diginorm's defense, I may not have run it
correctly or optimally).  The diginorm formula for normalization:

for read in dataset:
    if estimated_coverage(read) < C:
        accept(read)
    else:
        discard(read)


might theoretically and practically work fine for genomic sequencing
data where the expectation is uniform coverage, but in the case of
rna-seq data, where coverage is log-normally distributed, the above
formula would be expected to deplete nearly all 'good' reads for
moderately to highly expressed transcripts, and instead enrich for
error-containing reads for those transcripts - given that
error-containing reads are more likely to fall below the set maximum
coverage threshold.

By changing the above formula to:

for read in dataset:
    if estimated_coverage(read) < C:
        accept(read)
    else:
        accept(read) with probability (C/estimated_coverage(read))

we simply downsample the highly expressed transcripts to a maximum
coverage of C.
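
A minimal Python sketch of the modified rule above, for illustration only.
It is not the Trinity or diginorm code. It assumes that estimated_coverage()
is the median abundance of a read's k-mers, that k-mer counts are exact and
taken over the whole data set in a first pass (a plain Counter rather than a
Bloom filter or similar structure), and that reads are plain strings. The
helper names (kmers, count_kmers, downsample) are hypothetical.

```python
import random
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return all overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    """First pass: exact k-mer counts over the whole data set (assumption)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def estimated_coverage(read, counts, k):
    """Assumed estimator: median abundance of the read's k-mers."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0  # read shorter than k
    return median(counts[km] for km in read_kmers)


def downsample(reads, k, C, rng=None):
    """Second pass: keep low-coverage reads, keep others with probability C/coverage."""
    if C <= 0:
        raise ValueError("C must be positive")
    rng = rng or random.Random()
    counts = count_kmers(reads, k)
    kept = []
    for read in reads:
        cov = estimated_coverage(read, counts, k)
        # cov >= C > 0 whenever the division is evaluated.
        if cov < C or rng.random() < C / cov:
            kept.append(read)
    return kept
```

Under these assumptions, a transcript whose reads have estimated coverage D
above C keeps each read with probability C/D, so about C of every D reads
survive on average, while reads below C are always kept. A call might look
like downsample(reads, k=25, C=30), where the values of k and C are
arbitrary examples.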

I've built this into Trinity over the last few days and, from what I
can tell so far, it works AWESOME!!! What I mean by this is that, with
a small fraction of the total reads, I'm obtaining close to 100% of the
transcript reconstruction obtained with the entire read data set.

I'll be hammering out some more of these details over the next few
days, but wanted to share my thoughts on it now, and want to think
about including some aspect of this in the protocol paper, assuming it
continues to pan out well in my evaluation under way.  In case
anyone's curious, I've been keeping Titus apprised of these
activities.

best,

-b

-- 
--
Brian J. Haas
Manager, Genome Annotation and Analysis, Research and Development
The Broad Institute
http://broad.mit.edu/~bhaas

