Hi Robert, Sorry for the late answer. I'm cc'ing the Bioc-sig-seq mailing list so my answer will be archived and searchable.
On 12/02/2010 12:52 PM, R. T. Sweeney wrote:
Dear Herve, My name is Robert Sweeney, a Stanford post-doc, new to R, looking forward to attending the R/bioconductor workshop Dec 9-10. I have a question about the R dustmasker-like script you suggested for removing low complexity short reads from fastq files. Here is the thread you wrote on it: https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2009-February/000170.html I am also new to the dustmasker type scoring. I am wondering if you could help me understand better. Attached is the output from the script. Could you briefly explain what the scores mean and about where I should choose a cut-off value?
I'm not a dustmasker expert. In fact I've never used it. There is a paper about the DUST implementation here: http://www.ncbi.nlm.nih.gov/pubmed/16796549 It seems that DUST has been used within BLAST for many years. But because it uses a sliding windows of length 64 nucleotides my understanding is that it is better suited for masking portions of a genome than for scoring the complexity of short reads. Anyway, the basic idea behind DUST is simple, intuitive, and can be easily adapted to work on short reads. The basic idea is that regions (or short reads) with a poor tri-nucleotide content (i.e. with a small number of *distinct* tri-nucleotides) are considered to have a low complexity. For example those sequences have a low complexity: AAAAAAAAAA contains only 1 tri-nculeotide: AAA ATATATATATAT contains 2 tri-nucleotide: ATA and TAT In the other hands, regions (or short reads) with a rich tri-nucleotide content (i.e. many distinct tri-nucleotides) are considered to have a high complexity. For example, the following 36-mer GGGCTACATGACGGTCCTGTATTTAGCCAGAGGATC has the highest complexity a 36-mer can have because all the tri-nucleotides contained in it (34 in total) are distinct. There are many ways to implement this idea. The one I chose is to penalize reads for which some tri-nucleotides are represented more than once. This is achieved by this part of the code: tnf2 <- tnf - 1L tnf2[tnf2 < 0] <- 0L scores <- rowSums(tnf2 * tnf2) where 'tnf' is a matrix containing the tri-nucleotide counts for each short read. The matrix has 1 row per short read and 64 columns (i.e. one column per each possible tri-nucleotide). The final 'scores' vector is just a numeric vector of length the number of short reads. As for about where to choose a cut-off value, it all depends on the kind of data you have, what you want to do with it, etc... it's really up to you. I would try different values and for each of them I would look at: reads[scores >= cut-off] where 'reads' is the DNAStringSet object containing your reads. Start using the highest score as the cut-off and see what you get (you should get only poly-As, poly-Cs, poly-Gs and poly-Ts). Then try with lower values until you are satisfied. This is of course a pragmatic approach with no serious statistical background :-)
Also, how would we generate the two output files: name_low_complexity_removed.fastq and low_complexity_reads.fastq?
You could try to use write.XStringSet(... , format="fastq") for this. Or use the IO facilities of the ShortRead package. It depends what you want to do with the scores. Alternatively you can also generate FASTA files and try to convert them to FASTQ files with an external tool.
Thanks for any help you can provide and look forward to meeting you at the workshop.
See you there. Cheers, H.
Thanks, Robert Sweeney
-- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: [email protected] Phone: (206) 667-5791 Fax: (206) 667-1319 _______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
