[Genome] Question about BLAT: The case which BLAT will miss true hits.

Chi, Sung Wook Wed, 20 May 2009 08:37:47 -0700

Dear UCSC genome browser,

Hi. My name is Sung Wook Chi, a PhD student at Rockefeller University.
I've been using your BLAT program to map high-throughput sequence data from
Solexa/Illumina to the genome without any problem.
However, some people have objected to my using BLAT instead of BLASTN, arguing
that BLAT may miss some matches on the genome.


So I'd like to ask for your opinion on this issue; I have included the
complaint below. I would greatly appreciate it if you could tell me your
thoughts on it.

My point was that BLAT will miss true hits that most likely matter. What the 
authors, and many people in the community, do not realize is that BLAT will 
miss even exact matches and will miss them by a mile. Let me highlight this 
last statement with the following 33 nucleotide sequence:
        GAGCCACCATGTGGTTGCTGGGAATTGAACTCA
The authors can verify, by going to www.ensembl.org and running BLAT with 
default settings on the mouse genome that BLAT will find no hits whatsoever. 
Now, if the authors select BLASTN with the default "Near-exact matches" they 
will find a handful of hits in the current chromosome assemblies. If they 
replace "Near-exact matches" with "Allow some local mismatch" and re-run 
BLASTN, the number of hits that they will find will increase. Finally, 
with "Allow some local mismatch" selected, they can hit 'Configure' and
uncheck the 'RepeatMasker' filter, then rerun: this will take a while to 
complete; upon completion, BLASTN will report "20447 alignments". Note that 
even this number is an undercount: the above 33mer has nearly 8,000 exact, 
full-length copies in the mouse genome, and an additional 24,000 copies with a 
single-letter mismatch for a total of more than 32,000 copies with at most one 
mismatch anywhere along the length of the 33-mer. This means that BLAT (and
BLASTN) underestimate the number of hits even if the allowed error is very
small (my experience is that, independent of the method used, high-throughput
sequencing data have substantially more than one base error in 33 nucleotides
of sequence).
Why does this point matter? Because BLAT's inherent inability will mean that 
people cannot properly place all of their reads on the genome all of the time, 
which will in turn affect the computed key estimates of false positives and false 
negatives for their method.
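
For anyone who wants to check copy counts like the ones quoted above
independently of BLAT or BLASTN, a brute-force scan is one option. The
following is only a minimal sketch under stated assumptions: it is not how
BLAT or BLASTN work internally, the function names are hypothetical, the toy
"genome" is made up for illustration, and a pure-Python loop like this would
be far too slow for a whole mouse assembly. It counts windows on either strand
that match the query with at most a given number of substitutions (no indels).

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (N stays N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def mismatches_within(a, b, limit):
    """Count mismatches between equal-length strings.

    Returns the count, or None as soon as it exceeds `limit`.
    """
    count = 0
    for x, y in zip(a, b):
        if x != y:
            count += 1
            if count > limit:
                return None
    return count


def count_hits(genome, query, max_mismatches=1):
    """Count windows matching `query` on either strand.

    Returns a dict mapping mismatch count -> number of windows.
    Only substitutions are considered; gaps are not (an assumption
    of this sketch).
    """
    genome = genome.upper()
    query = query.upper()
    k = len(query)
    patterns = {query, reverse_complement(query)}
    counts = {d: 0 for d in range(max_mismatches + 1)}
    for start in range(len(genome) - k + 1):
        window = genome[start:start + k]
        best = None
        for pattern in patterns:
            d = mismatches_within(window, pattern, max_mismatches)
            if d is not None and (best is None or d < best):
                best = d
        if best is not None:
            counts[best] += 1
    return counts


if __name__ == "__main__":
    query = "GAGCCACCATGTGGTTGCTGGGAATTGAACTCA"
    # Made-up toy sequence: one exact copy, one copy with a single
    # substitution, and one reverse-complement copy, separated by Ns.
    spacer = "N" * 40
    toy_genome = spacer.join(
        [query, "T" + query[1:], reverse_complement(query)]
    )
    print(count_hits(toy_genome, query, max_mismatches=1))
```

Comparing counts from a scan like this against the number of alignments a
tool reports is one way to see how many copies of a repetitive query go
unreported at a given mismatch threshold.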

Thank you very much for your help and for the BLAT program.

Sincerely,

Chi, Sung Wook



Chi, Sung Wook
-------------------------------------------------
PhD Student, Tri-institutional Program in Computational Biology and Medicine.
Laboratory of Molecular Neuro-oncology,
The Rockefeller University.
1230 York Ave., Box 226
New York, NY 10021

Tel(lab): 212-327-7461
E-mail: [email protected]
       


_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
