Dear UCSC genome browser,
Hi. My name is Sung Wook Chi, a PhD student at Rockefeller University.
I've been using your BLAT program to map high-throughput sequence
data from Solexa/Illumina to the genome without any problems.
However, some people have objected to using BLAT instead of BLASTN, arguing
that BLAT may miss some matches on the genome.
I would like to ask your opinion on this issue, and I have included the
complaint below. I would greatly appreciate it if you could tell me your
thoughts about this.
My point was that BLAT will miss true hits that most likely matter. What the
authors, and many people in the community, do not realize is that BLAT will
miss even exact matches and will miss them by a mile. Let me highlight this
last statement with the following 33 nucleotide sequence:
GAGCCACCATGTGGTTGCTGGGAATTGAACTCA
The authors can verify, by going to www.ensembl.org and running BLAT with
default settings on the mouse genome that BLAT will find no hits whatsoever.
Now, if the authors select BLASTN with the default "Near-exact matches" they
will find a handful of hits in the current chromosome assemblies. If they
replace "Near-exact matches" with "Allow some local mismatch" and re-run
BLASTN, the number of hits that they will find will increase. Finally,
with "Allow some local mismatch" selected, they can hit 'Configure' and
uncheck the 'RepeatMasker' filter, then rerun: this will take a while to
complete; upon completion, BLASTN will report "20447 alignments". Note that
even this number is an undercount: the above 33mer has nearly 8,000 exact,
full-length copies in the mouse genome, and an additional 24,000 copies with a
single-letter mismatch for a total of more than 32,000 copies with at most one
mismatch anywhere along the length of the 33-mer. This means that BLAT (and
BLASTN) underestimate the number of hits even if the allowed error is very
small (my
experience is that independent of method used, high-throughput sequencing data
have substantially more than one base error in 33 nucleotides of sequence).
Why does this point matter? Because BLAT's inherent inability will mean that
people cannot properly place all of their reads on the genome all of the time,
which will in turn affect the computed key estimates of false positives and false
negatives for their method.
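
The counts quoted above (exact copies versus copies with a single mismatch) could in principle be checked independently of either tool with a brute-force scan. The following is only a minimal sketch under stated assumptions: the function names `hamming_within` and `count_hits` are hypothetical, only the forward strand is scanned, the genome is assumed to be already loaded as a plain string, and the naive sliding window is far too slow for a whole mouse genome without indexing. It is not how BLAT or BLASTN work internally.

```python
def hamming_within(a, b, limit):
    """Return the mismatch count of two equal-length strings,
    or None as soon as the count exceeds `limit`."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > limit:
                return None
    return mismatches


def count_hits(genome, query, max_mismatches=1):
    """Count full-length windows of `genome` that match `query`,
    grouped by mismatch count: counts[0] is exact, counts[1] is one mismatch."""
    counts = [0] * (max_mismatches + 1)
    k = len(query)
    for start in range(len(genome) - k + 1):
        d = hamming_within(genome[start:start + k], query, max_mismatches)
        if d is not None:
            counts[d] += 1
    return counts


# The 33-mer from the complaint, placed in a toy sequence (not real genome data):
# one exact copy and one copy with a single substitution at position 10.
query = "GAGCCACCATGTGGTTGCTGGGAATTGAACTCA"
toy = "TTT" + query + "ACGT" + query[:10] + "T" + query[11:] + "GG"
print(count_hits(toy, query))  # prints [1, 1]
```

On a real assembly the same scan would need to be run per chromosome and on the reverse complement as well, which this sketch leaves out.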
Thank you very much for your help and for the BLAT program.
Sincerely,
Chi, Sung Wook
-------------------------------------------------
PhD Student, Tri-institutional Program in Computational Biology and Medicine.
Laboratory of Molecular Neuro-oncology,
The Rockefeller University.
1230 York Ave., Box 226
New York, NY 10021
Tel(lab): 212-327-7461
E-mail: [email protected]
_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome