Hi, that sequence is from a repeat region, and it is repeat-masked by default. Starting gfServer with -repMatch=1000000 got me about 1000 hits for that sequence.
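To illustrate why raising the threshold changes the result, here is a rough Python sketch of the idea. It is a toy model, not BLAT's actual code: the function names, the 11-base tile size, the thresholds, and the made-up "genome" (the 33-mer repeated 50 times) are all my own assumptions for the example. The index keeps non-overlapping tiles and drops any tile seen more often than the threshold, so a query made up entirely of over-represented tiles finds no seeds.

```python
def build_tile_index(genome, tile_size=11, rep_match=1024):
    """Index non-overlapping tiles of the genome (toy model).

    Tiles that occur more than rep_match times are dropped, which is the
    masking effect being illustrated. All parameter values are assumptions.
    """
    positions = {}
    for start in range(0, len(genome) - tile_size + 1, tile_size):
        tile = genome[start:start + tile_size]
        positions.setdefault(tile, []).append(start)
    # Keep only tiles at or below the threshold.
    return {tile: pos for tile, pos in positions.items() if len(pos) <= rep_match}


def seed_hits(index, query, tile_size=11):
    """Look up every overlapping k-mer of the query in the tile index."""
    hits = []
    for q in range(len(query) - tile_size + 1):
        for g in index.get(query[q:q + tile_size], []):
            hits.append((q, g))  # (offset in query, offset in genome)
    return hits


QUERY = "GAGCCACCATGTGGTTGCTGGGAATTGAACTCA"
toy_genome = QUERY * 50  # hypothetical repeat-rich sequence, not real mouse data

# Low threshold: every tile of the query is over-represented and dropped.
masked = seed_hits(build_tile_index(toy_genome, rep_match=10), QUERY)
# Very high threshold: the same tiles stay in the index.
unmasked = seed_hits(build_tile_index(toy_genome, rep_match=1_000_000), QUERY)

print(len(masked), len(unmasked))  # the first number is 0
```

With the low threshold the exact 33-mer finds no seeds at all, even though the toy genome consists of nothing but copies of it. With the high threshold the seeds come back.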

--- On Wed, 5/20/09, Chi, Sung Wook <[email protected]> wrote:

> From: Chi, Sung Wook <[email protected]>
> Subject: [Genome] Question about BLAT: The case which BLAT will miss true hits.
> To: [email protected]
> Date: Wednesday, May 20, 2009, 8:12 AM
>
> Dear UCSC genome browser,
>
> Hi. My name is Sung Wook Chi, a PhD student at Rockefeller Univ.
> I've been using your BLAT program for mapping high-throughput sequence
> data from SOLEXA/Illumina onto the genome without any problem.
> But some people have complained about using BLAT instead of BLASTN,
> mentioning that BLAT may miss some matches on the genome.
>
> So I'd like to ask your opinion and thoughts about this issue, and I
> have attached the complaint. I would greatly appreciate it if you could
> tell me your thoughts about this.
>
> My point was that BLAT will miss true hits that most likely matter.
> What the authors, and many people in the community, do not realize is
> that BLAT will miss even exact matches and will miss them by a mile.
> Let me highlight this last statement with the following 33 nucleotide
> sequence:
>
> GAGCCACCATGTGGTTGCTGGGAATTGAACTCA
>
> The authors can verify, by going to www.ensembl.org and running BLAT
> with default settings on the mouse genome, that BLAT will find no hits
> whatsoever. Now, if the authors select BLASTN with the default
> "Near-exact matches" they will find a handful of hits in the current
> chromosome assemblies. If they replace "Near-exact matches" with
> "Allow some local mismatch" and re-run BLASTN, the number of hits that
> they will find will increase. Finally, with "Allow some local mismatch"
> selected, they can hit 'Configure' and uncheck the 'RepeatMasker'
> filter, then rerun: this will take a while to complete; upon
> completion, BLASTN will report "20447 alignments". Note that even this
> number is an undercount: the above 33mer has nearly 8,000 exact,
> full-length copies in the mouse genome, and an additional 24,000 copies
> with a single-letter mismatch, for a total of more than 32,000 copies
> with at most one mismatch anywhere along the length of the 33-mer.
> This means that BLAT (and BLASTN) underestimate the number of hits even
> if the allowed error is very small (my experience is that, independent
> of the method used, high-throughput sequencing data have substantially
> more than one base error in 33 nucleotides of sequence).
>
> Why does this point matter? Because BLAT's inherent inability will mean
> that people cannot properly place all of their reads on the genome all
> of the time, which will in turn affect the computed key estimates of
> false positives and false negatives for their method.
>
> Thank you very much for you and your BLAT program.
>
> Sincerely,
>
> Chi, Sung Wook
>
> Chi, Sung Wook
> -------------------------------------------------
> PhD Student, Tri-institutional Program in Computational
> Biology and Medicine.
> Laboratory of Molecular Neuro-oncology,
> The Rockefeller University.
> 1230 York Ave., Box 226
> New York, NY 10021
>
> Tel(lab): 212-327-7461
> E-mail: [email protected]
>
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
