I have scoured our archives and found little regarding the role of
Solexa quality scores, as reported in FASTQ output, in short-read
filtering.
My understanding is that a numerical score of -4 or greater indicates
more probability mass on the called base than on any other. In checking
1e6 reads on each of two lanes, I found the frequency of the event
"fewer than three bases have score less than -4" to be 4e-3 in one lane
and 2e-3 in the other. In other words, filtering by requiring no more
than two scores < -4 would take you from a million reads to about
2000-4000, assuming I have not taken a biased sample (I may have; I just
took the first 1e6 reads in the FASTQ file).
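
For concreteness, here is a minimal Python sketch of the check described
above. It is not the code I actually ran. It assumes Solexa-style FASTQ
with quality characters encoded as score + 64 and plain four-line
records; the function names and the file name in the usage comment are
made up for illustration.

```python
from itertools import islice

# Assumption: Solexa-style FASTQ, where score = ord(char) - 64.
SOLEXA_OFFSET = 64


def solexa_scores(quality_line):
    """Decode one FASTQ quality string into integer Solexa scores."""
    return [ord(ch) - SOLEXA_OFFSET for ch in quality_line]


def passes_filter(scores, threshold=-4, max_low=2):
    """True if at most max_low scores fall strictly below threshold."""
    low = sum(1 for s in scores if s < threshold)
    return low <= max_low


def pass_fraction(lines, n_reads=None):
    """Fraction of reads passing the filter among the first n_reads records.

    Assumes four lines per record (no wrapped sequence or quality lines).
    """
    lines = iter(lines)
    total = kept = 0
    while n_reads is None or total < n_reads:
        record = list(islice(lines, 4))
        if len(record) < 4:
            break  # end of input or truncated final record
        total += 1
        if passes_filter(solexa_scores(record[3].rstrip("\n"))):
            kept += 1
    return kept / total if total else float("nan")


# Hypothetical usage on the first 1e6 reads of one lane:
# with open("lane1.fastq") as fh:
#     print(pass_fraction(fh, n_reads=10**6))
```

Taking the first n_reads records has the same possible bias as my sample;
drawing records at random from the whole file would be one way around it.
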
Is there any reason to regard a call with score < -4 as much different
from an 'N'?
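
To put numbers on the question, here is a small Python sketch that
converts a Solexa score to the implied probability that the call is
correct. It assumes the usual Solexa definition, Q = 10 * log10(p / (1 - p))
with p the probability of the called base; the function names are mine.

```python
import math


def solexa_to_p_correct(q):
    """Probability that the called base is correct, given Solexa score q.

    Assumes Q = 10 * log10(p / (1 - p)), i.e. a log-odds scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-q / 10.0))


def p_correct_to_solexa(p):
    """Solexa score implied by probability p of a correct call (0 < p < 1)."""
    return 10.0 * math.log10(p / (1.0 - p))


for q in (-5, -4, 0, 10):
    print(q, round(solexa_to_p_correct(q), 3))
```

Under that assumption a score of -4 corresponds to p of about 0.28 and a
score of -5 to about 0.24, while a uniform guess over four bases (p = 0.25)
corresponds to a score of about -4.77. So an integer score below -4 would
put the called base at roughly chance level.
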
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing