Looking at the range of the numerical quality values should usually suffice to guess the Illumina/Sanger scale. Sanger (Phred) scores typically run from 0 to 40; Solexa is -5 to 60(?). In any case, if you have negative values you're dealing with a Solexa FASTQ file.
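
A minimal sketch of that range check, assuming the qualities are still in their raw ASCII form and that the usual FASTQ offsets apply (33 for Sanger, 64 for Solexa). The function name and the returned labels are hypothetical, chosen only for illustration. At offset 64, the negative Solexa scores -5 to -1 show up as character codes 59 to 63, and anything lower than that can only be a Sanger file.

```python
def guess_fastq_scale(quality_strings):
    """Guess the quality encoding from the lowest ASCII code observed.

    Hypothetical helper: assumes offset 33 for Sanger and offset 64 for Solexa.
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest = min(codes)
    if lowest < 59:
        # Below ';' cannot be Solexa (64 - 5 = 59), so offset 33 is implied.
        return "sanger"
    if lowest < 64:
        # Codes 59..63 decode to Solexa scores -5..-1 at offset 64.
        return "solexa"
    # No negative values seen: the range alone does not settle the scale.
    return "undetermined"
```

When no negative value turns up, the sketch deliberately declines to guess, since a file with only non-negative scores could be on either scale.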

Vincent Carey wrote:
thanks to both Martin and Cei -- clearly I have to have the scale right,
and it is my hope to do a bit of analysis of the quality score distributions
and of decision-making using these -- positional effects are clearly of interest.

On Wed, Apr 15, 2009 at 6:25 PM, Cei Abreu-Goodger <[email protected]> wrote:

    Hi Vincent,

    Are you taking into account that quality scores will tend to drop
    off towards the end of the run? I would probably restrict any sort
    of quality filtering to the first x bases of each read... From my
    experience, only a very small fraction of reads out of a "good" run
    would be removed due to general quality issues. Also, if your
    further pipeline is "quality-aware" (e.g. MAQ/bowtie for alignments)
    you can get away with not worrying initially about the quality of
    the reads. On the other hand, for some kinds of analysis I was
    dropping the quality scores and making plain fasta files. In these
    cases it would pay off to convert very low-quality bases to Ns,
    since I would get better coverage.

    Cheers,

    Cei
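
The conversion of very low-quality bases to Ns before writing plain FASTA could look roughly like the sketch below. It assumes the quality scores have already been decoded to integers; the function name is hypothetical, and the default threshold of -4 is borrowed from the question further down the thread rather than being a value recommended above.

```python
def mask_low_quality(sequence, scores, threshold=-4):
    """Return the read with every base scoring below `threshold` replaced by N.

    Hypothetical helper; `scores` holds one integer quality per base.
    """
    if len(sequence) != len(scores):
        raise ValueError("sequence and scores must have the same length")
    return "".join(
        "N" if q < threshold else base
        for base, q in zip(sequence, scores)
    )
```

For example, `mask_low_quality("ACGT", [30, -5, 10, -4])` masks only the second base, because -4 itself is not below the threshold.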

    Vincent Carey wrote:

        i have scoured our archives and found little regarding role of
        solexa quality scores as reported in fastq outputs in short read
        filtering.

        my understanding is that a numerical score of -4 or greater
        indicates more probability mass on the called base than on any
        other.  in checking 1e6 reads on each of two lanes i found the
        frequency of the event "fewer than three bases have score less
        than -4" to be 4e-3 in one lane and 2e-3 in another.  in other
        words, filtering by requiring no more than two < -4 scores would
        take you from a million reads to about 2000-4000, assuming i have
        not taken a biased sample (i may have, just took the first 1e6 in
        fastq).

        is there any reason to regard a call with score < -4 to be much
        different from an 'N'?
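
To make the numbers in this question concrete, here is a small sketch under the usual Solexa definition Q = -10 * log10(e / (1 - e)), where e is the error probability of the call. Under that definition a score of -4 gives the called base a probability of roughly 0.28, and -5 roughly 0.24. The function names are hypothetical; the cutoff of -4 and the limit of two low-scoring bases per read are the ones described above.

```python
def solexa_to_p_correct(q):
    """Probability that the called base is right, given a Solexa score q.

    Inverts Q = -10 * log10(e / (1 - e)), where e is the error probability.
    """
    return 1.0 / (1.0 + 10 ** (-q / 10.0))


def count_below(scores, cutoff=-4):
    """Number of bases in a read whose score is strictly below `cutoff`."""
    return sum(1 for q in scores if q < cutoff)


def passes_filter(scores, cutoff=-4, max_low=2):
    """Keep a read only if it has no more than `max_low` scores below `cutoff`."""
    return count_below(scores, cutoff) <= max_low
```

Applying `passes_filter` to each read's decoded scores and counting the reads that survive is one way to reproduce the per-lane frequencies quoted above.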


        _______________________________________________
        Bioc-sig-sequencing mailing list
        [email protected]
        https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing



-- The Wellcome Trust Sanger Institute is operated by Genome Research
    Limited, a charity registered in England with number 1021457 and a
    company registered in England with number 2742969, whose registered
    office is 215 Euston Road, London, NW1 2BE.




--
Vincent Carey, PhD
Biostatistics, Channing Lab
617 525 2265

