Re: [Genome] pre-query on SNP data schema

Nathaniel Pearson Thu, 10 Sep 2009 13:38:00 -0700

Hi Angie,

Wow, thanks for your informative replies!  My turn to digest for a bit, to
make sure I understand everything.


Thanks,

Nathan

On Thu, Sep 10, 2009 at 2:39 PM, Angie Hinrichs <[email protected]> wrote:

> Hi Nathan,
>
> I will answer each of your questions below, but I'll begin with some
> information that applies to many of the questions.
>
> First, start and end coordinates in any UCSC database table, including
> snp tables regardless of class, use a numbering scheme that we call
> 0-based, half open, as opposed to the more intuitive 1-based, fully
> closed numbering system that is most commonly used elsewhere.  0-based
> means that the first base of the chromosome is 0.  Half open means
> that the end coordinate is for the base after the last base included
> in the region (end = 0-based index of last base + 1).  That addition
> of 1 due to the open end makes the end coordinates appear 1-based.  In
> order to convert UCSC coordinates to 1-based, fully closed, simply add
> 1 to the start coord.
>
> We use that numbering system internally because it makes coordinate
> arithmetic easier, and that reduces the number of bugs in our code.
> 0-based numbering is more natural for programmers.  Due to the
> half-open coords, the size of an item is end-start, the start of an
> intron is equal to the end of the previous exon, and so on -- none of
> the +1 / -1 fencepost conditions of fully closed coords.
>
> Coordinates displayed in the Genome Browser web pages are 1-based,
> fully closed as people would naturally expect -- we add 1 to
> chromStart when printing out a position range, to mask the
> non-intuitive numbering system of our database tables.
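
A minimal Python sketch of the coordinate arithmetic described above. The function names are illustrative only and are not part of any UCSC tool; the conversions follow the rules stated in the message (add 1 to the start, leave the end alone, size is end minus start).

```python
def to_one_based_closed(chrom_start, chrom_end):
    """Convert 0-based, half-open coordinates to 1-based, fully closed."""
    return chrom_start + 1, chrom_end


def to_zero_based_half_open(start, end):
    """Convert 1-based, fully closed coordinates back to 0-based, half-open."""
    return start - 1, end


def item_size(chrom_start, chrom_end):
    """Size of an item in bases; half-open coords need no +1 correction."""
    return chrom_end - chrom_start


# The first base of a chromosome is stored as (0, 1) in a table.
print(to_one_based_closed(0, 1))
print(item_size(0, 1))
```

Note that a zero-length item (chromStart equal to chromEnd, as for an insertion point) has size 0, and its 1-based form has a start one greater than its end.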
>
> Second, the inconsistencies between refUCSC and class/observed make
> more sense in light of how those values are independently derived.
> Also, we flag those inconsistencies in the table snpNNNExceptions
> (snp129Exceptions, snp130Exceptions), and snpNNNExceptions can be used
> to filter out suspect mappings from snpNNN.
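
One way to apply that filter once both tables have been loaded into memory, sketched in Python. This assumes that each exception row repeats the chrom, chromStart, chromEnd and name columns of the snpNNN row it flags; that layout is an assumption to check against the actual schema, and the function name is hypothetical.

```python
def filter_suspect_mappings(snp_rows, exception_rows):
    """Drop every SNP mapping that has at least one flagged exception.

    Both inputs are iterables of dicts. A mapping is identified here by
    (chrom, chromStart, chromEnd, name), which is an assumed key.
    """
    def key(row):
        return (row["chrom"], row["chromStart"], row["chromEnd"], row["name"])

    flagged = {key(row) for row in exception_rows}
    return [row for row in snp_rows if key(row) not in flagged]
```

Keying on the coordinates as well as the rs ID keeps the unflagged mappings of a SNP that maps to more than one location.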
>
> The fundamental data in dbSNP are submissions from wet labs (or
> consortia) that have observed polymorphisms when sequencing multiple
> individuals.  The labs submit flanking sequences (sequences to the
> left and right of the polymorphic site), observed alleles, and
> information about the sample population, technology etc.  Each
> submission is given an ID that begins with ss (submitted SNP) and ends
> with a number.
>
> dbSNP then maps all flanking sequences to the reference genome and
> several alternate reference genomes using a complex process -- flanks
> are aligned separately and then custom-processed back into pairs; then
> each submitted SNP's mapping is whatever falls between its paired
> flank mappings on the reference or alternate genome.  Using the
> mappings of submitted SNPs, dbSNP clusters the submitted SNPs into
> reference SNPs.  Each reference SNP has a stable ID composed of "rs"
> and a number.
>
> The observed alleles in our snpNNN tables are taken directly from the
> original submissions.  I believe class is a function of the observed
> alleles collected from all submitted SNPs in each reference SNP
> cluster.  However, refUCSC is simply the genomic bases that appear at
> dbSNP's mapped coordinates.  refUCSC is not a function of observed
> alleles, but rather of the submitted flanking sequences and dbSNP's
> mapping process.  Inconsistencies do arise, and then more
> investigation is needed in order to determine which piece of
> information is wrong.
>
> dbSNP itself includes some measures of confidence in each SNP mapping.
> Each reference SNP is assigned a weight: 1 means that it has a unique
> mapping, 2 means that it has a couple or a few, and 3 means that it
> has many mappings.  Multiple mappings reduce confidence in a SNP --
> are we seeing real polymorphism, or just almost-identical pieces of
> the genome?  dbSNP also describes its alignments with a "locus type"
> (locType in our snpNNN tables), with 6 possible values.  The first 3
> simply describe whether there are >1, 1 or 0 bases between the mapped
> flanks.  Inconsistencies between class and locType (and thus refUCSC
> as you have noted) are easy to identify, and we flag them in
> snpNNNExceptions.  The latter 3 locType values are more indicative of
> difficulty in mapping: they indicate that there was a gap in the
> alignment of flanking sequence adjacent to the polymorphic site.  We
> flag the occurrences of the latter 3 types in snpNNNExceptions, and on
> the genome browser details page for a SNP, we suggest that the user
> inspect our local re-alignment of the flanking sequences to the
> neighboring genomic sequence.  Sometimes it appears that a better
> mapping would have been possible, and perhaps the given coordinates
> are not correct.
>
> Finally, UCSC takes only a slice of the massive dbSNP database: we
> show only the clustered reference SNPs that have been mapped to the
> reference genome on which the browser is built, with a weight of 1, 2
> or 3.  We discard mappings of SNPs to alternates such as the Celera or
> Venter genome sequences, and we discard SNPs that are not mapped to
> any ref/alt, or SNPs that are mapped to so many locations that they
> are assigned a weight greater than 3.  Also, we store a small subset
> of the types of data stored in dbSNP.  If you download the entire
> dbSNP, you can drill down further into the evidence for each SNP but
> the learning curve is even steeper.
>
>
> > In our group's efforts to accurately parse UCSC human SNP records,
> > several small puzzles have emerged.  First, a few general questions
> > about SNP records of any class:
> >
> > 1) As a value in the 'refUCSC' field, does "-" always simply denote a
> > gap relative to a subject (or some other reference) genome?
>
> Yes.  Ultimately it means that the flanking sequence alignments are
> contiguous on the reference genome.
>
>
> > 2) Is the value in the 'refUCSC' field, or its reverse complement (if
> > and only if "-" is the value of 'strand' but not the value of
> > 'refUCSC'?), always at least an implied value of the 'observed' field,
> > in addition to any other value(s) listed there?  That is, for entries
> > where the values of 'observed' do not include either the value of
> > 'refUCSC' or its reverse complement, are we to presume that the
> > missing value would be included in a more verbose population of the
> > 'observed' field?
>
> Yes -- or that there might have been an issue in mapping flanking
> sequences to the genome.
>
>
> > 3) Are all values for 'observed' given as part of the strand specified
> > in the 'strand' field, while values in 'refUCSC' are given as part of
> > the plus strand, regardless of the value listed in the 'strand' field?
>
> Yes -- observed sequences come from submitters, and they may align to
> either strand of the reference genome.
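
A rough Python sketch of the strand handling in questions 2 and 3: bring the 'observed' alleles onto the plus strand, then test whether 'refUCSC' is among them. The function names are made up for illustration. The complement table includes the IUPAC ambiguity codes, since those can appear in 'observed'.

```python
# Complement table for plain bases plus IUPAC ambiguity codes.
_COMPLEMENT = str.maketrans("ACGTRYKMSWBDHVN", "TGCAYRMKSWVHDBN")


def reverse_complement(seq):
    """Reverse-complement a nucleotide string (IUPAC-aware)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def observed_on_plus_strand(observed, strand):
    """Split 'observed' on '/' and express each allele on the plus strand."""
    alleles = observed.split("/")
    if strand == "-":
        # The gap allele "-" has no sequence to reverse-complement.
        alleles = [a if a == "-" else reverse_complement(a) for a in alleles]
    return alleles


def ref_in_observed(ref_ucsc, observed, strand):
    """True/False if refUCSC is (not) among the observed alleles.

    Returns None when the alleles are unknown ("lengthTooLong").
    """
    if observed == "lengthTooLong":
        return None
    return ref_ucsc in observed_on_plus_strand(observed, strand)
```

With the values quoted in this thread, rs28383030 (strand "-", refUCSC "A", observed "-/T") comes out consistent, while rs17551353 (strand "-", refUCSC "-", observed "C/G") does not.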
>
>
> > 4) What mapping, if any, holds between the allele state listed in
> > 'refUCSC' and the ancestral (versus derived) allele state for that
> > SNP?
>
> I would not expect a correlation.  The reference genome is a mosaic of
> ten individual genomes -- better than one, but still very few samples.
> Without an outgroup to human, I don't think we can determine ancestral
> state anyway.  Finally, the accuracy of refUCSC also depends on the
> accuracy of the mapping of flanking sequences, so anything odd about
> the mapping reduces confidence in chromStart, chromEnd and refUCSC.
>
> We have another track from Human Genome Diversity Project (HGDP Allele
> Freq), where they sampled fewer SNPs (~660,000) on many individuals,
> and they attempted to guess an ancestral allele -- but by using the
> allele found in alignments of the human reference genome to the
> reference genome of a single chimp, IIRC.
>
> I believe several research groups have made more sophisticated
> attempts to determine ancestral states, including researchers in
> UCSC's own Haussler group, and Javier Herrero at Sanger.
>
>
> > Next a few questions about SNP records of class "single":
> >
> > 1) For a class "single" SNP entry, it appears that the value of
> > 'chromStart' equals that of 'chromEnd' if and only if the value of
> > 'refUCSC' is "-" (e.g., rs3542401, rs1755135).  Yet the allele states
> > (always multiple) listed in the observed field never include "-", but
> > instead always appear as single bases (e.g., "A/T").  How does the "-"
> > value in 'refUCSC' relate to the multiple allele state values in
> > 'observed'?
>
> Class seems to be a function of observed -- class is single if and
> only if all observed alleles are single-base.  (Note: sometimes IUPAC
> ambiguous bases such as R, K etc are given.)  However, chromStart and
> chromEnd are determined by dbSNP from their alignment of flanking
> sequences to the reference genome.  The bases between those
> coordinates determine refUCSC.  When the reported SNP and the mapping
> to the reference produce inconsistent values, that reduces confidence
> in the mapped SNP.
>
> When class is single but refUCSC is not a single base, an exception
> (either SingleClassZeroSpan or SingleClassLongerSpan) is stored in
> snpNNNExceptions, and described on the SNP details page.
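
The check for class "single" can be approximated from the coordinates alone, as in the Python sketch below. The two exception names are the ones given above; mapping a span of 0 to SingleClassZeroSpan and a span greater than 1 to SingleClassLongerSpan is an assumption based on the names, and the function itself is hypothetical rather than UCSC's own code.

```python
def single_class_span_exception(chrom_start, chrom_end):
    """Return the expected exception name for a class "single" SNP, or None.

    Coordinates are 0-based, half-open, so a consistent single-base
    mapping has chrom_end - chrom_start == 1.
    """
    span = chrom_end - chrom_start
    if span == 0:
        return "SingleClassZeroSpan"
    if span > 1:
        return "SingleClassLongerSpan"
    return None
```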
>
>
> > 2) Given that the 'chromStart' and 'chromEnd' values are
> > equal only where the value of 'refUCSC' is "-", are we right to infer
> > that such cases represent single-base insertions/deletions, while all
> > other class "single" cases represent single-base substitutions?  If
> > this interpretation is right, why is SNP rs17551353 (strand = -;
> > refUCSC = -; observed = C/G) classified as class "single", while SNP
> > rs28383030 (strand = "-"; refUCSC = "A"; observed = "-/T") is
> > classified as class "in-del"?
>
> Class is determined from observed before the mapping to the reference
> genome.  refUCSC is a function of the mapping of flanking sequences.
> They can have inconsistent results, and more info is needed to resolve
> each case.
>
>
> > 3) In some class "single" entries (e.g., rs5869813, rs61556558), the
> > value of 'refUCSC' is a multibase string, but each allele state in
> > 'observed' is a single base.  How are such entries (specifically, the
> > multibase value of 'refUCSC') to be interpreted, especially for
> > parsing which (class =...) "single" base is the site of variation?  In
> > what cases, if any (or all), are we to infer that the site of
> > variation for a class "single" SNP is 'chromStart'+1?
>
> In those cases, dbSNP's mapping would have us believe that *all* bases
> in refUCSC are replaced by the single bases in observed, i.e. the SNP
> is a deletion from the reference genome and the reference genome's
> allele simply wasn't reported by any submitters.  Interestingly, your
> two example SNPs are in build 129 but not build 130 -- they both have
> been merged into overlapping single-base SNPs whose mappings are
> single-base as expected.  I think this means that dbSNP identified and
> corrected a problem either with some flanking sequences or with their
> mapping algorithm.
>
> However, snp130 still has examples such as rs72497839.  If you view
> that in the genome browser, and click on it to see its details page,
> then look at the notes in the "Annotations" section and also at the
> details page's re-alignment of flanking sequences to the neighboring
> genome sequence.  Lots of gaps in the alignment of the 5' flanking
> sequence... so my confidence in the mapping is reduced.
>
>
> > Next, a few questions about SNP records of class "in-del":
> >
> > 1) Is the identity of the ancestral allele invoked in further
> > classifying a class "in-del" SNP as either class "insertion" or class
> > "deletion"?  If not, what is the basis/purpose of this
> > subclassification?
>
> Insertion and deletion are UCSC's local additions.  dbSNP has only
> class in-del.  In some cases, we see that the reference genome has 0
> bases but some observed alleles are >0 bases.  All of those would be
> an insertion into the reference genome, so we call it an insertion.
> Conversely, sometimes the reference allele has more bases than any
> observed allele (except itself).  If dbSNP calls it an in-del, we call
> it a deletion.
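
The insertion/deletion rule described above can be sketched in Python as follows. This is an approximation under stated assumptions, not UCSC's actual code: the function names are hypothetical, the observed alleles are assumed to have been put on the plus strand already, and "lengthTooLong" entries are simply left as in-del.

```python
def allele_length(allele):
    """Length of an allele in bases; the gap allele "-" counts as 0."""
    return 0 if allele == "-" else len(allele)


def subclassify_in_del(ref_ucsc, observed_plus):
    """Refine class "in-del" into "insertion" or "deletion" where possible.

    observed_plus: '/'-separated alleles, assumed to be on the plus strand.
    """
    if observed_plus == "lengthTooLong":
        return "in-del"  # alleles unknown; nothing to compare
    ref_len = allele_length(ref_ucsc)
    # "Except itself": ignore the observed allele that equals the reference.
    others = [a for a in observed_plus.split("/") if a != ref_ucsc]
    if ref_len == 0 and any(allele_length(a) > 0 for a in others):
        return "insertion"
    if others and all(allele_length(a) < ref_len for a in others):
        return "deletion"
    return "in-del"
```

For example, a refUCSC of "-" with observed "-/C" yields "insertion", and a refUCSC of "AT" with observed "-/AT" yields "deletion".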
>
>
> > 2) Just as for class "single" entries, "-" may appear as the value of
> > the 'refUCSC' field, but not as a value of 'observed' for that entry.
> > Are such cases always also of class "insertion"?
>
> Yes.
>
>
> > 3) When the values of 'chromStart' and 'chromEnd' are equal, the
> > value of 'observed' appears to always be "lengthTooLong";
>
> Not always -- e.g. rs56289060 has chromStart==chromEnd, class
> insertion, but observed is -/C.
>
>
> > by contrast, when the 'chromStart' and 'chromEnd' values are not
> > equal, the value of observed may or may not be "lengthTooLong".
>
> Yes.
>
>
> > Is every entry with "lengthTooLong" in the observed field to be
> > interpreted as an allelism in which the two possible allele states
> > are a too-long-to-be-reliably-sequenced motif versus a gap?
>
> I believe 'lengthTooLong' means too long for the file format from
> which we grab it.
>
>
> > Is the specific nucleotide sequence of that motif stored somewhere
> > in the database?  If not, how, if at all, can we find its value?
>
> It might be stored in one of the many tables of dbSNP that we do not
> use -- you can ask the dbSNP team at [email protected] .
>
>
> > 4) How, if at all, does the value of the 'strand' field affect the
> > interpretation of the "lengthTooLong" value listed in the observed
> > field, and/or of the "-" value listed in the 'refUCSC' field?
>
> It doesn't -- if we don't know the observed alleles, then we have
> nothing to reverse-complement.  If the re-alignment of flanking
> sequences to the reference genome looks reasonable, then perhaps
> refUCSC is one of the observed alleles.
>
>
> > 5) In some cases (e.g., rs10605661), the 'observed' field contains a
> > "-" value and/or a multinucleotide value, but the 'refUCSC' field
> > contains only a single-base value.  Why is the 'refUCSC' value not one
> > of the values listed in 'observed'?
>
> Again, it all boils down to dbSNP's mapping of the submitted flanking
> sequences to the reference genome.
>
>
> > 6) Is the variable segment in a class "in-del" SNP always the segment
> > that starts at position 'chromStart' + 1 and continues through
> > position 'chromEnd' (even when the value of strand is "-"), or are
> > there other rules for inferring exactly which positions vary?
>
> In the intuitive 1-based, fully closed numbering system, yes, the
> mapped variable region of a SNP of *any* class is chromStart+1 to
> chromEnd.
>
>
> > For class "insertion":
> >
> > Are the following inferences right (and, if not, please advise re.
> > correct interpretation)?:
> >
> > 1) Every class "insertion" SNP has exactly two allele state values in
> > 'observed'.
>
> There is no theoretical reason why that should be absolute, but
> interestingly that holds for snp129.  It does not hold for snp130 --
> e.g. the new-in-130 rs72542761 is an insertion into the reference
> genome (chromStart=chromEnd) with observed = C/T/TTACTGA.
>
>
> > 2) "-" appears as the value of 'refUCSC', and as a value of
> > 'observed', if and only if the value of 'chromStart' equals the value
> > of 'chromEnd'.
>
> If chromStart==chromEnd, "-" is what we put in refUCSC by convention.
> "-" may or may not appear in observed.  (again, snp129 may differ from
> snp130 here.)
>
>
> > 3) If the value of 'chromStart' does not equal the value of
> > 'chromEnd', and the length of some non-'refUCSC' allele listed in
> > 'observed' equals the quantity 'chromEnd' - 'chromStart', then the
> > subject and reference genomes align with no local gap, and the subject
> > genome has that non-'refUCSC' allele substituted for the reference
> > positions 'chromStart+1' to 'chromEnd'.
>
> Yes.
>
>
> > 4) If the value of 'chromStart' does not equal the value of 'chromEnd',
> > and the length of some non-'refUCSC' allele listed in 'observed'
> > exceeds the quantity 'chromEnd - chromStart', then the reference
> > genome contains a local gap relative to the subject genome, and the
> > subject genome has that non-refUCSC allele substituted for
> > gap-inclusive reference positions 'chromStart+1' to 'chromEnd'.
>
> Yes, it is a substitution.  (It's possible that the reference genome
> allele could be a subsequence of the observed -- for example, if
> refUCSC is GT and the observed is GTAA.  In that case, an alignment
> tool would probably not use the gap character, but would call it a
> pure insertion of AA.)
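
The GT versus GTAA case can be detected with a simple prefix test, sketched here in Python. The function name is hypothetical, both alleles are assumed to be on the same strand, and only the case where the reference allele is a leading prefix of the observed allele is handled; a reference allele embedded elsewhere in the observed allele would need a real alignment.

```python
def pure_insertion_suffix(ref_allele, observed_allele):
    """If observed_allele is ref_allele plus extra trailing bases, return
    those extra bases; otherwise return None."""
    if (len(observed_allele) > len(ref_allele)
            and observed_allele.startswith(ref_allele)):
        return observed_allele[len(ref_allele):]
    return None
```

Called with "GT" and "GTAA", this returns "AA", matching the example above.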
>
>
> > For class "deletion":
> >
> > Are the following inferences right (and, if not, please advise
> > re. correct interpretation)?:
> >
> > 1) No class "deletion" SNP has equal 'chromStart' and 'chromEnd'
> > values.
>
> Yes.
>
>
> > 2) No class "deletion" SNP has "-" as a 'refUCSC' value.
>
> Yes (corollary of 1).
>
>
> > 3) Every class "deletion" SNP with "-" as a value in 'observed' has
> > exactly one other 'observed' value; other cases (in which "-" is not
> > listed as a value in 'observed') may have more than two possible
> > allele states listed in 'observed'.
>
> This seems to hold for snp129 and snp130 but I would not count on it
> to hold forever -- someday there could well be a deletion SNP with
> "-" and two other observed alleles.
>
>
> > 4) If the length of some non-'refUCSC' allele listed in 'observed'
> > equals the quantity 'chromEnd' - 'chromStart', then the subject and
> > reference genomes align with no local gap, and the subject genome has
> > that non-'refUCSC' allele substituted for reference positions
> > 'chromStart+1' to 'chromEnd'.
>
> Yes.
>
>
> > 5) If the length of some non-'refUCSC' allele listed in 'observed' is
> > less than the quantity 'chromEnd - chromStart', then the subject
> > genome contains a local gap relative to the reference genome, and the
> > subject genome has that non-'refUCSC' allele substituted for reference
> > positions 'chromStart+1' to 'chromEnd'.
>
> Yes, it is a substitution of unequal sizes, but as above, the smaller
> sequence could be a subsequence of the larger sequence.
>
> Hope that helps, and please send more questions to [email protected]
> as you have them.
>
> Angie
>
>
>
>


-- 
Nathaniel Pearson, PhD
Bioinformatics Scientist
Knome, Inc.
101 Main St, Fl 16
Cambridge, MA 02142
USA
-------
Tel. 617.528.2157
Fax 617.528.2199
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
