Hi Nathan, I will answer each of your questions below, but I'll begin with some information that applies to many of the questions.

First, start and end coordinates in any UCSC database table, including snp tables regardless of class, use a numbering scheme that we call 0-based, half open, as opposed to the more intuitive 1-based, fully closed numbering system that is most commonly used elsewhere. 0-based means that the first base of the chromosome is 0. Half open means that the end coordinate is for the base after the last base included in the region (end = 0-based index of last base + 1). That addition of 1 due to the open end makes the end coordinates appear 1-based. In order to convert UCSC coordinates to 1-based, fully closed, simply add 1 to the start coord.

We use that numbering system internally because it makes coordinate arithmetic easier, and that reduces the number of bugs in our code. 0-based numbering is more natural for programmers. Due to the half-open coords, the size of an item is end-start, the start of an intron is equal to the end of the previous exon, and so on -- none of the +1 / -1 fencepost conditions of fully closed coords. Coordinates displayed in the Genome Browser web pages are 1-based, fully closed as people would naturally expect -- we add 1 to chromStart when printing out a position range, to mask the non-intuitive numbering system of our database tables.

Second, the inconsistencies between refUCSC and class/observed make more sense in light of how those values are independently derived. Also, we flag those inconsistencies in the table snpNNNExceptions (snp129Exceptions, snp130Exceptions), and snpNNNExceptions can be used to filter out suspect mappings from snpNNN.

The fundamental data in dbSNP are submissions from wet labs (or consortia) that have observed polymorphisms when sequencing multiple individuals. The labs submit flanking sequences (sequences to the left and right of the polymorphic site), observed alleles, and information about the sample population, technology etc.
Each submission is given an ID that begins with ss (submitted SNP) and ends with a number. dbSNP then maps all flanking sequences to the reference genome and several alternate reference genomes using a complex process -- flanks are aligned separately and then custom-processed back into pairs; then each submitted SNP's mapping is whatever falls between its paired flank mappings on the reference or alternate genome. Using the mappings of submitted SNPs, dbSNP clusters the submitted SNPs into reference SNPs. Each reference SNP has a stable ID composed of "rs" and a number.

The observed alleles in our snpNNN tables are taken directly from the original submissions. I believe class is a function of the observed alleles collected from all submitted SNPs in each reference SNP cluster. However, refUCSC is simply the genomic bases that appear at dbSNP's mapped coordinates. refUCSC is not a function of observed alleles, but rather of the submitted flanking sequences and dbSNP's mapping process. Inconsistencies do arise, and then more investigation is needed in order to determine which piece of information is wrong.

dbSNP itself includes some measures of confidence in each SNP mapping. Each reference SNP is assigned a weight: 1 means that it has a unique mapping, 2 means that it has a couple or a few, and 3 means that it has many mappings. Multiple mappings reduce confidence in a SNP -- are we seeing real polymorphism, or just almost-identical pieces of the genome? dbSNP also describes its alignments with a "locus type" (locType in our snpNNN tables), with 6 possible values. The first 3 simply describe whether there are >1, 1 or 0 bases between the mapped flanks. Inconsistencies between class and locType (and thus refUCSC as you have noted) are easy to identify, and we flag them in snpNNNExceptions. The latter 3 locType values are more indicative of difficulty in mapping: they indicate that there was a gap in the alignment of flanking sequence adjacent to the polymorphic site.
We flag the occurrences of the latter 3 types in snpNNNExceptions, and on the genome browser details page for a SNP, we suggest that the user inspect our local re-alignment of the flanking sequences to the neighboring genomic sequence. Sometimes it appears that a better mapping would have been possible, and perhaps the given coordinates are not correct.

Finally, UCSC takes only a slice of the massive dbSNP database: we show only the clustered reference SNPs that have been mapped to the reference genome on which the browser is built, with a weight of 1, 2 or 3. We discard mappings of SNPs to alternates such as the Celera or Venter genome sequences, and we discard SNPs that are not mapped to any ref/alt, or SNPs that are mapped to so many locations that they are assigned a weight greater than 3. Also, we store a small subset of the types of data stored in dbSNP. If you download the entire dbSNP, you can drill down further into the evidence for each SNP but the learning curve is even steeper.

> In our group's efforts to accurately parse UCSC human SNP records,
> several small puzzles have emerged. First, a few general questions
> about SNP records of any class:
>
> 1) As a value in the 'refUCSC' field, does "-" always simply denote a
> gap relative to a subject (or other some other reference) genome?

Yes. Ultimately it means that the flanking sequence alignments are contiguous on the reference genome.

> 2) Is the value in the 'refUCSC' field, or its reverse complement (if
> and only if "-" is the value of 'strand' but not the value of
> 'refUCSC'?), always at least an implied value of the 'observed' field,
> in addition to any other value(s) listed there? That is, for entries
> where the values of 'observed' do not include either the value of
> 'refUCSC' or its reverse complement, are we to presume that the
> missing value would be included in a more verbose population of the
> 'observed' field?

Yes -- or that there might have been an issue in mapping flanking sequences to the genome.

> 3) Are all values for 'observed' given as part of the strand specified
> in the 'strand' field, while values in 'refUCSC' are given as part of
> the plus strand, regardless of the value listed in the 'strand' field?

Yes -- observed sequences come from submitters, and they may align to either strand of the reference genome.

> 4) What mapping, if any, holds between the allele state listed in
> 'refUCSC' and the ancestral (versus derived) allele state for that
> SNP?

I would not expect a correlation. The reference genome is a mosaic of ten individual genomes -- better than one, but still very few samples. Without an outgroup to human, I don't think we can determine ancestral state anyway. Finally, the accuracy of refUCSC also depends on the accuracy of the mapping of flanking sequences, so anything odd about the mapping reduces confidence in chromStart, chromEnd and refUCSC.

We have another track from Human Genome Diversity Project (HGDP Allele Freq), where they sampled fewer SNPs (~660,000) on many individuals, and they attempted to guess an ancestral allele -- but by using the allele found in alignments of the human reference genome to the reference genome of a single chimp, IIRC. I believe several research groups have made more sophisticated attempts to determine ancestral states, including researchers in UCSC's own Haussler group, and Javier Herrero at Sanger.

> Next a few questions about SNP records of class "single":
>
> 1) For a class "single" SNP entry, it appears that the value of
> 'chromStart' equals that of 'chromEnd' if and only if the value of
> 'refUCSC' is "-" (e.g., rs3542401, rs1755135). Yet the allele states
> (always multiple) listed in the observed field never include "-", but
> instead always appear as single bases (e.g., "A/T"). How does the "-"
> value in 'refUCSC' relate to the multiple allele state values in
> 'observed'?

Class seems to be a function of observed -- class is single if and only if all observed alleles are single-base. (Note: sometimes IUPAC ambiguous bases such as R, K etc are given.) However, chromStart and chromEnd are determined by dbSNP from their alignment of flanking sequences to the reference genome. The bases between those coordinates determine refUCSC. When the reported SNP and the mapping to the reference produce inconsistent values, that reduces confidence in the mapped SNP. When class is single but refUCSC is not a single base, an exception (either SingleClassZeroSpan or SingleClassLongerSpan) is stored in snpNNNExceptions, and described on the SNP details page.

> 2) Given that the values of 'chromStart' and 'chromEnd' values are
> equal only where the value of 'refUCSC' is "-", are we right to infer
> that such cases represent single-base insertions/deletions, while all
> other class "single" cases represent single-base substitutions? If
> this interpretation is right, why is SNP rs17551353 (strand = -;
> refUCSC = -; observed = C/G) classified as class "single", while SNP
> rs28383030 (strand = "-"; refUCSC = "A"; observed = "-/T") is
> classified as class "in-del"?

Class is determined from observed before the mapping to the reference genome. refUCSC is a function of the mapping of flanking sequences. They can have inconsistent results, and more info is needed to resolve each case.

> 3) In some class "single" entries (e.g., rs5869813, rs61556558), the
> value of 'refUCSC' is a multibase string, but each allele state in
> 'observed' is a single base. How are such entries (specifically, the
> multibase value of 'refUCSC') to be interpreted, especially for
> parsing which (class =...) "single" base is the site of variation? In
> what cases, if any (or all), are we to infer that the site of
> variation for a class "single" SNP is 'chromStart'+1?

In those cases, dbSNP's mapping would have us believe that *all* bases in refUCSC are replaced by the single bases in observed, i.e. the SNP is a deletion from the reference genome and the reference genome's allele simply wasn't reported by any submitters. Interestingly, your two example SNPs are in build 129 but not build 130 -- they both have been merged into overlapping single-base SNPs whose mappings are single-base as expected. I think this means that dbSNP identified and corrected a problem either with some flanking sequences or with their mapping algorithm.

However, snp130 still has examples such as rs72497839. View that SNP in the genome browser, click on it to see its details page, then look at the notes in the "Annotations" section and also at the details page's re-alignment of flanking sequences to the neighboring genome sequence. Lots of gaps in the alignment of the 5' flanking sequence... so my confidence in the mapping is reduced.

> Next, a few questions about SNP records of class "in-del":
>
> 1) Is the identity of the ancestral allele invoked in further
> classifying a class "in-del" SNP as either class "insertion" or class
> "deletion"? If not, what is the basis/purpose of this
> subclassification?

Insertion and deletion are UCSC's local additions. dbSNP has only class in-del. In some cases, we see that the reference genome has 0 bases but some observed alleles are >0 bases. All of those would be an insertion into the reference genome, so we call it an insertion. Conversely, sometimes the reference allele has more bases than any observed allele (except itself). If dbSNP calls it an in-del, we call it a deletion.

> 2) Just as for class "single" entries, "-" may appear as the value of
> the 'refUCSC' field, but not as a value of 'observed' for that entry.
> Are such cases always also of class "insertion"?

Yes.

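The insertion/deletion refinement described above can be sketched roughly as follows. This is only an illustration of the rule as stated in this reply, not UCSC's actual code: the function names (`reverse_complement`, `plus_strand_alleles`, `refine_indel`) are made up for the example, and it assumes that observed alleles are "/"-separated, that "-" stands for zero bases, and that observed is reported on the strand given in the strand field while refUCSC is on the plus strand. IUPAC ambiguity codes are not handled.

```python
# Hypothetical helpers, not part of any UCSC or dbSNP library.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a plain A/C/G/T string (IUPAC codes pass through unchanged)."""
    return seq.translate(_COMPLEMENT)[::-1]


def plus_strand_alleles(observed, strand):
    """Split 'observed' on '/' and orient each allele to the plus strand."""
    alleles = observed.split("/")
    if strand == "-":
        alleles = [a if a == "-" else reverse_complement(a) for a in alleles]
    return alleles


def refine_indel(chrom_start, chrom_end, ref_ucsc, observed, strand="+"):
    """Refine a dbSNP 'in-del' into 'insertion' or 'deletion' where the rule applies.

    Coordinates are 0-based, half-open, so the reference span is
    chrom_end - chrom_start bases long.
    """
    if observed == "lengthTooLong":
        return "in-del"  # alleles unknown, nothing to compare
    ref_len = chrom_end - chrom_start
    # Ignore the reference allele itself if it is listed in observed.
    others = [a for a in plus_strand_alleles(observed, strand) if a != ref_ucsc]
    lengths = [0 if a == "-" else len(a) for a in others]
    if ref_len == 0:
        # Reference has 0 bases; any non-empty allele is an insertion into it.
        return "insertion" if any(n > 0 for n in lengths) else "in-del"
    if lengths and all(n < ref_len for n in lengths):
        # Reference allele is longer than every other observed allele.
        return "deletion"
    return "in-del"
```

For example, with made-up coordinates, `refine_indel(100, 100, "-", "-/C")` gives "insertion", and `refine_indel(10, 11, "A", "-/T", "-")` gives "deletion", because T on the minus strand is the reference A on the plus strand and the only other allele is empty.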
> 3) When the values of 'chromStart' and 'chromEnd' are equal, the
> value of 'observed' appears to always be "lengthTooLong";

Not always -- e.g. rs56289060 has chromStart==chromEnd, class insertion, but observed is -/C.

> by contrast, when the 'chromStart' and 'chromEnd' values are not
> equal, the value of observed may or may not be "lengthTooLong".

Yes.

> Is every entry with "lengthTooLong" in the observed field to be
> interpreted as an allelism in which the two possible allele states
> are a too-long-to-be-reliably-sequenced motif versus a gap?

I believe 'lengthTooLong' means too long for the file format from which we grab it.

> Is the specific nucleotide sequence of that motif stored somewhere
> in the database? If not, how, if at all, can we find its value?

It might be stored in one of the many tables of dbSNP that we do not use -- you can ask the dbSNP team at [email protected] .

> 4) How, if at all, does the value of the 'strand' field affect the
> interpretation of the "lengthTooLong" value listed in the observed
> field, and/or of the "-" value listed in the 'refUCSC' field?

It doesn't -- if we don't know the observed alleles, then we have nothing to reverse-complement. If the re-alignment of flanking sequences to the reference genome looks reasonable, then perhaps refUCSC is one of the observed alleles.

> 5) In some cases (e.g., rs10605661), the 'observed' field contains a
> "-" value and/or a multinucleotide value, but the 'refUCSC' field
> contains only a single-base value. Why is the 'refUCSC' value not one
> of the values listed in 'observed'?

Again, it all boils down to dbSNP's mapping of the submitted flanking sequences to the reference genome.

> 6) Is the variable segment in a class "in-del" SNP always the segment
> that starts at position 'chromStart' + 1 and continues through
> position 'chromEnd' (even when the value of strand is "-"), or are
> there other rules for inferring exactly which positions vary?

In the intuitive 1-based, fully closed numbering system, yes, the mapped variable region of a SNP of *any* class is chromStart+1 to chromEnd.

> For class "insertion":
>
> Are the following inferences right (and, if not, please advise re. correct
> interpretation)?:
>
> 1) Every class "insertion" SNP has exactly two allele state values in
> 'observed'.

There is no theoretical reason why that should be absolute, but interestingly that holds for snp129. It does not hold for snp130 -- e.g. the new-in-130 rs72542761 is an insertion into the reference genome (chromStart=chromEnd) with observed = C/T/TTACTGA.

> 2) "-" appears as the value of 'refUCSC', and as a value of
> 'observed', if and only if the value of 'chromStart' equals the value
> of 'chromEnd'.

If chromStart==chromEnd, "-" is what we put in refUCSC by convention. "-" may or may not appear in observed. (Again, snp129 may differ from snp130 here.)

> 3) If the value of 'chromStart' does not equal the value of
> 'chromEnd', and the length of some non-'refUCSC' allele listed in
> 'observed' equals the quantity 'chromEnd' - 'chromStart', then the
> subject and reference genomes align with no local gap, and the subject
> genome has that non-'refUCSC' allele substituted for the reference
> positions 'chromStart+1' to 'chromEnd'.

Yes.

> 4) If the value of 'chromStart' does not equal the value of 'chromEnd,
> and the length of some non-'refUCSC' allele listed in 'observed'
> exceeds the quantity 'chromEnd - chromStart', then the reference
> genome contains a local gap relative to the subject genome, and the
> subject genome has that non-refUCSC allele substituted for
> gap-inclusive reference positions 'chromStart+1' to 'chromEnd'.

Yes, it is a substitution. (It's possible that the reference genome allele could be a subsequence of the observed -- for example, if refUCSC is GT and the observed is GTAA. In that case, an alignment tool would probably not use the gap character, but would call it a pure insertion of AA.)

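As a small illustration of the coordinate convention and the length comparisons in the answers above, here is a minimal Python sketch. The function names and return labels are invented for this example; it assumes chromStart/chromEnd are 0-based, half-open table coordinates and that "-" denotes an allele of zero bases.

```python
def to_one_based(chrom_start, chrom_end):
    """Convert 0-based, half-open table coords to 1-based, fully closed.

    Only the start changes: the variable region is chromStart+1 .. chromEnd.
    When chromStart == chromEnd the result has start > end, i.e. an
    empty region between two bases.
    """
    return chrom_start + 1, chrom_end


def span_size(chrom_start, chrom_end):
    """Number of reference bases covered; no +1/-1 needed with half-open coords."""
    return chrom_end - chrom_start


def compare_allele_to_span(chrom_start, chrom_end, allele):
    """Compare one observed allele's length with the mapped reference span."""
    allele_len = 0 if allele == "-" else len(allele)
    size = span_size(chrom_start, chrom_end)
    if allele_len == size:
        return "equal"    # same-length substitution, no local gap
    if allele_len > size:
        return "longer"   # reference has a local gap relative to this allele
    return "shorter"      # this allele has a local gap relative to the reference
```

For instance, `to_one_based(99, 100)` is `(100, 100)`, a single base, and with a two-base span such as `(10, 12)` the allele "GTAA" compares as "longer" while "-" compares as "shorter". Whether a longer allele is best described as a substitution or a pure insertion still depends on the alignment, as noted for the GT versus GTAA case.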
> For class "deletion":
>
> Are the following inferences right (and, if not, please advise
> re. correct interpretation)?:
>
> 1) No class "deletion" SNP has equal 'chromStart' and 'chromEnd'
> values.

Yes.

> 2) No class "deletion" SNP has "-" as a 'refUCSC' value.

Yes (corollary of 1).

> 3) Every class "deletion" SNP with "-" as a value in 'observed' has
> exactly one other 'observed' value; other cases (in which "-" is not
> listed as a value in 'observed') may have more than two possible
> allele states listed in 'observed'.

This seems to hold for snp129 and snp130 but I would not count on it to hold forever -- someday there could well be a deletion SNP with "-" and two other observed alleles.

> 4) If the length of some non-'refUCSC' allele listed in 'observed'
> equals the quantity 'chromEnd' - 'chromStart', then the subject and
> reference genomes align with no local gap, and the subject genome has
> that non-'refUCSC' allele substituted for reference positions
> 'chromStart+1' to 'chromEnd'.

Yes.

> 5) If the length of some non-'refUCSC' allele listed in 'observed' is
> less than the quantity 'chromEnd - chromStart', then the subject
> genome contains a local gap relative to the reference genome, and the
> subject genome has that non-'refUCSC' allele substituted for reference
> positions 'chromStart+1' to 'chromEnd'.

Yes, it is a substitution of unequal sizes, but as above, the smaller sequence could be a subsequence of the larger sequence.

Hope that helps, and please send more questions to [email protected] as you have them.

Angie

_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
