Hi Angie,

Wow, thanks for your informative replies! My turn to digest for a bit, to make sure I understand everything.
Thanks,
Nathan

On Thu, Sep 10, 2009 at 2:39 PM, Angie Hinrichs <[email protected]> wrote:
> Hi Nathan,
>
> I will answer each of your questions below, but I'll begin with some
> information that applies to many of the questions.
>
> First, start and end coordinates in any UCSC database table, including
> snp tables regardless of class, use a numbering scheme that we call
> 0-based, half open, as opposed to the more intuitive 1-based, fully
> closed numbering system that is most commonly used elsewhere. 0-based
> means that the first base of the chromosome is 0. Half open means
> that the end coordinate is for the base after the last base included
> in the region (end = 0-based index of last base + 1). That addition
> of 1 due to the open end makes the end coordinates appear 1-based. In
> order to convert UCSC coordinates to 1-based, fully closed, simply add
> 1 to the start coord.
>
> We use that numbering system internally because it makes coordinate
> arithmetic easier, and that reduces the number of bugs in our code.
> 0-based numbering is more natural for programmers. Due to the
> half-open coords, the size of an item is end-start, the start of an
> intron is equal to the end of the previous exon, and so on -- none of
> the +1 / -1 fencepost conditions of fully closed coords.
>
> Coordinates displayed in the Genome Browser web pages are 1-based,
> fully closed as people would naturally expect -- we add 1 to
> chromStart when printing out a position range, to mask the
> non-intuitive numbering system of our database tables.
>
> Second, the inconsistencies between refUCSC and class/observed make
> more sense in light of how those values are independently derived.
> Also, we flag those inconsistencies in the table snpNNNExceptions
> (snp129Exceptions, snp130Exceptions), and snpNNNExceptions can be used
> to filter out suspect mappings from snpNNN.
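A minimal sketch of the coordinate arithmetic described earlier in this message, assuming a record is reduced to its chromStart/chromEnd integers. The function names are hypothetical and are not part of any UCSC tool.

```python
def to_one_based_closed(chrom_start, chrom_end):
    """Convert 0-based, half-open table coordinates to 1-based, fully closed."""
    return chrom_start + 1, chrom_end


def to_zero_based_half_open(start, end):
    """Inverse conversion: 1-based, fully closed back to table coordinates."""
    return start - 1, end


def item_size(chrom_start, chrom_end):
    """With half-open coordinates the size is simply end - start."""
    return chrom_end - chrom_start


# The first base of a chromosome is stored as chromStart=0, chromEnd=1.
print(to_one_based_closed(0, 1))      # (1, 1)
print(item_size(0, 1))                # 1

# A zero-length item (chromStart == chromEnd) has size 0, and its
# 1-based "start" comes out one greater than its "end".
print(to_one_based_closed(100, 100))  # (101, 100)
print(item_size(100, 100))            # 0
```

Only the start coordinate changes in either direction, which is why the end coordinate of a table row already looks 1-based.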
>
> The fundamental data in dbSNP are submissions from wet labs (or
> consortia) that have observed polymorphisms when sequencing multiple
> individuals. The labs submit flanking sequences (sequences to the
> left and right of the polymorphic site), observed alleles, and
> information about the sample population, technology etc. Each
> submission is given an ID that begins with ss (submitted SNP) and ends
> with a number.
>
> dbSNP then maps all flanking sequences to the reference genome and
> several alternate reference genomes using a complex process -- flanks
> are aligned separately and then custom-processed back into pairs; then
> each submitted SNP's mapping is whatever falls between its paired
> flank mappings on the reference or alternate genome. Using the
> mappings of submitted SNPs, dbSNP clusters the submitted SNPs into
> reference SNPs. Each reference SNP has a stable ID composed of "rs"
> and a number.
>
> The observed alleles in our snpNNN tables are taken directly from the
> original submissions. I believe class is a function of the observed
> alleles collected from all submitted SNPs in each reference SNP
> cluster. However, refUCSC is simply the genomic bases that appear at
> dbSNP's mapped coordinates. refUCSC is not a function of observed
> alleles, but rather of the submitted flanking sequences and dbSNP's
> mapping process. Inconsistencies do arise, and then more
> investigation is needed in order to determine which piece of
> information is wrong.
>
> dbSNP itself includes some measures of confidence in each SNP mapping.
> Each reference SNP is assigned a weight: 1 means that it has a unique
> mapping, 2 means that it has a couple or a few, and 3 means that it
> has many mappings. Multiple mappings reduce confidence in a SNP --
> are we seeing real polymorphism, or just almost-identical pieces of
> the genome? dbSNP also describes its alignments with a "locus type"
> (locType in our snpNNN tables), with 6 possible values.
> The first 3 simply describe whether there are >1, 1 or 0 bases between
> the mapped flanks. Inconsistencies between class and locType (and thus
> refUCSC as you have noted) are easy to identify, and we flag them in
> snpNNNExceptions. The latter 3 locType values are more indicative of
> difficulty in mapping: they indicate that there was a gap in the
> alignment of flanking sequence adjacent to the polymorphic site. We
> flag the occurrences of the latter 3 types in snpNNNExceptions, and on
> the genome browser details page for a SNP, we suggest that the user
> inspect our local re-alignment of the flanking sequences to the
> neighboring genomic sequence. Sometimes it appears that a better
> mapping would have been possible, and perhaps the given coordinates
> are not correct.
>
> Finally, UCSC takes only a slice of the massive dbSNP database: we
> show only the clustered reference SNPs that have been mapped to the
> reference genome on which the browser is built, with a weight of 1, 2
> or 3. We discard mappings of SNPs to alternates such as the Celera or
> Venter genome sequences, and we discard SNPs that are not mapped to
> any ref/alt, or SNPs that are mapped to so many locations that they
> are assigned a weight greater than 3. Also, we store a small subset
> of the types of data stored in dbSNP. If you download the entire
> dbSNP, you can drill down further into the evidence for each SNP but
> the learning curve is even steeper.
>
> > In our group's efforts to accurately parse UCSC human SNP records,
> > several small puzzles have emerged. First, a few general questions
> > about SNP records of any class:
> >
> > 1) As a value in the 'refUCSC' field, does "-" always simply denote a
> > gap relative to a subject (or some other reference) genome?
>
> Yes. Ultimately it means that the flanking sequence alignments are
> contiguous on the reference genome.
>
> > 2) Is the value in the 'refUCSC' field, or its reverse complement (if
> > and only if "-" is the value of 'strand' but not the value of
> > 'refUCSC'?), always at least an implied value of the 'observed' field,
> > in addition to any other value(s) listed there? That is, for entries
> > where the values of 'observed' do not include either the value of
> > 'refUCSC' or its reverse complement, are we to presume that the
> > missing value would be included in a more verbose population of the
> > 'observed' field?
>
> Yes -- or that there might have been an issue in mapping flanking
> sequences to the genome.
>
> > 3) Are all values for 'observed' given as part of the strand specified
> > in the 'strand' field, while values in 'refUCSC' are given as part of
> > the plus strand, regardless of the value listed in the 'strand' field?
>
> Yes -- observed sequences come from submitters, and they may align to
> either strand of the reference genome.
>
> > 4) What mapping, if any, holds between the allele state listed in
> > 'refUCSC' and the ancestral (versus derived) allele state for that
> > SNP?
>
> I would not expect a correlation. The reference genome is a mosaic of
> ten individual genomes -- better than one, but still very few samples.
> Without an outgroup to human, I don't think we can determine ancestral
> state anyway. Finally, the accuracy of refUCSC also depends on the
> accuracy of the mapping of flanking sequences, so anything odd about
> the mapping reduces confidence in chromStart, chromEnd and refUCSC.
>
> We have another track from Human Genome Diversity Project (HGDP Allele
> Freq), where they sampled fewer SNPs (~660,000) on many individuals,
> and they attempted to guess an ancestral allele -- but by using the
> allele found in alignments of the human reference genome to the
> reference genome of a single chimp, IIRC.
>
> I believe several research groups have made more sophisticated
> attempts to determine ancestral states, including researchers in
> UCSC's own Haussler group, and Javier Herrero at Sanger.
>
> > Next, a few questions about SNP records of class "single":
> >
> > 1) For a class "single" SNP entry, it appears that the value of
> > 'chromStart' equals that of 'chromEnd' if and only if the value of
> > 'refUCSC' is "-" (e.g., rs3542401, rs1755135). Yet the allele states
> > (always multiple) listed in the observed field never include "-", but
> > instead always appear as single bases (e.g., "A/T"). How does the "-"
> > value in 'refUCSC' relate to the multiple allele state values in
> > 'observed'?
>
> Class seems to be a function of observed -- class is single if and
> only if all observed alleles are single-base. (Note: sometimes IUPAC
> ambiguous bases such as R, K etc are given.) However, chromStart and
> chromEnd are determined by dbSNP from their alignment of flanking
> sequences to the reference genome. The bases between those
> coordinates determine refUCSC. When the reported SNP and the mapping
> to the reference produce inconsistent values, that reduces confidence
> in the mapped SNP.
>
> When class is single but refUCSC is not a single base, an exception
> (either SingleClassZeroSpan or SingleClassLongerSpan) is stored in
> snpNNNExceptions, and described on the SNP details page.
>
> > 2) Given that the 'chromStart' and 'chromEnd' values are
> > equal only where the value of 'refUCSC' is "-", are we right to infer
> > that such cases represent single-base insertions/deletions, while all
> > other class "single" cases represent single-base substitutions? If
> > this interpretation is right, why is SNP rs17551353 (strand = -;
> > refUCSC = -; observed = C/G) classified as class "single", while SNP
> > rs28383030 (strand = "-"; refUCSC = "A"; observed = "-/T") is
> > classified as class "in-del"?
>
> Class is determined from observed before the mapping to the reference
> genome. refUCSC is a function of the mapping of flanking sequences.
> They can have inconsistent results, and more info is needed to resolve
> each case.
>
> > 3) In some class "single" entries (e.g., rs5869813, rs61556558), the
> > value of 'refUCSC' is a multibase string, but each allele state in
> > 'observed' is a single base. How are such entries (specifically, the
> > multibase value of 'refUCSC') to be interpreted, especially for
> > parsing which (class =...) "single" base is the site of variation? In
> > what cases, if any (or all), are we to infer that the site of
> > variation for a class "single" SNP is 'chromStart'+1?
>
> In those cases, dbSNP's mapping would have us believe that *all* bases
> in refUCSC are replaced by the single bases in observed, i.e. the SNP
> is a deletion from the reference genome and the reference genome's
> allele simply wasn't reported by any submitters. Interestingly, your
> two example SNPs are in build 129 but not build 130 -- they both have
> been merged into overlapping single-base SNPs whose mappings are
> single-base as expected. I think this means that dbSNP identified and
> corrected a problem either with some flanking sequences or with their
> mapping algorithm.
>
> However, snp130 still has examples such as rs72497839. If you view
> that in the genome browser, and click on it to see its details page,
> then look at the notes in the "Annotations" section and also at the
> details page's re-alignment of flanking sequences to the neighboring
> genome sequence. Lots of gaps in the alignment of the 5' flanking
> sequence... so my confidence in the mapping is reduced.
>
> > Next, a few questions about SNP records of class "in-del":
> >
> > 1) Is the identity of the ancestral allele invoked in further
> > classifying a class "in-del" SNP as either class "insertion" or class
> > "deletion"?
> > If not, what is the basis/purpose of this subclassification?
>
> Insertion and deletion are UCSC's local additions. dbSNP has only
> class in-del. In some cases, we see that the reference genome has 0
> bases but some observed alleles are >0 bases. All of those would be
> an insertion into the reference genome, so we call it an insertion.
> Conversely, sometimes the reference allele has more bases than any
> observed allele (except itself). If dbSNP calls it an in-del, we call
> it a deletion.
>
> > 2) Just as for class "single" entries, "-" may appear as the value of
> > the 'refUCSC' field, but not as a value of 'observed' for that entry.
> > Are such cases always also of class "insertion"?
>
> Yes.
>
> > 3) When the values of 'chromStart' and 'chromEnd' are equal, the
> > value of 'observed' appears to always be "lengthTooLong";
>
> Not always -- e.g. rs56289060 has chromStart==chromEnd, class
> insertion, but observed is -/C.
>
> > by contrast, when the 'chromStart' and 'chromEnd' values are not
> > equal, the value of observed may or may not be "lengthTooLong".
>
> Yes.
>
> > Is every entry with "lengthTooLong" in the observed field to be
> > interpreted as an allelism in which the two possible allele states
> > are a too-long-to-be-reliably-sequenced motif versus a gap?
>
> I believe 'lengthTooLong' means too long for the file format from
> which we grab it.
>
> > Is the specific nucleotide sequence of that motif stored somewhere
> > in the database? If not, how, if at all, can we find its value?
>
> It might be stored in one of the many tables of dbSNP that we do not
> use -- you can ask the dbSNP team at [email protected] .
>
> > 4) How, if at all, does the value of the 'strand' field affect the
> > interpretation of the "lengthTooLong" value listed in the observed
> > field, and/or of the "-" value listed in the 'refUCSC' field?
>
> It doesn't -- if we don't know the observed alleles, then we have
> nothing to reverse-complement.
> If the re-alignment of flanking sequences to the reference genome looks
> reasonable, then perhaps refUCSC is one of the observed alleles.
>
> > 5) In some cases (e.g., rs10605661), the 'observed' field contains a
> > "-" value and/or a multinucleotide value, but the 'refUCSC' field
> > contains only a single-base value. Why is the 'refUCSC' value not one
> > of the values listed in 'observed'?
>
> Again, it all boils down to dbSNP's mapping of the submitted flanking
> sequences to the reference genome.
>
> > 6) Is the variable segment in a class "in-del" SNP always the segment
> > that starts at position 'chromStart' + 1 and continues through
> > position 'chromEnd' (even when the value of strand is "-"), or are
> > there other rules for inferring exactly which positions vary?
>
> In the intuitive 1-based, fully closed numbering system, yes, the
> mapped variable region of a SNP of *any* class is chromStart+1 to
> chromEnd.
>
> > For class "insertion":
> >
> > Are the following inferences right (and, if not, please advise re.
> > correct interpretation)?:
> >
> > 1) Every class "insertion" SNP has exactly two allele state values in
> > 'observed'.
>
> There is no theoretical reason why that should be absolute, but
> interestingly that holds for snp129. It does not hold for snp130 --
> e.g. the new-in-130 rs72542761 is an insertion into the reference
> genome (chromStart=chromEnd) with observed = C/T/TTACTGA.
>
> > 2) "-" appears as the value of 'refUCSC', and as a value of
> > 'observed', if and only if the value of 'chromStart' equals the value
> > of 'chromEnd'.
>
> If chromStart==chromEnd, "-" is what we put in refUCSC by convention.
> "-" may or may not appear in observed. (Again, snp129 may differ from
> snp130 here.)
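The insertion/deletion subclassification described above for class in-del can be sketched as follows. This is only one reading of the rule as stated in this thread, not UCSC's actual code. The names `allele_length` and `refine_indel` are hypothetical, and the sketch compares lengths only, so it assumes that any observed allele with the same length as refUCSC is the reference allele itself.

```python
def allele_length(allele):
    """Length in bases of one allele; "-" denotes zero bases."""
    return 0 if allele == "-" else len(allele)


def refine_indel(ref_ucsc, observed):
    """Refine class "in-del" into "insertion" or "deletion" where the rule applies."""
    if observed == "lengthTooLong":
        return "in-del"  # alleles unknown, nothing to compare

    ref_len = allele_length(ref_ucsc)
    # Assumption: an observed allele as long as the reference IS the reference.
    other_lengths = [
        allele_length(a) for a in observed.split("/")
        if allele_length(a) != ref_len
    ]

    # Reference has 0 bases but some observed allele has >0 bases.
    if ref_len == 0 and other_lengths:
        return "insertion"
    # Reference has more bases than every other observed allele.
    if other_lengths and all(n < ref_len for n in other_lengths):
        return "deletion"
    return "in-del"


# Values quoted in this thread for rs56289060 and rs72542761:
print(refine_indel("-", "-/C"))          # insertion
print(refine_indel("-", "C/T/TTACTGA"))  # insertion
# Hypothetical records, for illustration only:
print(refine_indel("GT", "-/GT"))        # deletion
print(refine_indel("A", "-/TT"))         # in-del
```

Because only lengths are compared, strand does not matter here: reverse-complementing an allele never changes its length.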
>
> > 3) If the value of 'chromStart' does not equal the value of
> > 'chromEnd', and the length of some non-'refUCSC' allele listed in
> > 'observed' equals the quantity 'chromEnd' - 'chromStart', then the
> > subject and reference genomes align with no local gap, and the subject
> > genome has that non-'refUCSC' allele substituted for the reference
> > positions 'chromStart+1' to 'chromEnd'.
>
> Yes.
>
> > 4) If the value of 'chromStart' does not equal the value of 'chromEnd',
> > and the length of some non-'refUCSC' allele listed in 'observed'
> > exceeds the quantity 'chromEnd - chromStart', then the reference
> > genome contains a local gap relative to the subject genome, and the
> > subject genome has that non-refUCSC allele substituted for
> > gap-inclusive reference positions 'chromStart+1' to 'chromEnd'.
>
> Yes, it is a substitution. (It's possible that the reference genome
> allele could be a subsequence of the observed -- for example, if
> refUCSC is GT and the observed is GTAA. In that case, an alignment
> tool would probably not use the gap character, but would call it a
> pure insertion of AA.)
>
> > For class "deletion":
> >
> > Are the following inferences right (and, if not, please advise
> > re. correct interpretation)?:
> >
> > 1) No class "deletion" SNP has equal 'chromStart' and 'chromEnd'
> > values.
>
> Yes.
>
> > 2) No class "deletion" SNP has "-" as a 'refUCSC' value.
>
> Yes (corollary of 1).
>
> > 3) Every class "deletion" SNP with "-" as a value in 'observed' has
> > exactly one other 'observed' value; other cases (in which "-" is not
> > listed as a value in 'observed') may have more than two possible
> > allele states listed in 'observed'.
>
> This seems to hold for snp129 and snp130 but I would not count on it
> to hold forever -- someday there could well be a deletion SNP with
> "-" and two other observed alleles.
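The length comparisons in the inferences above can be sketched as a small helper that first puts each observed allele on the plus strand, since observed follows the 'strand' field while refUCSC is always plus-strand. The names `to_plus_strand` and `describe_alleles` are hypothetical, and the complement table covers only A, C, G and T, so IUPAC ambiguity codes are not handled.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def to_plus_strand(allele, strand):
    """Reverse-complement an observed allele that is reported on the minus strand."""
    if strand == "-" and allele != "-":
        return allele.translate(_COMPLEMENT)[::-1]
    return allele


def describe_alleles(chrom_start, chrom_end, strand, ref_ucsc, observed):
    """Label each observed allele relative to the mapped reference span."""
    if observed == "lengthTooLong":
        return {}  # alleles unknown, nothing to compare

    span = chrom_end - chrom_start  # half-open coords: size is end - start
    labels = {}
    for allele in observed.split("/"):
        plus = to_plus_strand(allele, strand)
        length = 0 if plus == "-" else len(plus)
        if plus == ref_ucsc:
            labels[allele] = "reference"
        elif length == span:
            labels[allele] = "same-length substitution"
        elif length > span:
            labels[allele] = "longer than reference span"
        else:
            labels[allele] = "shorter than reference span"
    return labels


# The GT versus GTAA example from the answer above, at made-up coordinates:
print(describe_alleles(10, 12, "+", "GT", "GT/GTAA"))
# {'GT': 'reference', 'GTAA': 'longer than reference span'}

# A hypothetical minus-strand record: observed T is the complement of refUCSC A.
print(describe_alleles(5, 6, "-", "A", "C/T"))
# {'C': 'same-length substitution', 'T': 'reference'}
```

The sketch reports only the length relationship. Deciding whether an unequal-length allele is better described as a pure insertion or deletion, as with GT and GTAA, would need an actual alignment.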
>
> > 4) If the length of some non-'refUCSC' allele listed in 'observed'
> > equals the quantity 'chromEnd' - 'chromStart', then the subject and
> > reference genomes align with no local gap, and the subject genome has
> > that non-'refUCSC' allele substituted for reference positions
> > 'chromStart+1' to 'chromEnd'.
>
> Yes.
>
> > 5) If the length of some non-'refUCSC' allele listed in 'observed' is
> > less than the quantity 'chromEnd - chromStart', then the subject
> > genome contains a local gap relative to the reference genome, and the
> > subject genome has that non-'refUCSC' allele substituted for reference
> > positions 'chromStart+1' to 'chromEnd'.
>
> Yes, it is a substitution of unequal sizes, but as above, the smaller
> sequence could be a subsequence of the larger sequence.
>
> Hope that helps, and please send more questions to [email protected]
> as you have them.
>
> Angie

--
Nathaniel Pearson, PhD
Bioinformatics Scientist
Knome, Inc.
101 Main St, Fl 16
Cambridge, MA 02142
USA
-------
Tel. 617.528.2157
Fax 617.528.2199
_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
