Re: [Genome] pre-query on SNP data schema

Angie Hinrichs Thu, 10 Sep 2009 11:39:42 -0700

Hi Nathan, 

I will answer each of your questions below, but I'll begin with some 
information that applies to many of the questions.


First, start and end coordinates in any UCSC database table, including 
snp tables regardless of class, use a numbering scheme that we call 
0-based, half open, as opposed to the more intuitive 1-based, fully 
closed numbering system that is most commonly used elsewhere. 0-based 
means that the first base of the chromosome is 0. Half open means 
that the end coordinate is for the base after the last base included 
in the region (end = 0-based index of last base + 1). That addition 
of 1 due to the open end makes the end coordinates appear 1-based. In 
order to convert UCSC coordinates to 1-based, fully closed, simply add 
1 to the start coord and leave the end coord unchanged. 

We use that numbering system internally because it makes coordinate 
arithmetic easier, and that reduces the number of bugs in our code. 
0-based numbering is more natural for programmers. Due to the 
half-open coords, the size of an item is end-start, the start of an 
intron is equal to the end of the previous exon, and so on -- none of 
the +1 / -1 fencepost conditions of fully closed coords. 
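
As a rough illustration of that arithmetic, here is a minimal Python 
sketch. The exon coordinates and the function names are made up for the 
example; they do not come from any UCSC table or library. 

```python
# Hypothetical exons as (start, end) in 0-based, half-open coordinates.
exon1 = (100, 200)
exon2 = (350, 400)

def size(start, end):
    """Length of a half-open interval: just end - start, no +1."""
    return end - start

def intron_between(left_exon, right_exon):
    """The intron starts at the left exon's end and ends at the right exon's start."""
    return (left_exon[1], right_exon[0])

intron = intron_between(exon1, exon2)
print(size(*exon1), intron, size(*intron))  # 100 (200, 350) 150
```

With fully closed coordinates the same exons would be 101-200 and 
351-400, and both the size and the intron bounds would need a +1 or -1. 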

Coordinates displayed in the Genome Browser web pages are 1-based, 
fully closed as people would naturally expect -- we add 1 to 
chromStart when printing out a position range, to mask the 
non-intuitive numbering system of our database tables. 
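
The conversion between the two numbering systems could be sketched in 
Python as follows. The function names are illustrative only and are not 
part of any UCSC tool. 

```python
def ucsc_to_one_based(chrom_start, chrom_end):
    """0-based, half-open -> 1-based, fully closed: only the start changes."""
    return chrom_start + 1, chrom_end

def one_based_to_ucsc(first, last):
    """1-based, fully closed -> 0-based, half-open: only the start changes."""
    return first - 1, last

def format_position(chrom, chrom_start, chrom_end):
    """Render a table row's coordinates as a 1-based position range."""
    first, last = ucsc_to_one_based(chrom_start, chrom_end)
    return f"{chrom}:{first}-{last}"

print(format_position("chr1", 0, 1))  # chr1:1-1 (the first base of chr1)
```

Note that a zero-length item (chromStart == chromEnd) comes out of this 
arithmetic with first = last + 1, so code that consumes the 1-based 
range may want to treat that case separately. 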

Second, the inconsistencies between refUCSC and class/observed make 
more sense in light of how those values are independently derived. 
Also, we flag those inconsistencies in the table snpNNNExceptions 
(snp129Exceptions, snp130Exceptions), and snpNNNExceptions can be used 
to filter out suspect mappings from snpNNN. 
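
A minimal sketch of that filtering in Python, assuming both tables have 
already been loaded as lists of dicts. The key names ("name", 
"exception"), the placeholder IDs and the sample rows are assumptions 
made for the example, so check them against the actual table schema. 

```python
# Hypothetical rows; "rsAAA" etc. are placeholders, not real SNP IDs.
snp_rows = [
    {"name": "rsAAA", "chromStart": 100, "chromEnd": 101},
    {"name": "rsBBB", "chromStart": 200, "chromEnd": 200},
    {"name": "rsCCC", "chromStart": 300, "chromEnd": 304},
]
exception_rows = [
    {"name": "rsBBB", "exception": "SingleClassZeroSpan"},
    {"name": "rsCCC", "exception": "SingleClassLongerSpan"},
]

def drop_flagged(snp_rows, exception_rows, tolerated=()):
    """Keep SNP rows whose name has no exception outside `tolerated`."""
    flagged = {e["name"] for e in exception_rows
               if e["exception"] not in tolerated}
    return [row for row in snp_rows if row["name"] not in flagged]

clean = drop_flagged(snp_rows, exception_rows)
print([row["name"] for row in clean])  # ['rsAAA']
```

This version drops every mapping of a flagged SNP by name. Matching on 
name plus coordinates instead would drop only the flagged mapping. 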

The fundamental data in dbSNP are submissions from wet labs (or 
consortia) that have observed polymorphisms when sequencing multiple 
individuals. The labs submit flanking sequences (sequences to the 
left and right of the polymorphic site), observed alleles, and 
information about the sample population, technology etc. Each 
submission is given an ID that begins with ss (submitted SNP) and ends 
with a number. 

dbSNP then maps all flanking sequences to the reference genome and 
several alternate reference genomes using a complex process -- flanks 
are aligned separately and then custom-processed back into pairs; then 
each submitted SNP's mapping is whatever falls between its paired 
flank mappings on the reference or alternate genome. Using the 
mappings of submitted SNPs, dbSNP clusters the submitted SNPs into 
reference SNPs. Each reference SNP has a stable ID composed of "rs" 
and a number. 

The observed alleles in our snpNNN tables are taken directly from the 
original submissions. I believe class is a function of the observed 
alleles collected from all submitted SNPs in each reference SNP 
cluster. However, refUCSC is simply the genomic bases that appear at 
dbSNP's mapped coordinates. refUCSC is not a function of observed 
alleles, but rather of the submitted flanking sequences and dbSNP's 
mapping process. Inconsistencies do arise, and then more 
investigation is needed in order to determine which piece of 
information is wrong. 

dbSNP itself includes some measures of confidence in each SNP mapping. 
Each reference SNP is assigned a weight: 1 means that it has a unique 
mapping, 2 means that it has a couple or a few, and 3 means that it 
has many mappings. Multiple mappings reduce confidence in a SNP -- 
are we seeing real polymorphism, or just almost-identical pieces of 
the genome? dbSNP also describes its alignments with a "locus type" 
(locType in our snpNNN tables), with 6 possible values. The first 3 
simply describe whether there are >1, 1 or 0 bases between the mapped 
flanks. Inconsistencies between class and locType (and thus refUCSC 
as you have noted) are easy to identify, and we flag them in 
snpNNNExceptions. The latter 3 locType values are more indicative of 
difficulty in mapping: they indicate that there was a gap in the 
alignment of flanking sequence adjacent to the polymorphic site. We 
flag the occurrences of the latter 3 types in snpNNNExceptions, and on 
the genome browser details page for a SNP, we suggest that the user 
inspect our local re-alignment of the flanking sequences to the 
neighboring genomic sequence. Sometimes it appears that a better 
mapping would have been possible, and perhaps the given coordinates 
are not correct. 
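
As a sanity check on the first three locType values, something like the 
following Python sketch could be used. The locType strings ("between", 
"exact", "range", "rangeInsertion", "rangeSubstitution", 
"rangeDeletion") are my assumption about how the six values are spelled 
in the snpNNN tables, and the function names and return labels are 
invented for the example. 

```python
# Assumed spellings of the three gap-related locType values.
GAPPED_LOC_TYPES = {"rangeInsertion", "rangeSubstitution", "rangeDeletion"}

def span_loc_type(chrom_start, chrom_end):
    """locType implied by the number of bases between the mapped flanks."""
    span = chrom_end - chrom_start
    if span < 0:
        raise ValueError("chromEnd must not be less than chromStart")
    if span == 0:
        return "between"   # 0 bases between the flanks
    if span == 1:
        return "exact"     # exactly 1 base
    return "range"         # more than 1 base

def check_loc_type(loc_type, chrom_start, chrom_end):
    """Label a mapping as gapped, consistent, or inconsistent with its span."""
    if loc_type in GAPPED_LOC_TYPES:
        return "gapped-flank"  # worth inspecting the flank re-alignment
    if loc_type == span_loc_type(chrom_start, chrom_end):
        return "consistent"
    return "inconsistent"
```

For instance, check_loc_type("exact", 100, 101) gives "consistent", 
while any of the three gap-related values gives "gapped-flank" 
regardless of the span. 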

Finally, UCSC takes only a slice of the massive dbSNP database: we 
show only the clustered reference SNPs that have been mapped to the 
reference genome on which the browser is built, with a weight of 1, 2 
or 3. We discard mappings of SNPs to alternates such as the Celera or 
Venter genome sequences, and we discard SNPs that are not mapped to 
any ref/alt, or SNPs that are mapped to so many locations that they 
are assigned a weight greater than 3. Also, we store a small subset 
of the types of data stored in dbSNP. If you download the entire 
dbSNP, you can drill down further into the evidence for each SNP but 
the learning curve is even steeper. 


> In our group's efforts to accurately parse UCSC human SNP records, 
> several small puzzles have emerged. First, a few general questions 
> about SNP records of any class: 
> 
> 1) As a value in the 'refUCSC' field, does "-" always simply denote a 
> gap relative to a subject (or some other reference) genome? 

Yes. Ultimately it means that the flanking sequence alignments are 
contiguous on the reference genome, i.e. there are no reference bases 
between them. 


> 2) Is the value in the 'refUCSC' field, or its reverse complement (if 
> and only if "-" is the value of 'strand' but not the value of 
> 'refUCSC'?), always at least an implied value of the 'observed' field, 
> in addition to any other value(s) listed there? That is, for entries 
> where the values of 'observed' do not include either the value of 
> 'refUCSC' or its reverse complement, are we to presume that the 
> missing value would be included in a more verbose population of the 
> 'observed' field? 

Yes -- or that there might have been an issue in mapping flanking 
sequences to the genome. 


> 3) Are all values for 'observed' given as part of the strand specified 
> in the 'strand' field, while values in 'refUCSC' are given as part of 
> the plus strand, regardless of the value listed in the 'strand' field? 

Yes -- observed sequences come from submitters, and they may align to 
either strand of the reference genome. 
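
To compare observed with refUCSC, the observed alleles therefore need to 
be reverse-complemented when strand is "-". A minimal Python sketch, 
with function names of my own choosing; the IUPAC ambiguity codes are 
handled, and "lengthTooLong" is treated as having no alleles to compare: 

```python
# Complement table for DNA bases plus IUPAC ambiguity codes.
_COMPLEMENT = str.maketrans("ACGTRYKMSWBDHVN", "TGCAYRMKSWVHDBN")

def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

def observed_on_plus_strand(observed, strand):
    """Split 'A/T'-style observed alleles and orient them to the plus strand."""
    if observed == "lengthTooLong":
        return []  # alleles unknown, nothing to reverse-complement
    alleles = observed.split("/")
    if strand == "-":
        # "-" (no bases) stays as it is on either strand.
        alleles = [a if a == "-" else reverse_complement(a) for a in alleles]
    return alleles

def ref_in_observed(ref_ucsc, observed, strand):
    """True if refUCSC is among the observed alleles after strand correction."""
    return ref_ucsc in observed_on_plus_strand(observed, strand)

# rs28383030 from this thread: strand "-", refUCSC "A", observed "-/T".
print(ref_in_observed("A", "-/T", "-"))  # True
```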


> 4) What mapping, if any, holds between the allele state listed in 
> 'refUCSC' and the ancestral (versus derived) allele state for that 
> SNP? 

I would not expect a correlation. The reference genome is a mosaic of 
ten individual genomes -- better than one, but still very few samples. 
Without an outgroup to human, I don't think we can determine ancestral 
state anyway. Finally, the accuracy of refUCSC also depends on the 
accuracy of the mapping of flanking sequences, so anything odd about 
the mapping reduces confidence in chromStart, chromEnd and refUCSC. 

We have another track from Human Genome Diversity Project (HGDP Allele 
Freq), where they sampled fewer SNPs (~660,000) on many individuals, 
and they attempted to guess an ancestral allele -- but by using the 
allele found in alignments of the human reference genome to the 
reference genome of a single chimp, IIRC. 

I believe several research groups have made more sophisticated 
attempts to determine ancestral states, including researchers in 
UCSC's own Haussler group, and Javier Herrero at Sanger. 


> Next a few questions about SNP records of class "single": 
> 
> 1) For a class "single" SNP entry, it appears that the value of 
> 'chromStart' equals that of 'chromEnd' if and only if the value of 
> 'refUCSC' is "-" (e.g., rs3542401, rs1755135). Yet the allele states 
> (always multiple) listed in the observed field never include "-", but 
> instead always appear as single bases (e.g., "A/T"). How does the "-" 
> value in 'refUCSC' relate to the multiple allele state values in 
> 'observed'? 

Class seems to be a function of observed -- class is single if and 
only if all observed alleles are single-base. (Note: sometimes IUPAC 
ambiguous bases such as R, K etc are given.) However, chromStart and 
chromEnd are determined by dbSNP from their alignment of flanking 
sequences to the reference genome. The bases between those 
coordinates determine refUCSC. When the reported SNP and the mapping 
to the reference produce inconsistent values, that reduces confidence 
in the mapped SNP. 

When class is single but refUCSC is not a single base, an exception 
(either SingleClassZeroSpan or SingleClassLongerSpan) is stored in 
snpNNNExceptions, and described on the SNP details page. 
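
A simplified Python sketch of that check, based only on the span of the 
mapping. The function is hypothetical, and the criteria actually used 
to populate snpNNNExceptions may involve more than this: 

```python
def single_class_exception(snp_class, chrom_start, chrom_end):
    """Return the expected exception name for a class-single row, or None."""
    if snp_class != "single":
        return None
    span = chrom_end - chrom_start
    if span == 0:
        return "SingleClassZeroSpan"    # refUCSC would be "-"
    if span > 1:
        return "SingleClassLongerSpan"  # refUCSC would be multi-base
    return None                         # a single reference base, as expected
```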


> 2) Given that the 'chromStart' and 'chromEnd' values are 
> equal only where the value of 'refUCSC' is "-", are we right to infer 
> that such cases represent single-base insertions/deletions, while all 
> other class "single" cases represent single-base substitutions? If 
> this interpretation is right, why is SNP rs17551353 (strand = -; 
> refUCSC = -; observed = C/G) classified as class "single", while SNP 
> rs28383030 (strand = "-"; refUCSC = "A"; observed = "-/T") is 
> classified as class "in-del"? 

Class is determined from observed before the mapping to the reference 
genome. refUCSC is a function of the mapping of flanking sequences. 
They can have inconsistent results, and more info is needed to resolve 
each case. 


> 3) In some class "single" entries (e.g., rs5869813, rs61556558), the 
> value of 'refUCSC' is a multibase string, but each allele state in 
> 'observed' is a single base. How are such entries (specifically, the 
> multibase value of 'refUCSC') to be interpreted, especially for 
> parsing which (class =...) "single" base is the site of variation? In 
> what cases, if any (or all), are we to infer that the site of 
> variation for a class "single" SNP is 'chromStart'+1? 

In those cases, dbSNP's mapping would have us believe that *all* bases 
in refUCSC are replaced by the single bases in observed, i.e. the SNP 
is a deletion from the reference genome and the reference genome's 
allele simply wasn't reported by any submitters. Interestingly, your 
two example SNPs are in build 129 but not build 130 -- they both have 
been merged into overlapping single-base SNPs whose mappings are 
single-base as expected. I think this means that dbSNP identified and 
corrected a problem either with some flanking sequences or with their 
mapping algorithm. 

However, snp130 still has examples such as rs72497839. View it in the 
genome browser, click on it to see its details page, and look at the 
notes in the "Annotations" section and at the details page's 
re-alignment of flanking sequences to the neighboring genome sequence. 
There are lots of gaps in the alignment of the 5' flanking sequence, so 
my confidence in the mapping is reduced. 


> Next, a few questions about SNP records of class "in-del": 
> 
> 1) Is the identity of the ancestral allele invoked in further 
> classifying a class "in-del" SNP as either class "insertion" or class 
> "deletion"? If not, what is the basis/purpose of this 
> subclassification? 

Insertion and deletion are UCSC's local additions. dbSNP has only 
class in-del. In some cases, we see that the reference genome has 0 
bases but some observed alleles are >0 bases. All of those would be 
an insertion into the reference genome, so we call it an insertion. 
Conversely, sometimes the reference allele has more bases than any 
observed allele (except itself). If dbSNP calls it an in-del, we call 
it a deletion. 
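
The rule described above could be sketched in Python roughly as 
follows. This is an approximation for illustration, not the code that 
builds the tables. It assumes the observed alleles have already been 
oriented to the plus strand and split into a list, and it does not 
handle "lengthTooLong". 

```python
def allele_len(allele):
    """Number of bases in an allele; "-" means zero bases."""
    return 0 if allele == "-" else len(allele)

def refine_in_del(ref_ucsc, observed_plus):
    """Subclassify a dbSNP in-del as insertion or deletion, else leave it as in-del.

    observed_plus: list of observed alleles on the plus strand (assumed input).
    """
    ref_len = allele_len(ref_ucsc)
    others = [a for a in observed_plus if a != ref_ucsc]
    if ref_len == 0:
        # Reference has 0 bases and some observed allele has more.
        if any(allele_len(a) > 0 for a in others):
            return "insertion"
        return "in-del"
    # Reference allele is longer than every observed allele except itself.
    if others and all(allele_len(a) < ref_len for a in others):
        return "deletion"
    return "in-del"

print(refine_in_del("-", ["-", "C"]))    # insertion
print(refine_in_del("AT", ["-", "AT"]))  # deletion
```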


> 2) Just as for class "single" entries, "-" may appear as the value of 
> the 'refUCSC' field, but not as a value of 'observed' for that entry. 
> Are such cases always also of class "insertion"? 

Yes. 


> 3) When the values of 'chromStart' and 'chromEnd' are equal, the 
> value of 'observed' appears to always be "lengthTooLong"; 

Not always -- e.g. rs56289060 has chromStart==chromEnd, class 
insertion, but observed is -/C. 


> by contrast, when the 'chromStart' and 'chromEnd' values are not 
> equal, the value of observed may or may not be "lengthTooLong". 

Yes. 


> Is every entry with "lengthTooLong" in the observed field to be 
> interpreted as an allelism in which the two possible allele states 
> are a too-long-to-be-reliably-sequenced motif versus a gap? 

I believe 'lengthTooLong' means too long for the file format from 
which we grab it. 


> Is the specific nucleotide sequence of that motif stored somewhere 
> in the database? If not, how, if at all, can we find its value? 

It might be stored in one of the many tables of dbSNP that we do not 
use -- you can ask the dbSNP team at [email protected] . 


> 4) How, if at all, does the value of the 'strand' field affect the 
> interpretation of the "lengthTooLong" value listed in the observed 
> field, and/or of the "-" value listed in the 'refUCSC' field? 

It doesn't -- if we don't know the observed alleles, then we have 
nothing to reverse-complement. If the re-alignment of flanking 
sequences to the reference genome looks reasonable, then perhaps 
refUCSC is one of the observed alleles. 


> 5) In some cases (e.g., rs10605661), the 'observed' field contains a 
> "-" value and/or a multinucleotide value, but the 'refUCSC' field 
> contains only a single-base value. Why is the 'refUCSC' value not one 
> of the values listed in 'observed'? 

Again, it all boils down to dbSNP's mapping of the submitted flanking 
sequences to the reference genome. 


> 6) Is the variable segment in a class "in-del" SNP always the segment 
> that starts at position 'chromStart' + 1 and continues through 
> position 'chromEnd' (even when the value of strand is "-"), or are 
> there other rules for inferring exactly which positions vary? 

In the intuitive 1-based, fully closed numbering system, yes, the 
mapped variable region of a SNP of *any* class is chromStart+1 to 
chromEnd. 


> For class "insertion": 
> 
> Are the following inferences right (and, if not, please advise re. correct 
> interpretation)?: 
> 
> 1) Every class "insertion" SNP has exactly two allele state values in 
> 'observed'. 

There is no theoretical reason why that should be absolute, but 
interestingly that holds for snp129. It does not hold for snp130 -- 
e.g. the new-in-130 rs72542761 is an insertion into the reference 
genome (chromStart=chromEnd) with observed = C/T/TTACTGA. 


> 2) "-" appears as the value of 'refUCSC', and as a value of 
> 'observed', if and only if the value of 'chromStart' equals the value 
> of 'chromEnd'. 

If chromStart==chromEnd, "-" is what we put in refUCSC by convention. 
"-" may or may not appear in observed. (again, snp129 may differ from 
snp130 here.) 


> 3) If the value of 'chromStart' does not equal the value of 
> 'chromEnd', and the length of some non-'refUCSC' allele listed in 
> 'observed' equals the quantity 'chromEnd' - 'chromStart', then the 
> subject and reference genomes align with no local gap, and the subject 
> genome has that non-'refUCSC' allele substituted for the reference 
> positions 'chromStart+1' to 'chromEnd'. 

Yes. 


> 4) If the value of 'chromStart' does not equal the value of 'chromEnd', 
> and the length of some non-'refUCSC' allele listed in 'observed' 
> exceeds the quantity 'chromEnd - chromStart', then the reference 
> genome contains a local gap relative to the subject genome, and the 
> subject genome has that non-refUCSC allele substituted for 
> gap-inclusive reference positions 'chromStart+1' to 'chromEnd'. 

Yes, it is a substitution. (It's possible that the reference genome 
allele could be a subsequence of the observed -- for example, if 
refUCSC is GT and the observed is GTAA. In that case, an alignment 
tool would probably not use the gap character, but would call it a 
pure insertion of AA.) 
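
To make the GT versus GTAA case concrete, here is a small Python sketch 
that trims the bases shared at the start of two alleles and reports 
what is left. It is only an illustration with invented names. It 
handles a shared prefix only, whereas a real alignment tool might place 
the gap elsewhere. 

```python
def describe_change(ref, alt):
    """Trim the shared prefix of ref and alt.

    Returns (kind, offset, bases), where offset is the number of shared
    leading bases and bases is the inserted, deleted or substituted part.
    """
    n = 0
    while n < len(ref) and n < len(alt) and ref[n] == alt[n]:
        n += 1
    ref_rest, alt_rest = ref[n:], alt[n:]
    if not ref_rest and not alt_rest:
        return ("identical", n, "")
    if not ref_rest:
        return ("insertion", n, alt_rest)   # alt extends the reference
    if not alt_rest:
        return ("deletion", n, ref_rest)    # alt is a prefix of the reference
    return ("substitution", n, alt_rest)

print(describe_change("GT", "GTAA"))  # ('insertion', 2, 'AA')
```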


> For class "deletion": 
> 
> Are the following inferences right (and, if not, please advise 
> re. correct interpretation)?: 
> 
> 1) No class "deletion" SNP has equal 'chromStart' and 'chromEnd' 
> values. 

Yes. 


> 2) No class "deletion" SNP has "-" as a 'refUCSC' value. 

Yes (corollary of 1). 


> 3) Every class "deletion" SNP with "-" as a value in 'observed' has 
> exactly one other 'observed' value; other cases (in which "-" is not 
> listed as a value in 'observed') may have more than two possible 
> allele states listed in 'observed'. 

This seems to hold for snp129 and snp130 but I would not count on it 
to hold forever -- someday there could well be a deletion SNP with 
"-" and two other observed alleles. 


> 4) If the length of some non-'refUCSC' allele listed in 'observed' 
> equals the quantity 'chromEnd' - 'chromStart', then the subject and 
> reference genomes align with no local gap, and the subject genome has 
> that non-'refUCSC' allele substituted for reference positions 
> 'chromStart+1' to 'chromEnd'. 

Yes. 


> 5) If the length of some non-'refUCSC' allele listed in 'observed' is 
> less than the quantity 'chromEnd - chromStart', then the subject 
> genome contains a local gap relative to the reference genome, and the 
> subject genome has that non-'refUCSC' allele substituted for reference 
> positions 'chromStart+1' to 'chromEnd'. 

Yes, it is a substitution of unequal sizes, but as above, the smaller 
sequence could be a subsequence of the larger sequence. 

Hope that helps, and please send more questions to [email protected] 
as you have them. 

Angie 



_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
