Hello David, I'll try to explain case by case. I'll use hg19 (and hg18, noting where they differ). Please note that the dbSNP release for hg19 is still considered provisional. If there are problems in the data, dbSNP would need to correct the source data on their side; we only place the data on the genome and add some descriptions.
Case A - rs3223599
You are correct, the CA repeat is part of the reference sequence. The genomic flanking sequence for the SNP is not quite what would be expected. It looks as if both the C and the A should be noted as the SNP position, not just the C, which would leave the leading A out of the right flanking sequence. The alignment of the flanking sequence to the reference genome sequence shows this. My guess is that the repeats caused some problems with placing the SNP. From a practical perspective, the SNP covers all of the bases of the CA repeat (all 48) observed in the reference sequence, and the variability is in that region (the observed copy counts of the CA repeat, 24 being one of them). Since an observed allele is never zero copies or a single "CA", the SNP placement and the flanking sequence are confused. Hg18 is the same: different coordinates, but a similar SNP description.

Case B - rs3220726
Almost the same case, just the reverse: the microsatellite (ms) is in the left (not right) flanking sequence. Only one base is noted as the reference allele (C), when the variation is a repeating "CA". This one also has the problem that the SNP position is inside a CAT as compared to the genomic reference sequence, not at the start of the actual CA repeating block. The BLAT alignment of the flanking sequence was done the same way as in case A. When the flanking sequence is organized this way, the ms coordinates could end up at the start or end of (or, in this case, slightly before) the actual ms. Repetitive sequence is difficult to align to. It would be better if at least one observed allele were in the "allele" field.

Question 1: For either of these, there is not much to do except to adjust the coordinates yourself and perhaps submit the data to dbSNP. There isn't another data source. It may take some hand editing to perfect the coordinate positions until dbSNP has a chance to edit the records.
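The hand adjustment described above could be partly automated. Here is a minimal Python sketch, offered only as an illustration and not as a UCSC or dbSNP tool. It assumes you have already extracted the reference sequence around the SNP as a plain string, and that `pos` is the SNP's chromStart expressed as a 0-based offset into that string. The function name `find_repeat_near` and the parameters `min_copies` and `slack` are hypothetical choices made for this example.

```python
import re

def find_repeat_near(seq, pos, unit="CA", min_copies=3, slack=2):
    """Return (start, end, copies) for the longest run of `unit` that lies
    within `slack` bases of `pos`, or None if there is no such run.

    Coordinates are 0-based and half-open, like chromStart/chromEnd.
    """
    pattern = re.compile("(?:%s){%d,}" % (re.escape(unit), min_copies))
    best = None
    for m in pattern.finditer(seq):
        # Keep runs that overlap the window [pos - slack, pos + slack + 1).
        if m.start() <= pos + slack and m.end() > pos - slack:
            copies = (m.end() - m.start()) // len(unit)
            if best is None or copies > best[2]:
                best = (m.start(), m.end(), copies)
    return best

# Toy sequence: the annotated base (offset 2) sits just before the repeat.
toy = "GGT" + "CA" * 5 + "TGG"
print(find_repeat_near(toy, 2))
```

The `slack` window is there because, as in cases A and B, the annotated base may sit at either end of the repeat or slightly outside it. Note that the search is tied to the phase of `unit`, so a run that could equally be read as an AC repeat would be reported starting at its first C; any result should still be checked by eye against the browser.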
Case C - rs3222966
Better, since one of the actual observed alleles is noted as the allele, but it does have a problem with a missing base. If you actually look at the genome sequence, there isn't another A to make the full 48 bases. There is, however, a leading A before the SNP starts. So, is this an AC repeat? Is the actual observed allele really only 23 copies? Perhaps the G after the ms is a bad base call?

Question 2: Agreed, lengthTooLong is not very helpful. Although all three of these are obviously describing the same type of feature, they all have little problems with how they are modeled in the data. Again, you will need to make repairs to the data itself for your own use and consider submitting the evidence to dbSNP, for them to use for a correction.

Case D - rs3219614
Question 3: The /A/T are supposed to represent alternate alleles, but I think this is probably an error. The ms starts after this base (the allele coordinate is one base too small). Many of the same issues you noted in earlier cases apply, with this new wrinkle. I only count 21 observed CA repeats in the genome sequence (and in the flanking sequence), but actually only 16 if the CAT towards the end is the true end of the feature (not likely; poor base calling? Probably.) You would need to see all the evidence for each of the observed alleles in place with flanking sequence to know for certain.

Question 4: A good question. The reference is the genome the SNP is placed upon, which, as you note, only has 21 copies. hg18 is the same. Another problem.

Mostly what I can do is confirm that what you are seeing is the same as what I can see here. All of these are the same class of variation and should be formatted in the same way to facilitate analysis. This is the data from dbSNP, so any changes or corrections would need to flow from them.
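To spot records like case D when repairing the data for your own use, a short Python sketch along these lines might help. It is only an illustration under stated assumptions: the observed field is assumed to look like the examples in this thread, that is "(UNIT)n/n/..." or the literal lengthTooLong, and the helper names `parse_observed` and `copies_in_span` are hypothetical.

```python
import re

def parse_observed(observed):
    """Split an observed string such as '(CA)20/21/22/23/A/T' into
    (unit, copy_counts, other_alleles). Returns None when the string has
    no '(UNIT)' prefix, for example 'lengthTooLong'."""
    match = re.match(r"^\(([ACGT]+)\)(.+)$", observed)
    if match is None:
        return None
    unit = match.group(1)
    counts, others = [], []
    for token in match.group(2).split("/"):
        if token.isdigit():
            counts.append(int(token))   # a repeat copy number
        else:
            others.append(token)        # e.g. the stray A and T in case D
    return unit, counts, others

def copies_in_span(first, last, unit_len=2):
    """Whole repeat copies in a 1-based, inclusive span of bases."""
    return (last - first + 1) // unit_len

print(parse_observed("(CA)20/21/22/23/A/T"))
print(copies_in_span(35119591, 35119632))
```

For case D this separates the numeric copy counts (20 to 23) from the unexpected A and T entries, and the span David quoted works out to 42 bases, or 21 copies. A non-empty third element, or a None result, flags a record that needs a closer look by hand.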
I hope I helped a bit,
Jennifer

------------------------------------------------
Jennifer Jackson
UCSC Genome Bioinformatics Group

----- "David Gordon" <[email protected]> wrote:
> From: "David Gordon" <[email protected]>
> To: [email protected]
> Cc: "David Gordon" <[email protected]>
> Sent: Thursday, December 10, 2009 5:51:39 PM GMT -08:00 US/Canada Pacific
> Subject: [Genome] questions on microsatellites in snp130.txt download
>
> Dear UCSC,
>
> I've looked through the archives so I think my question hasn't yet
> been answered.
>
> I'm looking at microsatellites in the snp130.txt file. I am trying to
> make sense of the coordinates. In many case the coordinates of a
> microsatellite refer to a single base (chromEnd = chromStart + 1).
> Such is the cases A and B below. But where is the microsatellite?
> According to the alignments (by clicking on the rs... name), in case A
> the indicated microsatellite (the black bar in the browser with snp130
> set to "full") is at the *end* of the CA repeat (the actual
> microsatellite). In case B, the indicated microsatellite is at the
> *beginning* of the CA repeat. Both of these are top strand snps.
>
> Case A.
>
> 627 chr1 5576651 5576652 rs3223599 0 + C
> C (CA)19/20/21/22/23/24 genomic microsatellite by-frequency
> 0.752086 0.089764 unknown exact 1
>
> The genome browser shows the entire microsatellite
> repeat (all 24 copies of CA, so 48 bases) as the reference
> sequence. The position 5576652 marks the *end* of the CA repeat. The
> browser just shows the microsatellite as a single base.
>
> Case B:
>
> 658 chr1 9585594 9585595 rs3220726 0 + C
> C lengthTooLong genomic microsatellite by-frequency
> 0.8126 0.129764 unknown exact 1
>
> The genome browser shows base at 1-position 9585595 in this case is at
> the *left* (beginning) of the CA repeat. This repeat is not
> particularly long: 58 bases. I don't see any way that I can get this
> information from the line above.
>
> Question 1)
>
> So how would anyone know, by looking in snp130.txt, where the actual
> microsatellite is? Is there some other table that I could download
> that would give this information?
>
> In case C, the coordinates given are the actual coordinates of the
> microsatellite.
>
> Case C:
>
> 753 chr1 22129926 22129973 rs3222966 0
> + CACACACACACACACACACACACACACACACACACACACACACACAC
> CACACACACACACACACACACACACACACACACACACACACACACAC
> (CA)17/18/19/20/21/22/23/24 genomic microsatellite by-frequency
> 0.7524 0.158867 unknown range 1
>
> In this case, the microsatellite shows the full coordinates of the
> 47-base microsatellite which includes all (but 1/2) of the 24-copy CA
> repeat.
>
> Question 2)
>
> If the observed is listed as lengthTooLong, is there any way to
> determine what the bases of the microsatellite are? (Without
> that, they aren't much use.)
>
>
> Case D:
>
> 852 chr1 35119589 35119590 rs3219614 0
> + T T (CA)20/21/22/23/A/T genomic microsatellite
> by-frequency 0.284918 0.283047 unknown exact 1
>
> Question 3)
>
> In case D, what does the /A/T mean at the end of (CA)20/21/22/23/A/T ?
>
> Question 4)
>
> In case D, the CA repeat starts at position 35119591 (chr1) and ends at
> 35119632, giving 42 bases or 21 copies of the repeat. So why does the
> allele indicate that there are 23 copies?
>
> Thank you very much!
>
> David Gordon
>
>
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
