Hi David,

The information in the dbSNP (130) track is obtained directly from dbSNP, and there are known inconsistencies, as you have noticed. In general, your final query (class equal to 'single' and locType equal to 'exact') should give you the expected results, but it could miss some of the data points you want, and it can be expected to contain at least one that you do not want. SNPs are a complicated data type to apply a standard vocabulary to. Over time this will probably all be sorted out.
Meanwhile, UCSC does provide some sanity checking of the data and includes some "exceptions" notations for inconsistencies. These are described towards the bottom of the track description page, in the section "UCSC Annotations". If you examine a SNP in the browser and click through to its description page, any exception appears on that page. For queries through the Table browser or using files from the Downloads server, look in the table snp130Exceptions (notice the last two fields):

Database: hg19    Primary Table: snp130Exceptions    Row Count: 1,982,828
Format description: Annotations on data from dbSNP (for version 125 and later).

field       example           SQL type          description
bin         585               smallint(6)       Indexing field to speed chromosome range queries.
chrom       chr1              varchar(31)       Reference sequence chromosome or scaffold
chromStart  10259             int(10) unsigned  Start position in chrom
chromEnd    10260             int(10) unsigned  End position in chrom
name        rs72477211        varchar(15)       Reference SNP identifier or Affy SNP name
exception   ObservedMismatch  varchar(63)       Exception found for this SNP

Other SNP datasets may also be used, especially those that focus on a 1-1 replacement (only in hg18 for now). The recommendation is to review the descriptions for the tracks in the "Variation and Repeats" track grouping, learn about the methods and tools used, and determine whether any could be used alone or to supplement/sanity check the dbSNP results. Be careful about any circular path through the datasets, i.e. a SNP that is in dbSNP and donated or inherited by trackX should not be used to "confirm" the dbSNP entry. This may seem obvious, but it could be a bit tricky. The methods sections will help you to avoid these cases.
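As a rough illustration of combining the two tables, here is a minimal Python sketch. It assumes both tables were downloaded as tab-separated text with the column order shown in this thread. The function names are made up for the example. Dropping every SNP that has any entry in snp130Exceptions is only one possible policy, so you may prefer to filter on specific exception values instead.

```python
import csv

# Column positions follow the snp130 header quoted in this thread:
# bin chrom chromStart chromEnd name score strand refNCBI refUCSC observed
# molType class valid avHet avHetSE func locType weight
SNP_NAME, SNP_CLASS, SNP_LOCTYPE = 4, 11, 16

# snp130Exceptions columns: bin chrom chromStart chromEnd name exception
EXC_NAME = 4


def read_rows(lines):
    """Yield tab-split rows, skipping blank lines and '#' header lines."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row and not row[0].startswith("#"):
            yield row


def flagged_names(exception_lines):
    """Collect the rs identifiers that have at least one exception row."""
    return {row[EXC_NAME] for row in read_rows(exception_lines)}


def simple_substitutions(snp_lines, exception_lines):
    """Yield snp130 rows with class 'single', locType 'exact', and no exception.

    Assumption for this sketch: any exception at all disqualifies the SNP.
    """
    flagged = flagged_names(exception_lines)
    for row in read_rows(snp_lines):
        if (
            row[SNP_CLASS] == "single"
            and row[SNP_LOCTYPE] == "exact"
            and row[SNP_NAME] not in flagged
        ):
            yield row
```

Both functions accept any iterable of lines, so open file handles for the two downloaded files can be passed in directly. The five example rows in your message would all be rejected by the locType test alone, since none of them has locType 'exact'.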
Best of luck with your project,
Jennifer

------------------------------------------------
Jennifer Jackson
UCSC Genome Bioinformatics Group

----- "David Gacquer" <[email protected]> wrote:

> From: "David Gacquer" <[email protected]>
> To: [email protected]
> Sent: Wednesday, January 27, 2010 7:18:47 AM GMT -08:00 US/Canada Pacific
> Subject: [Genome] Questions about the dbSNP130 table
>
> Hello,
>
> I have used the UCSC table browser to download the complete snp130 table
> as a tab-separated text file. Since I am only interested in single
> nucleotide substitutions, where one nucleotide is replaced by another
> one, I selected from this file only the lines for which the 'class'
> field is equal to 'single'. But when I take a look at the resulting
> subset of entries, I realize that there are some lines that should not
> appear in the filtered file.
>
> Here is a copy/paste of some lines for which I need additional
> explanations :
>
> #bin chrom chromStart chromEnd name score strand refNCBI refUCSC observed molType class valid avHet avHetSE func locType weight
> 212 chr1 146668316 146677849 rs2137935 0 + ( 9533bp insertion ) ( 9533bp insertion ) C/G genomic single unknown 0 0 unknown range 3
> 585 chr1 126656 126673 rs72497839 0 - GCTCGGGCTGACCTCTC GCTCGGGCTGACCTCTC A/C genomic single unknown 0 0 unknown range 1
> 585 chr1 92822 92822 rs4317776 0 - - - A/C genomic single unknown 0 0 unknown between 3
> 586 chr1 155165 155165 rs1974329 0 - - - G/T genomic single unknown 0 0 unknown between 3
> 586 chr1 148894 148895 rs4111311 0 - C C G/T genomic single unknown 0 0 unknown rangeDeletion 3
>
> For the first two lines, I think I get the point: since a group of
> nucleotides is replaced by a single one, then the entry is given the
> class 'single' but the 'locType' field is set to 'range', because it is
> a range of nucleotides which is actually replaced by a single one. So
> the first two lines should be correct.
>
> However for the third and fourth lines, I do not understand why the
> class is 'single' since apparently they are insertions and should have
> the 'insertion' class in the database.
>
> And finally, for the fifth line, I do not understand why the 'locType'
> field is 'rangeDeletion' since apparently it is a single nucleotide
> substitution and the value of 'locType' should be 'exact'.
>
> Are there minor mistakes in the snp130 table or did I miss something
> about the classification of the entries ?
>
> And consequently, if I want to extract only single nucleotide
> substitutions, where a single nucleotide is replaced by another single
> nucleotide, should I select entries for which the 'class' and 'locType'
> fields are respectively equal to 'single' and 'exact' ? Or is there a
> possibility that undesired entries can pass this filter ?
>
> Thank you for reading me
>
> Best regards
>
> David
>
> --
> David Gacquer, Ph. D.
>
> IRIBHM - Universite Libre de Bruxelles
> Bldg C, room C.4.117
> ULB, Campus Erasme, CP602
> 808 route de Lennik
> B-1070 Brussels
> Belgium
>
> Phone: +32-2-555 4187
> Fax: +32-2-555 4655
> E-mail: dgacquer at ulb.ac.be
>
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
