Hi David,

The information in the dbSNP (130) track is obtained directly from dbSNP,
and as you have noticed, it contains known inconsistencies. In general, your
final query (class = 'single' and locType = 'exact') should give you the
expected results, but it may miss some records you want and can be expected
to include at least one that you do not want. SNPs are a complicated data
type to apply a standard vocabulary to. Over time this will probably all be
sorted out.
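
As a rough illustration only (this is not a UCSC tool), the filter you
describe could be tightened with a few extra checks on each row. The sketch
below assumes a tab-separated snp130 dump with the 18 columns shown in your
message; the function names and the extra checks (a span of exactly one base,
a single-base refUCSC, single-base observed alleles) are assumptions for the
example, and they may well discard records you would want to keep. The
rsHYPOTHETICAL row is made up for the demonstration.

```python
import csv
import io

# Column order taken from the header quoted in this thread.
SNP130_FIELDS = [
    "bin", "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "refNCBI", "refUCSC", "observed", "molType", "class", "valid",
    "avHet", "avHetSE", "func", "locType", "weight",
]
BASES = set("ACGT")


def is_simple_substitution(row):
    """True if the row looks like a one-base-for-one-base substitution."""
    alleles = row["observed"].split("/")
    return (
        row["class"] == "single"
        and row["locType"] == "exact"
        and int(row["chromEnd"]) - int(row["chromStart"]) == 1
        and row["refUCSC"] in BASES
        and len(alleles) >= 2
        and all(allele in BASES for allele in alleles)
    )


def filter_snp130(handle):
    """Read tab-separated snp130 rows, skipping '#' header lines."""
    reader = csv.DictReader(
        (line for line in handle if not line.startswith("#")),
        fieldnames=SNP130_FIELDS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
    )
    return [row for row in reader if is_simple_substitution(row)]


# One row quoted in this thread (rs4111311) and one made-up row.
demo = io.StringIO(
    "\t".join(["586", "chr1", "148894", "148895", "rs4111311", "0", "-",
               "C", "C", "G/T", "genomic", "single", "unknown", "0", "0",
               "unknown", "rangeDeletion", "3"]) + "\n"
    + "\t".join(["585", "chr1", "1000", "1001", "rsHYPOTHETICAL", "0", "+",
                 "A", "A", "A/G", "genomic", "single", "unknown", "0", "0",
                 "unknown", "exact", "1"]) + "\n"
)
kept = filter_snp130(demo)
print([row["name"] for row in kept])
```

Passing these checks does not make a record correct; it only removes rows
whose own fields contradict each other.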

Meanwhile, UCSC does some sanity checking of the data and adds
"exceptions" annotations for inconsistencies. These are described near the
bottom of the track description page, in the section "UCSC Annotations".
If you examine a SNP in the browser and click through to its description
page, any exception is listed on that page.
For queries through the Table Browser or using files from the Downloads
server, look in the table snp130Exceptions (notice the last two fields):

Database: hg19    Primary Table: snp130Exceptions    Row Count: 1,982,828
Format description: Annotations on data from dbSNP (for version 125 and later).

field       example           SQL type          description
bin         585               smallint(6)       Indexing field to speed chromosome range queries.
chrom       chr1              varchar(31)       Reference sequence chromosome or scaffold
chromStart  10259             int(10) unsigned  Start position in chrom
chromEnd    10260             int(10) unsigned  End position in chrom
name        rs72477211        varchar(15)       Reference SNP identifier or Affy SNP name
exception   ObservedMismatch  varchar(63)       Exception found for this SNP
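
One way to use this table, sketched here under assumptions rather than as an
official procedure, is to set aside every rsID that carries any exception
before trusting a filtered SNP list. The code assumes snp130Exceptions has
been downloaded as a tab-separated file with the six columns listed above;
the function names are invented for the example, and rsHYPOTHETICAL is a
made-up identifier.

```python
import csv
import io
from collections import defaultdict

# Column order taken from the snp130Exceptions schema shown above.
EXCEPTION_FIELDS = ["bin", "chrom", "chromStart", "chromEnd", "name", "exception"]


def load_exceptions(handle):
    """Map each rsID to the set of exception labels recorded for it."""
    flagged = defaultdict(set)
    reader = csv.reader(
        (line for line in handle if not line.startswith("#")),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
    )
    for fields in reader:
        if len(fields) != len(EXCEPTION_FIELDS):
            continue  # skip malformed or blank lines
        record = dict(zip(EXCEPTION_FIELDS, fields))
        flagged[record["name"]].add(record["exception"])
    return flagged


def split_by_exception(snp_names, flagged):
    """Separate rsIDs without exceptions from those that have at least one."""
    clean = [name for name in snp_names if name not in flagged]
    suspect = {name: sorted(flagged[name]) for name in snp_names if name in flagged}
    return clean, suspect


# The example row from the schema above, plus a made-up rsID with no exception.
exceptions_file = io.StringIO("585\tchr1\t10259\t10260\trs72477211\tObservedMismatch\n")
clean, suspect = split_by_exception(
    ["rs72477211", "rsHYPOTHETICAL"], load_exceptions(exceptions_file)
)
print(clean, suspect)
```

Whether a given exception type should really exclude a SNP depends on your
analysis; the "UCSC Annotations" section of the track description explains
what each label means.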


Other SNP datasets may also be used, especially those that focus on 1-to-1
replacements (only in hg18 for now). The recommendation is to review the
descriptions for the tracks in the "Variation and Repeats" track group,
learn about the methods and tools used, and determine whether any could be
used alone or to supplement or sanity-check the dbSNP results. Be careful
about any circular path through the datasets: a SNP that is in dbSNP and
was donated to or inherited by trackX should not be used to "confirm" the
dbSNP entry. This may seem obvious, but it can be a bit tricky. The methods
sections will help you avoid these cases.

Best of luck with your project,
Jennifer



------------------------------------------------ 
Jennifer Jackson 
UCSC Genome Bioinformatics Group 

----- "David Gacquer" <[email protected]> wrote:

> From: "David Gacquer" <[email protected]>
> To: [email protected]
> Sent: Wednesday, January 27, 2010 7:18:47 AM GMT -08:00 US/Canada Pacific
> Subject: [Genome] Questions about the dbSNP130 table
>
> Hello,
> 
> I have used the UCSC table browser to download the complete snp130 table
> as a tab-separated text file. Since I am only interested in single
> nucleotide substitutions, where one nucleotide is replaced by another
> one, I selected from this file only the lines for which the 'class'
> field is equal to 'single'. But when I take a look at the resulting
> subset of entries, I realize that there are some lines that should not
> appear in the filtered file.
> 
> Here is a copy/paste of some lines for which I need additional 
> explanations :
> 
> #bin    chrom    chromStart    chromEnd    name    score    strand    refNCBI    refUCSC    observed    molType    class    valid    avHet    avHetSE    func    locType    weight
> 212    chr1    146668316    146677849    rs2137935    0    +    ( 9533bp insertion )    ( 9533bp insertion )    C/G    genomic    single    unknown    0    0    unknown    range    3
> 585    chr1    126656    126673    rs72497839    0    -    GCTCGGGCTGACCTCTC    GCTCGGGCTGACCTCTC    A/C    genomic    single    unknown    0    0    unknown    range    1
> 585    chr1    92822    92822    rs4317776    0    -    -    -    A/C    genomic    single    unknown    0    0    unknown    between    3
> 586    chr1    155165    155165    rs1974329    0    -    -    -    G/T    genomic    single    unknown    0    0    unknown    between    3
> 586    chr1    148894    148895    rs4111311    0    -    C    C    G/T    genomic    single    unknown    0    0    unknown    rangeDeletion    3
> 
> For the first two lines, I think I get the point: since a group of
> nucleotides is replaced by a single one, then the entry is given the
> class 'single' but the 'locType' field is set to 'range', because it is
> a range of nucleotides which is actually replaced by a single one. So
> the first two lines should be correct.
>
> However for the third and fourth lines, I do not understand why the
> class is 'single' since apparently they are insertions and should have
> the 'insertion' class in the database.
>
> And finally, for the fifth line, I do not understand why the 'locType'
> field is 'rangeDeletion' since apparently it is a single nucleotide
> substitution and the value of 'locType' should be 'exact'.
> 
> Are there minor mistakes in the snp130 table or did I miss something 
> about the classification of the entries ?
> 
> And consequently, if I want to extract only single nucleotide
> substitutions, where a single nucleotide is replaced by another single
> nucleotide, should I select entries for which the 'class' and 'locType'
> fields are respectively equal to 'single' and 'exact'? Or is there a
> possibility that undesired entries can pass this filter?
> 
> Thank you for reading me
> 
> Best regards
> 
> David
> 
> -- 
> David Gacquer, Ph. D.
> 
> IRIBHM - Universite Libre de Bruxelles
> Bldg C, room C.4.117
> ULB, Campus Erasme, CP602
> 808 route de Lennik
> B-1070 Brussels
> Belgium
> 
> Phone: +32-2-555 4187
> Fax: +32-2-555 4655
> E-mail: dgacquer at ulb.ac.be 
> 
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome