Hi Brooke,

Thank you very much for the information!

According to the data I downloaded from ftp://ftp.ncbi.nlm.nih.gov/hapmap, 4,163,790 RefSNPs were investigated in HapMap Phase II and III. However, in the snp132Common statistics there are only 2M. So about 2M HapMap SNPs are filtered out by the "unique mapping" and "1% frequency" criteria, which seems like too many. My calculation shows that about 80% of the 4M HapMap SNPs have a frequency >1%. I didn't check unique mappings, though.

I'm glad you also spotted this discrepancy.

Best,
Xiang

-----Original Message-----
From: Brooke Rhead [mailto:[email protected]]
Sent: Monday, April 25, 2011 4:06 PM
To: Xiang Li
Cc: [email protected]
Subject: Re: [Genome] Too few HapMap SNPs

Hi Xiang,

The snp132Common track is a subset of snp132, so it makes sense that there are fewer HapMap SNPs there. (snp132Common contains uniquely mapped variants that have frequency info and appear in at least 1% of the population -- see our announcement about the 4 new SNP tracks here: http://genome.ucsc.edu/goldenPath/newsarch.html#041811.2 .) However, the snp132 table only contains about 3.17 million SNPs listed as 'by-hapmap' in the valid field, which seems low. One of our engineers is looking into this further.

Regarding the validation codes, we don't have a more elaborate explanation for you. We suggest contacting dbSNP directly at [email protected]. They might be able to point you to better documentation.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

On 04/22/11 17:23, Xiang Li wrote:
> Hi
>
> I downloaded the table snp132Common from the track Common SNPs(132).
>
> Quick statistics are shown below:
>
> #type of validation    count
> by-1000genomes         4698898
> by-2hit-2allele        1064881
> by-cluster             3412491
> by-frequency           2847474
> by-hapmap              717332
> by-submitter           138079
> unknown                54311
>
> I saw only 717332 from HapMap, while on the HapMap FTP site
> (ftp://ftp.ncbi.nlm.nih.gov/hapmap), I saw over 4 million SNPs.
>
> Why is there such a huge difference?
> Thanks
>
> Also, where can I find a more detailed README regarding those
> validation types, so that I can have a better idea of how to assess
> each type? Currently, I can only assume HapMap and 1000Genomes are more
> reliable than the others.
>
> * Validation
>   <http://www.ncbi.nlm.nih.gov/SNP/snp_legend.cgi?legend=validation>:
>   Method used to validate the variant (each variant may be validated by
>   more than one method)
>
>   * By Frequency - at least one submitted SNP in cluster has
>     frequency data submitted
>   * By Cluster - cluster has at least 2 submissions, with at
>     least one submission assayed with a non-computational method
>   * By Submitter - at least one submitter SNP in cluster was
>     validated by independent assay
>   * By 2 Hit/2 Allele - all alleles have been observed in at
>     least 2 chromosomes
>   * By HapMap - submitted by HapMap
>     <http://hapmap.ncbi.nlm.nih.gov/> project (human only)
>   * By 1000Genomes - submitted by 1000Genomes
>     <http://1000genomes.org/> project (human only)
>   * Unknown - no validation has been reported for this
>     variant
>
> Thanks
>
> Sean
>
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
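
For anyone who wants to reproduce per-type counts like the ones quoted above, here is a minimal Python sketch. It assumes a tab-separated dump of the table in which the valid field holds a comma-separated list of codes; the function name count_validation_codes and the default column index (12, zero-based) are assumptions for illustration and should be checked against the table schema before use. Because each variant may be validated by more than one method, the per-code totals can add up to more than the number of rows.

```python
from collections import Counter


def count_validation_codes(lines, valid_col=12, sep="\t"):
    """Tally validation codes from tab-separated table rows.

    valid_col is an ASSUMED zero-based position of the valid field;
    verify it against the schema of the table you downloaded.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split(sep)
        # Skip rows too short to contain the valid field.
        if len(fields) <= valid_col:
            continue
        # One variant may carry several codes, e.g. "by-hapmap,by-frequency".
        for code in fields[valid_col].split(","):
            code = code.strip()
            if code:
                counts[code] += 1
    return counts
```

To run it on a downloaded dump, pass an open file handle (for a gzipped file, one opened in text mode with gzip.open(path, "rt")) and print the resulting counter sorted by code.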
