Hello, I have a question about the annotation/alignment tables of dbSNP.
It appears to me that for a large number of SNPs (basically large insertions and deletions) the annotation data provided both in the BCP files and in the ASN1/XML reports give incorrect positions on contigs.

Take rs10704866 as an example. A simple genomic search (done with UCSC's BLAT, but BLAST would give the same result) easily places the flanking sequences at xxxx-112793533 (3') and 112794911-xxxxx (5') on chr1, while the reportedly deleted sequence matches perfectly in the region in between. That is, the real SNP position is the range between 112793533 and 112794910 of chr1 (build reference 36_3), where 1379 bases of the genomic sequence are (possibly) deleted.

The database reports the alignment as follows (here I show the raw table line from SNPContigLoc, but the ASN1 and XML reports are consistent with it):

rs 10704866 14 8901073 8901073 399 401 8901072 8901074 2 112794911 129 2007-12-03 15:25:06.0 0 0 C 1.0 1 0 1379

Where:

  asn_from = 8901073
  asn_to   = 8901073
  lf_ngbr  = 399
  rf_ngbr  = 401
  lc_ngbr  = 8901072
  rc_ngbr  = 8901074
  loc_type = 2 = (as per LocTypeCode.bcp.gz) trueSNP, described as "Contig allele is one base long.snp is always represented as one base and this one base in the snp sequence is substituted with exactly one base on the contig."
  .....
  num_del  = 0
  num_ins  = 1379

Besides the strange terminology in defining deletions versus insertions, the problem is that this is *not* a "true SNP": it is a large deletion, and the alignment correctly reports 1379 bases deleted (num_ins). While rc_ngbr is correct (8901074 on NT_019273 maps to position 112794912 on chr1, which is the first non-deleted base), it seems to me that lc_ngbr is not.

For large deletions, shouldn't the lf_ngbr, rf_ngbr pair identify the deleted range, and the locType be "range"? How can I retrieve the deleted range (other than by redoing the alignment)?
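The mismatch in the row above can also be checked mechanically. What follows is only a minimal Python sketch, not an official dbSNP tool. It assumes the raw line is whitespace-separated with the fields in the positions implied by the values listed above, and that num_del and num_ins are the last two columns. The helper names (parse_row, annotated_gap, looks_inconsistent) are made up for illustration.

```python
# Raw SNPContigLoc line for rs10704866, exactly as quoted in this message.
RAW = ("rs 10704866 14 8901073 8901073 399 401 8901072 8901074 2 "
       "112794911 129 2007-12-03 15:25:06.0 0 0 C 1.0 1 0 1379")

TRUE_SNP = 2  # loc_type code for "trueSNP", as quoted from LocTypeCode.bcp.gz


def parse_row(raw):
    """Extract the fields discussed above (column positions are an assumption)."""
    tok = raw.split()
    return {
        "snp_id": int(tok[1]),
        "asn_from": int(tok[3]),
        "asn_to": int(tok[4]),
        "lf_ngbr": int(tok[5]),
        "rf_ngbr": int(tok[6]),
        "lc_ngbr": int(tok[7]),
        "rc_ngbr": int(tok[8]),
        "loc_type": int(tok[9]),
        "num_del": int(tok[-2]),
        "num_ins": int(tok[-1]),
    }


def annotated_gap(row):
    """Number of contig bases strictly between the two contig neighbours."""
    return row["rc_ngbr"] - row["lc_ngbr"] - 1


def looks_inconsistent(row):
    """Flag rows typed as a one-base trueSNP that also report a multi-base indel."""
    indel = max(row["num_del"], row["num_ins"])
    return row["loc_type"] == TRUE_SNP and indel > 1


row = parse_row(RAW)
print(row["snp_id"], annotated_gap(row), row["num_ins"], looks_inconsistent(row))
```

For this row the annotated gap between lc_ngbr and rc_ngbr is a single base, while num_ins reports 1379 bases, so the row is flagged. Running the same check over the whole table would give a rough count of affected entries.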
I note that this issue (whether it is a systematic error affecting essentially all the large deletions in the database, or an error on my part in interpreting the documentation) has also affected the annotation tables of the UCSC genome browser, which is why I have put the genome browser maintainers in cc: searching for rs10704866 on the UCSC genome browser returns Position: chr1:112794912-112794912 and Genomic Size: 1.

As said, the issue affects all the large deletions (and likely other types of polymorphisms as well). In most cases looking at the alignment of the flanking sequences is enough to understand the situation, but this makes the data unsuitable for automated analysis.

Thank you for any support you can provide.

My best regards,
Andrea Cocito

Site visited: http://www.ncbi.nlm.nih.gov/SNP/snp_db_table_description.cgi?t=SNPContigLoc

--------------------------
Andrea Cocito
[email protected]

IEO -- European Institute of Oncology
Department of Experimental Oncology
Via Ripamonti 435
20141 Milano - Italy
tel: +39-02-94375075
fax: +39-02-57489851

IFOM -- FIRC Institute of Molecular Oncology
Via Adamello 16
20139 Milano - Italy
tel: +39-02-574303853
fax: +39-02-574303231
