Hello,

I have a question about the annotation/alignment tables of dbSNP.

It appears to me that for a large number of SNPs (essentially the large
insertions and deletions) the annotation data provided in both the BCP
files and the ASN1/XML reports give incorrect positions on the
contigs.

Take for example rs10704866. A simple genomic search (done with UCSC's
BLAT, but BLAST would give the same result) easily places the flanking
sequences at xxxx-112793533 (3') and 112794911-xxxxx (5') on chr1,
while the reportedly deleted sequence matches perfectly in the region
in between. That is, the real SNP position is the range between
112793533 and 112794910 on chr1 (reference build 36_3), where 1379
bases of the genomic sequence are (possibly) deleted.

The database reports the alignment as follows (here I show the raw table
line from SNPContigLoc, but the ASN1 and XML reports are consistent with
this):
rs      10704866        14      8901073 8901073 399     401     8901072 8901074 2       112794911       129     2007-12-03 15:25:06.0   0       0       C               1.0     1       0       1379

Where:
asn_from = 8901073
asn_to = 8901073
lf_ngbr = 399
rf_ngbr = 401
lc_ngbr = 8901072
rc_ngbr = 8901074
loc_type = 2 = (as per LocTypeCode.bcp.gz) trueSNP, described as  
"Contig allele is one base long.snp is always represented as one base  
and this one base in the snp sequence is substituted with exactly one  
base on the contig."
.....
num_del = 0
num_ins = 1379
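
Rows of this kind could be screened automatically with a few lines of Python. This is only a sketch: it assumes whitespace-separated fields, takes the field positions from the example row above (asn_from through loc_type at positions 3 to 9, num_del and num_ins as the last two fields), and the function names are made up for illustration, not part of any dbSNP tool.

```python
# Example row from SNPContigLoc, as quoted above (whitespace-separated).
ROW = ("rs 10704866 14 8901073 8901073 399 401 8901072 8901074 "
       "2 112794911 129 2007-12-03 15:25:06.0 0 0 C 1.0 1 0 1379")

# Positions counted from the start of the row (assumption: taken from
# the example row, not from an official schema).
LEADING_FIELDS = {
    "asn_from": 3, "asn_to": 4,
    "lf_ngbr": 5, "rf_ngbr": 6,
    "lc_ngbr": 7, "rc_ngbr": 8,
    "loc_type": 9,
}

def parse_row(line):
    """Extract only the fields named in this message from one row."""
    fields = line.split()
    row = {name: int(fields[i]) for name, i in LEADING_FIELDS.items()}
    # num_del and num_ins are read from the end of the row, so empty
    # columns or the space inside the timestamp do not shift them.
    row["num_del"] = int(fields[-2])
    row["num_ins"] = int(fields[-1])
    return row

def looks_inconsistent(row):
    """True if the row claims loc_type 2 (trueSNP) yet reports indel bases."""
    return row["loc_type"] == 2 and (row["num_del"] > 0 or row["num_ins"] > 0)

row = parse_row(ROW)
print(row["loc_type"], row["num_del"], row["num_ins"], looks_inconsistent(row))
```

Run over the whole table, a check like this would list the candidates that need their flanking sequences realigned.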

Besides the strange terminology in defining deletions versus
insertions, the problem is that this is *not* a "true SNP": it is a large
deletion, and the alignment correctly reports 1379 bases deleted
(num_ins). While rc_ngbr is correct (8901074 on NT_019273 maps to
position 112794912 on chr1, which is the first non-deleted base), it
seems to me that lc_ngbr is not.
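
To make the lc_ngbr discrepancy concrete, here is a rough Python sketch of the coordinate arithmetic. It assumes the contig lies on chr1 in forward orientation with a constant offset, and derives that offset from the single pair of positions quoted above; the function names are illustrative only.

```python
# Offset implied by the mapping quoted above:
# contig position 8901074 on NT_019273 <-> chr1 position 112794912.
# Assumption: forward orientation and a constant offset along the contig.
CONTIG_TO_CHR_OFFSET = 112794912 - 8901074

def contig_to_chr(pos):
    """Convert a contig coordinate to a chr1 coordinate."""
    return pos + CONTIG_TO_CHR_OFFSET

def chr_to_contig(pos):
    """Convert a chr1 coordinate to a contig coordinate."""
    return pos - CONTIG_TO_CHR_OFFSET

# Where the reported lc_ngbr falls on chr1.
print(contig_to_chr(8901072))

# Contig coordinate of the end of the other flank as placed by BLAT.
print(chr_to_contig(112793533))
```

Under these assumptions the reported lc_ngbr falls at chr1:112794910, immediately next to rc_ngbr, whereas the flank boundary found by BLAT (112793533) corresponds to a contig position more than a thousand bases away.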

For large deletions, shouldn't the lf_ngbr, rf_ngbr pair identify the
deleted range, and the locType be "range"?

How can I retrieve the deleted range (other than by redoing the alignment)?

I note that this issue (whether it is a systematic error affecting
essentially all the large deletions in the database or an error in
interpreting the documentation) has also affected the annotation
tables of the UCSC Genome Browser, which is why I have cc'd the
Genome Browser maintainers: searching for rs10704866 in the UCSC
Genome Browser returns it as Position:
chr1:112794912-112794912 and Genomic Size: 1.

As I said, the issue affects all the large deletions (and likely
other types of polymorphisms as well). In most cases looking at the
alignment of the flanking sequences is enough to understand the
situation, but this makes the data unsuitable for automated analysis.

Thank you for any support you can provide,

my best regards,

Andrea Cocito.

Site visited: 
http://www.ncbi.nlm.nih.gov/SNP/snp_db_table_description.cgi?t=SNPContigLoc 





--------------------------
Andrea Cocito
[email protected]

IEO -- European Institute of Oncology
Department of Experimental Oncology
Via Ripamonti 435
20141 Milano - Italy
tel: +39-02-94375075
fax: +39-02-57489851

IFOM -- FIRC Institute of Molecular Oncology
Via Adamello 16
20139 Milano - Italy
tel: +39-02-574303853
fax: +39-02-574303231




_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
