I read through the 324 abstracts Robert K sent for the 2011 ASHG meeting. As 
usual, a vast number of papers identified a coding change in this or that gene 
as the explanation for a disease condition, sequencing entire family genomes 
only to throw all the gigabases away at the end (except for that 1 bp they 
wanted).

One option for evaluating a non-synonymous SNP: if the gene product has a 3D 
structure available from X-ray crystallography or NMR, the variant can be 
stubbed in, the structural effect evaluated, and the substitution characterized 
as bad or neutral (good is rare and much harder to prove).

The problem here is the 'if'. As far as I know, nobody explicitly tracks how 
many of the 20,000 human genes have an experimentally determined structure, or 
graphs how fast proteomics initiatives are moving us toward 100% each year, for 
purposes of SNP evaluation. I would estimate an 80-100 year time frame on this 
the way things are going.

This is because human proteins are large (ave 450 aa) relative to a typical 
structure determination (ave 260 aa over the 1,000 most recent PDB additions), 
and 20% or so are not soluble and may never be amenable to structural study.

Below I took 50 genes at random and ran each one through blastp against human 
and non-human PDB entries:

-- 18% had a determined structure for the human query itself; however, these 
were seldom full length, resulting in 68% coverage on average

-- these are somewhat over-weighted on disease genes (adjusted for incidence) 
but many structures have been determined for small enzymes of low clinical 
interest.

-- another 30% had a determined structure for a human paralog; however, these 
had both low coverage and low identity, so only 18% of SNP sites are covered 
when the aligned residue is required to be identical.

-- very few situations were improved using structures from homologous proteins 
in other species.

-- overall, the chance of a given SNP having any kind of structural coverage of 
the original amino acid was 17%. If there is then a 25% chance of the 
enveloping patch providing reliable structural evaluation of the SNP, this 
option works out roughly 4-5% of the time (after adjusting for the much shorter 
protein length in the on-target structures). Ab initio calculations of 
structure are not at a point where they can affect SNP evaluation statistics.
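
The arithmetic behind that last estimate can be written out as a quick sketch. 
The 17% is the figure from the table at the end of this message and the 25% is 
the hypothetical patch-reliability figure from the text; the variable names 
are mine, and no adjustment for protein length is applied here.

```python
# Back-of-envelope estimate of how often a 3D structure can evaluate an nsSNP.
# Both inputs are taken from the text; the names are illustrative only.
p_residue_covered = 0.17  # SNP site has some structural coverage (from the table)
p_patch_reliable = 0.25   # assumed chance the surrounding patch is reliable

# Treat the two conditions as independent and multiply.
p_usable = p_residue_covered * p_patch_reliable
print(f"structure-based evaluation usable for about {p_usable:.2%} of nsSNPs")
```

The unadjusted product sits at the low end of the 4-5% range quoted above.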

In summary, 3D structure is a nice tool in the toolbox for evaluating nsSNPs 
but it is rarely applicable, whereas massive comparative genomics + human 
variation frequencies along the protein are universally available. The latter 
data, though available already, will be a done deal in 2-3 years. Finally, 
these data are better suited to computerized evaluation without subjective 
human intervention, and so to personal genomic medicine.

For the genome browser, while it is fine to provide rs9898090-type links to 
Chimera and LS-SNP on the details page, it is more useful to the visitor if we 
supplement the existing 46-way comparative genomics with the human variation 
frequencies (which can be shown on the same display, above the human line, 
using logo-style letter sizes). So a top priority is to extract the naturally 
occurring amino acid variant frequencies from the many new studies above. The 
1000 Genomes Project may be the only realistic source for that. It is 
complicated by ethnic group differences, so it needs stratification.
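
As a rough illustration of what stratified variant frequencies could look 
like, here is a minimal sketch. The input format (population, protein 
position, observed amino acid), the function name, and the sample data are all 
made up for the example; real 1000 Genomes data would first need to be 
translated from genotype calls to amino acids.

```python
from collections import Counter, defaultdict

def stratified_frequencies(observations):
    """Tally amino acid frequencies per site, separately for each population.

    observations: iterable of (population, position, amino_acid) tuples.
    Returns {population: {position: {amino_acid: frequency}}}.
    """
    counts = defaultdict(lambda: defaultdict(Counter))
    for population, position, amino_acid in observations:
        counts[population][position][amino_acid] += 1

    freqs = {}
    for population, sites in counts.items():
        freqs[population] = {}
        for position, tally in sites.items():
            total = sum(tally.values())
            freqs[population][position] = {
                aa: n / total for aa, n in tally.items()
            }
    return freqs

# Hypothetical observations at site 12 in two made-up populations.
obs = [("popA", 12, "R")] * 3 + [("popA", 12, "Q")] + [("popB", 12, "R")] * 4
freqs = stratified_frequencies(obs)
print(freqs["popA"][12])
print(freqs["popB"][12])
```

Keeping the populations separate until display time means a variant that is 
common in one group is not diluted into apparent rarity by pooling.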

With these data, we could precompute the dysfunctional effect of all 19 
possible aa substitutions at every site in the 9,000,000 aa proteome, and for 
other key species as well.
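
To give a sense of scale, the enumeration itself is simple; the scoring is the 
hard part and is left out here. This is only a sketch: the function name and 
the toy peptide are made up, and the 9,000,000 figure is the proteome size 
quoted above.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def all_substitutions(protein):
    """Yield (position, original, replacement) for every single-residue change."""
    for position, original in enumerate(protein, start=1):
        for replacement in AMINO_ACIDS:
            if replacement != original:
                yield position, original, replacement

toy_peptide = "MKTAY"  # made-up 5-residue sequence
subs = list(all_substitutions(toy_peptide))
print(len(subs))  # 5 sites x 19 alternatives = 95

# Same count scaled to the whole proteome.
proteome_sites = 9_000_000
print(proteome_sites * 19)  # 171000000 substitutions to score
```

At roughly 171 million entries for human, a precomputed table is well within 
reach of ordinary storage.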



gene      coverage    id  cov*id (aa)  length (aa)
TYMP          100%  100%          482          482
TBC1D2         63%  100%          326          517
PPARA          59%  100%          276          468
MAPK11        100%   99%          360          364
MAPK12        100%   99%          363          367
ARSA           96%   99%          484          509
SCO2           63%   99%          166          266
BRD1           12%   99%          126         1058
CHKB           98%   98%          379          395
MIOX           87%   93%          231          285
PIM3           96%   70%          219          326
SBF1           40%   66%          500         1893
PLXNB2         63%   50%          579         1838
FBLN1          53%   45%          168          703
MAPK8IP         7%   42%           23          797
ACR            59%   41%          102          421
HDAC10         63%   38%          160          669
CELSR1         28%   38%          321         3014
RABL2B         70%   36%           58          229
SHANK3         12%   34%           71         1747
CPT1B          75%   32%          185          772
KLHDC7B        47%   30%           84          594
MOV10L1        36%   30%          131         1211
TUBGCP6        12%   28%           61         1819
ADM2             0     0            0          148
LMF2             0     0            0          707
NCAPH2           0     0            0          606
ODF3B            0     0            0          253
C22orf41         0     0            0           88
PPP6R2           0     0            0          959
FAM116B          0     0            0          585
SELO             0     0            0          669
TRABD            0     0            0          376
PANX2            0     0            0          677
MLC1             0     0            0          377
IL17REL          0     0            0          336
CRELD2           0     0            0          402
ALG12            0     0            0          488
ZBED4            0     0            0         1171
FAM19A5          0     0            0          132
CERK             0     0            0          537
GRAMD4           0     0            0          578
TRMU             0     0            0          421
GTSE1            0     0            0          739
TTC38            0     0            0          469
PKDREJ           0     0            0         2253
RP4-695          0     0            0          219
WNT7B            0     0            0          349
ATXN10           0     0            0          475
RIBC2            0     0            0          377
ave            29%   31%          117          702
chance of a SNP having any kind of coverage:  17%  
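
For anyone who wants to check the 17%, it appears to be the total of the 
cov*id column divided by the total protein length. Reading cov*id as coverage 
x identity x length in residues is my interpretation, though it matches the 
rows. The sketch below recomputes it from the table values.

```python
# (gene, covered-and-identical residues, protein length in aa) for the 24
# genes with a PDB hit, copied from the table above.
HITS = [
    ("TYMP", 482, 482), ("TBC1D2", 326, 517), ("PPARA", 276, 468),
    ("MAPK11", 360, 364), ("MAPK12", 363, 367), ("ARSA", 484, 509),
    ("SCO2", 166, 266), ("BRD1", 126, 1058), ("CHKB", 379, 395),
    ("MIOX", 231, 285), ("PIM3", 219, 326), ("SBF1", 500, 1893),
    ("PLXNB2", 579, 1838), ("FBLN1", 168, 703), ("MAPK8IP", 23, 797),
    ("ACR", 102, 421), ("HDAC10", 160, 669), ("CELSR1", 321, 3014),
    ("RABL2B", 58, 229), ("SHANK3", 71, 1747), ("CPT1B", 185, 772),
    ("KLHDC7B", 84, 594), ("MOV10L1", 131, 1211), ("TUBGCP6", 61, 1819),
]

# Lengths of the 26 genes with no hit; they add length but no covered residues.
NO_HIT_LENGTHS = [
    148, 707, 606, 253, 88, 959, 585, 669, 376, 677, 377, 336, 402,
    488, 1171, 132, 537, 578, 421, 739, 469, 2253, 219, 349, 475, 377,
]

n_genes = len(HITS) + len(NO_HIT_LENGTHS)
covered = sum(cov_id for _, cov_id, _ in HITS)
total_length = sum(length for _, _, length in HITS) + sum(NO_HIT_LENGTHS)

# Chance that a randomly placed SNP lands on a structurally covered,
# identical residue, weighting every residue in the sample equally.
fraction = covered / total_length
print(f"{n_genes} genes, {covered} of {total_length} residues covered "
      f"({fraction:.0%})")
```

Weighting by residue rather than by gene is what a randomly placed SNP would 
see, since long proteins collect more SNPs.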

_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
