I read through the 324 abstracts Robert K sent for the 2011 ASHG meeting. As usual, a vast number of papers found a coding change in this or that gene to be the explanation for a disease condition, sequencing entire family genomes only to throw all the gigabases away at the end (except for the 1 bp they wanted).
One option used for evaluating a non-synonymous SNP: if the gene has an available 3D structure from X-ray or NMR, the variant can be stubbed in, the structural effect evaluated, and the substitution characterized as bad or neutral (good is rare and much harder to prove).

The problem here is the 'if'. As far as I know, nobody explicitly tracks how many of the 20,000 human genes have an experimentally determined structure, nor graphs how fast the proteomics initiatives are moving us toward 100% each year, for purposes of SNP evaluation. I would estimate an 80-100 year time frame on this the way things are going. This is because human proteins are large (ave 450 aa) relative to a typical structural determination (ave 260 aa over the 1,000 most recent PDB additions), and 20% or so are not soluble and may never be amenable to structural study.

Below I took 50 genes at random and ran blastp on each individually against human and non-human PDB:

-- 18% had a determined structure for the human query; however, these were seldom full length, resulting in 68% coverage on average.
-- these are somewhat over-weighted on disease genes (adjusted for incidence), but many structures have been determined for small enzymes of low clinical interest.
-- another 30% had a determined structure for a human paralog; however, these had both low coverage and low id, resulting in 18% coverage of a given SNP requiring identity.
-- very few situations were improved using structures from homologous proteins in other species.
-- overall, the chance of a given SNP having any kind of structural coverage of the original amino acid was 17%, so if there were a 25% chance of the enveloping patch providing reliable structural evaluation of the SNP, this option works out roughly 4-5% of the time (after adjusting for much shorter protein length in the on-target structures).

Ab initio calculations of structure are not at a point where they can affect SNP evaluation statistics.
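The per-gene coverage and identity figures could be derived from blastp output along the following lines. This is only a minimal sketch, assuming BLAST's standard 12-column tabular output (`-outfmt 6`) and a separately supplied table of query lengths. The names `Hit` and `best_hits`, and the rule of keeping the hit with the most covered-and-identical residues, are illustrative assumptions and not a record of how the numbers in this post were actually produced.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Hit:
    subject: str               # identifier of the matched PDB chain
    pct_identity: float        # percent identity over the aligned region
    coverage: float            # fraction of the query spanned by the alignment
    covered_identical: float   # coverage * identity * query length, in residues


def best_hits(tabular: str, query_lengths: Dict[str, int]) -> Dict[str, Hit]:
    """Pick, per query, the hit with the most covered-and-identical residues.

    `tabular` is BLAST tabular text with the default columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best: Dict[str, Hit] = {}
    for line in tabular.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        pct_identity = float(fields[2])
        q_start, q_end = int(fields[6]), int(fields[7])
        q_len = query_lengths[query]
        coverage = (q_end - q_start + 1) / q_len
        hit = Hit(subject, pct_identity, coverage,
                  coverage * pct_identity / 100.0 * q_len)
        current = best.get(query)
        if current is None or hit.covered_identical > current.covered_identical:
            best[query] = hit
    return best
```

Genes with no hit at all simply do not appear in the result and would be counted as zero coverage.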
In summary, 3D structure is a nice tool in the toolbox for evaluating nsSNPs, but it is rarely applicable, whereas massive comparative genomics plus human variation frequencies along the protein is universally available. The latter data, though available already, will be a done deal in 2-3 years. Finally, it is better suited to computerized evaluation without subjective human intervention, and so to personal genomic medicine.

For the genome browser, while it is fine to provide rs9898090-type links to Chimera and LS-SNP on the details page, it is more useful to the visitor if we supplement the existing 46-way comparative genomics with the human variation frequencies (which can be shown on the same display, over the human line, using logo sizes). So the top priority is to extract the naturally occurring amino acid variant frequencies from the many new studies above. The 1000 Genomes Project may be the only realistic source for that. It is complicated by ethnic group differences, so the frequencies need stratification. With these data, we could precompute the dysfunctional effect of all 19 possible aa substitutions at every site in the 9,000,000 aa proteome, and for other key species as well.
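To make the precomputation idea concrete, here is a rough sketch of scoring the 19 alternative residues at a single site from the residues observed in an alignment column. The scoring rule (a log2 ratio of pseudocounted observations against the reference residue) and the names `substitution_scores` and `pseudocount` are assumptions chosen for illustration. The post does not specify a scoring scheme, and a real one would also weight species by divergence and stratify human frequencies by population.

```python
import math
from collections import Counter
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def substitution_scores(reference: str, column: str,
                        pseudocount: float = 1.0) -> Dict[str, float]:
    """Score each of the 19 alternatives to `reference` at one site.

    `column` holds the residues seen at this site (other species and/or
    human variants). More negative scores mean the alternative is seen
    less often than the reference, i.e. is less likely to be tolerated.
    """
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    ref_weight = counts[reference] + pseudocount
    return {
        alt: math.log2((counts[alt] + pseudocount) / ref_weight)
        for alt in AMINO_ACIDS
        if alt != reference
    }


# Hypothetical column: alanine in nine sequences, serine in one.
scores = substitution_scores("A", "AAAAAAAAAS")
for alt in sorted(scores, key=scores.get, reverse=True)[:3]:
    print(alt, round(scores[alt], 2))
```

At 19 alternatives per site, a 9,000,000 aa proteome comes to 171,000,000 scores, which is small enough to precompute and store once per species.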
gene      coverage    id  cov*id (aa)  length (aa)
TYMP          100%  100%          482          482
TBC1D2         63%  100%          326          517
PPARA          59%  100%          276          468
MAPK11        100%   99%          360          364
MAPK12        100%   99%          363          367
ARSA           96%   99%          484          509
SCO2           63%   99%          166          266
BRD1           12%   99%          126         1058
CHKB           98%   98%          379          395
MIOX           87%   93%          231          285
PIM3           96%   70%          219          326
SBF1           40%   66%          500         1893
PLXNB2         63%   50%          579         1838
FBLN1          53%   45%          168          703
MAPK8IP         7%   42%           23          797
ACR            59%   41%          102          421
HDAC10         63%   38%          160          669
CELSR1         28%   38%          321         3014
RABL2B         70%   36%           58          229
SHANK3         12%   34%           71         1747
CPT1B          75%   32%          185          772
KLHDC7B        47%   30%           84          594
MOV10L1        36%   30%          131         1211
TUBGCP6        12%   28%           61         1819
ADM2             0     0            0          148
LMF2             0     0            0          707
NCAPH2           0     0            0          606
ODF3B            0     0            0          253
C22orf41         0     0            0           88
PPP6R2           0     0            0          959
FAM116B          0     0            0          585
SELO             0     0            0          669
TRABD            0     0            0          376
PANX2            0     0            0          677
MLC1             0     0            0          377
IL17REL          0     0            0          336
CRELD2           0     0            0          402
ALG12            0     0            0          488
ZBED4            0     0            0         1171
FAM19A5          0     0            0          132
CERK             0     0            0          537
GRAMD4           0     0            0          578
TRMU             0     0            0          421
GTSE1            0     0            0          739
TTC38            0     0            0          469
PKDREJ           0     0            0         2253
RP4-695          0     0            0          219
WNT7B            0     0            0          349
ATXN10           0     0            0          475
RIBC2            0     0            0          377

ave            29%   31%          117          702

chance of a SNP having any kind of coverage: 17%
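The 17% figure can be checked directly from the table: it is the total of the cov*id column (residues that are both covered and identical) divided by the total protein length. The short script below is just a sketch of that arithmetic. The `gene:cov*id/length` packing and the names `TABLE` and `summarize` are my own shorthand, and the 25% factor is the hypothetical reliability figure from the text, applied before any adjustment for protein length.

```python
# Each token is gene:cov*id/length, copied from the table (both in residues).
TABLE = """
TYMP:482/482 TBC1D2:326/517 PPARA:276/468 MAPK11:360/364 MAPK12:363/367
ARSA:484/509 SCO2:166/266 BRD1:126/1058 CHKB:379/395 MIOX:231/285
PIM3:219/326 SBF1:500/1893 PLXNB2:579/1838 FBLN1:168/703 MAPK8IP:23/797
ACR:102/421 HDAC10:160/669 CELSR1:321/3014 RABL2B:58/229 SHANK3:71/1747
CPT1B:185/772 KLHDC7B:84/594 MOV10L1:131/1211 TUBGCP6:61/1819 ADM2:0/148
LMF2:0/707 NCAPH2:0/606 ODF3B:0/253 C22orf41:0/88 PPP6R2:0/959
FAM116B:0/585 SELO:0/669 TRABD:0/376 PANX2:0/677 MLC1:0/377
IL17REL:0/336 CRELD2:0/402 ALG12:0/488 ZBED4:0/1171 FAM19A5:0/132
CERK:0/537 GRAMD4:0/578 TRMU:0/421 GTSE1:0/739 TTC38:0/469
PKDREJ:0/2253 RP4-695:0/219 WNT7B:0/349 ATXN10:0/475 RIBC2:0/377
"""


def summarize(table: str):
    """Return (number of genes, total covered-identical residues, total length)."""
    genes = covered_total = length_total = 0
    for token in table.split():
        _gene, numbers = token.split(":")
        covered, length = (int(x) for x in numbers.split("/"))
        genes += 1
        covered_total += covered
        length_total += length
    return genes, covered_total, length_total


genes, covered, length = summarize(TABLE)
chance = covered / length      # chance that a random residue is covered
usable = chance * 0.25         # times the assumed 25% reliability from the text
print(f"{genes} genes, mean cov*id {covered / genes:.1f} aa, "
      f"mean length {length / genes:.1f} aa")
print(f"chance of coverage {chance:.1%}, usable for evaluation {usable:.1%}")
```

Weighting by length this way treats every residue in the 50 proteins as equally likely to carry the SNP, which is why the result differs from simply averaging the per-gene percentages.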
