> Dec 1, 2011, at 12:02 PM, Donna Karolchik wrote: - when things are broken on  
> A brief email to [email protected]  

Ok, I am looking at the description page for human genes on genome-test, 
section Protein Domain and Structure Information. Hooray, it appears that the 
crummy modBase is being supplemented by display of experimental PDB files 
(rather than dubious or fragmentary modBase structural predictions) and then 
supplemented by links to two outside tools that predict the significance to 
function of each non-synonymous SNP.

I would not recommend featuring an LS-SNP link. It is just not ready for prime 
time and development looks abandoned to autopilot refreshes. The measure of 
conservation used is not up to literature standards -- no tree topology, just 
percent occurrence from a mashup of uncurated orthologs, paralogs, third-rate 
gnomon models and unannotated pseudogenes while omitting the readily available 
genomic data (by restricting itself to the pathetic NCBI nr). The tiny, static 
display imagery is 15 years behind the technology; the reliance on bulk aa 
properties is an approach from the 1950s. We can't start linking to sites just 
because someone previously worked at UCSC -- that starts a death spiral.

To see what serious structural evaluation of a disease SNP really entails, see 
the 4 paragraphs of analysis of L80F of DHFR beginning with "To explore the 
mechanism of loss..." in  http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3035707/. 
I don't see how we could possibly offer this level of sophistication. Chimera 
is a much better start but it too does not provide it. Crystallographers have a 
great many powerful graphics and computational tools but as a genome browser, I 
don't think we can go there.

Note SNP evaluation is a very different topic from protein domain and 
structure. Because of its central importance to visitors, it should be given 
its own section in the details page. Only a few of the many competing SNP 
evaluation tools utilize 3D structure (which is seldom available, exceedingly 
complex with nmr ensembles, and not always informative).  

Here UCSC already provides a phenomenal tool for SNP evaluation (the mis-named 
proteinFasta link on this same page) but you aren't utilizing it. 

I would guess at most a few percent of the 20,000 human genes have structural 
experimental data derived from physical human protein. More typical is a 
partial match to another species with varying levels of homology and alignable 
fraction. In most cases, all significant Blastp matches at PDB are to non-human 
entries, and even then often just to a sub-domain. These are nonetheless 
exceedingly useful in evaluating a given residue change. I am really sceptical 
that ab initio structure calculations like modBase have anything to offer to 
the average user for SNP interpretation at this time. 

So it would be useful to provide a little Blastp output table with the first 
3-4 matches, the species source, percent identity, and region of alignment (and 
best of all, the alignment itself). This PDB Blastp should logically originate 
over at GeneSorter (which already carries blastp alignments but to human 
proteome rather than PDB) and be copied over. Note GeneSorter carries the 
rs5656532 SNPs but links to useful but highly verbose NCBI description tables 
rather than Chimera. 
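
A rough sketch of how such a table could be cut down from Blastp output. It 
assumes the hits arrive in the standard 12-column tabular format (BLAST+ 
-outfmt 6: qseqid sseqid pident length mismatch gapopen qstart qend sstart send 
evalue bitscore), best hit first. The names Hit, top_hits and format_row are 
made up for illustration, and the two sample rows are placeholders, not real 
PDB matches. Species source is not one of the default columns, so it would have 
to come from somewhere else.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    subject: str         # sseqid column
    pct_identity: float  # pident column
    q_start: int         # qstart column: first aligned query residue
    q_end: int           # qend column: last aligned query residue


def top_hits(tabular_text: str, n: int = 4) -> List[Hit]:
    """Keep the first n hits from 12-column tabular Blastp output."""
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hits.append(Hit(f[1], float(f[2]), int(f[6]), int(f[7])))
    return hits[:n]


def format_row(hit: Hit) -> str:
    """One line of the little table: subject, percent identity, query region."""
    return f"{hit.subject:<8}{hit.pct_identity:5.1f}%  {hit.q_start}-{hit.q_end}"


# Placeholder rows only (hypothetical query and subject names).
SAMPLE = "\n".join("\t".join(row) for row in [
    ("query", "hitA", "92.50", "120", "9", "0", "25", "144", "1", "120", "1e-70", "250"),
    ("query", "hitB", "61.00", "100", "39", "0", "30", "129", "5", "104", "2e-35", "130"),
])

for hit in top_hits(SAMPLE, n=4):
    print(format_row(hit))
```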

The description page SNP section should also link to NCBI as it has the 
mission-critical human population frequency data -- the very first item lab 
people want. Both GeneSorter and description page need to always provide the 
substitution at the amino acid level, eg T67D. No one finds any meaning in the 
rs4567890 numbers -- it is merely an indexing field. 

A pity that they did not build something meaningful into the rs names. Now that 
actually is an opportunity for UCSC: to append something useful (like C to G, 
arg to trp at position 43 in the ref seq, at 3% frequency at 1000 HGP): 
.004303CGRW.rs4567890. Then the visitor can see at a glance what it is without 
a tedious chase. And the SNPs for a given gene sort nicely by position, 
frequency and type.
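
To make the naming idea concrete, here is a minimal sketch. The field widths 
are my reading of the example above and are an assumption: a four-digit 
zero-padded residue position, a two-digit whole-percent frequency, the two 
nucleotides, the two one-letter amino acids, then the rs number. The names Snp, 
label and short_name are hypothetical, and the second SNP is invented purely to 
show the sort order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    rs_id: str     # indexing field, e.g. "rs4567890"
    aa_pos: int    # residue position in the reference protein
    freq_pct: int  # population frequency in whole percent (assumed 0-99)
    ref_nt: str
    alt_nt: str
    ref_aa: str    # one-letter code
    alt_aa: str    # one-letter code


def label(snp: Snp) -> str:
    """Prefix the rs number with position, frequency and the two changes."""
    return (f".{snp.aa_pos:04d}{snp.freq_pct:02d}"
            f"{snp.ref_nt}{snp.alt_nt}{snp.ref_aa}{snp.alt_aa}.{snp.rs_id}")


def short_name(snp: Snp) -> str:
    """Amino acid level notation of the T67D kind."""
    return f"{snp.ref_aa}{snp.aa_pos}{snp.alt_aa}"


example = Snp("rs4567890", 43, 3, "C", "G", "R", "W")
other = Snp("rs0000001", 7, 12, "A", "T", "K", "N")  # invented for the demo

print(label(example))       # .004303CGRW.rs4567890
print(short_name(example))  # R43W
# Zero padding makes a plain string sort order the labels by position.
print(sorted(label(s) for s in (example, other)))
```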

On a similar note, to avoid tedious manual edits, both the details page and 
GeneSorter should de-emphasize the unmemorable gene indexing number. A whole 
lot of pages on the browser have lost track of the gene name and are just 
providing the in-house gene identifier, eg

>uc003bll.1 (MIOX) length=285  should read >MIOX length=285 uc003bll.1    
MKVTVGPDPSLVYRPDVDPEVAK...                 MKVTVGPDPSLVYRPDVDPEVAK...
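
The header change above is mechanical. A small sketch, assuming headers always 
have the shape ">accession (SYMBOL) rest" as in the MIOX example; the function 
name symbol_first is made up, and anything that does not match is passed 
through unchanged.

```python
import re

# ">uc003bll.1 (MIOX) length=285" -> accession, gene symbol, rest of line
_HEADER = re.compile(r"^>(?P<acc>\S+)\s+\((?P<symbol>[^)]+)\)\s*(?P<rest>.*)$")


def symbol_first(header: str) -> str:
    """Move the gene symbol to the front and the accession to the end."""
    header = header.rstrip()
    m = _HEADER.match(header)
    if m is None:
        return header  # unrecognised layout: leave it alone
    parts = [">" + m.group("symbol")]
    rest = m.group("rest").strip()
    if rest:
        parts.append(rest)
    parts.append(m.group("acc"))
    return " ".join(parts)


print(symbol_first(">uc003bll.1 (MIOX) length=285"))  # >MIOX length=285 uc003bll.1
```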

Thus for universal SNP evaluation, it is better to begin with comparative 
genomics conservation data since that is always available for every SNP 
regardless of protein size or membrane association (eg GPCR receptors) and is 
exploding in quality from next gen sequencing (both number of species and 
population frequencies in human), unlike xray/nmr which are many decades away 
from coverage. UCSC is already precomputing this for other reasons so it is not 
an additional compute burden.

On the description page, you could display a teaser for the conservation 
environment of known SNPs to a fixed phylogenetic depth and point to the fuller 
proteinFasta with the SNPs displayed in a line above human. proteinFasta needs 
a differential display mode (shown below) and some rudimentary statistics 
gathering (saveable floating popup over each aa). Here everything you need is 
already available in an existing table. In the example (which I truncated to 
Euarchonta), you can already see that L80F is going to wreck the protein. 
Nature already experimented with F at this position and found it didn't work. 
For personal genomics, it doesn't really matter why ... the 3D structural 
explanation (available with a lot of hard extra work) is just frosting on the 
cake. 

........... ....F ....  ......................
DHFR_homSap LSREL KEPP  Homo  sapiens  (human)
DHFR_panTro ..... ....  Pan  troglodytes  (chimp)
DHFR_gorGor ..... ....  Gorilla  gorilla  (gorilla)
DHFR_ponAbe V.... .Q..  Pongo  abelii  (orangutan)
DHFR_nomLeu V.... ....  Nomascus  leucogenys  (gibbon)
DHFR_macMul ..... .Q..  Macaca  mulatta  (rhesus)
DHFR_papAnu I.... ....  Papio  anubis  (baboon)
DHFR_calJac ...D. ....  Callithrix  jacchus  (marmoset)
DHFR_tarSyr ..... .V..  Tarsius  syrichta  (tarsier)
DHFR_otogar ..K.. ..S.  Otolemur  garnettii  (bushbaby)
DHFR_micMur ..K.. ..S.  Microcebus  murinus  (lemur)
DHFR_tupBel ..K.. ....  Tupaia  belangeri  (treeshrew)
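
The differential display in the table above takes only a few lines to produce. 
A sketch, assuming the rows are already aligned to equal length; the function 
names are hypothetical, and the four sequences are the table rows with the dots 
expanded back to residues and the spacing removed.

```python
from collections import Counter
from typing import Dict, List


def differential(reference: str, other: str, same: str = ".") -> str:
    """Show only residues that differ from the reference; dots elsewhere."""
    if len(reference) != len(other):
        raise ValueError("aligned sequences must have equal length")
    return "".join(same if a == b else b for a, b in zip(reference, other))


def column_counts(rows: List[str], column: int) -> Dict[str, int]:
    """Rudimentary statistics: residue tallies at one alignment column."""
    return dict(Counter(row[column] for row in rows))


aligned = {
    "DHFR_homSap": "LSRELKEPP",
    "DHFR_ponAbe": "VSRELKQPP",
    "DHFR_calJac": "LSRDLKEPP",
    "DHFR_otogar": "LSKELKESP",
}
human = aligned["DHFR_homSap"]

for name, seq in aligned.items():
    shown = seq if name == "DHFR_homSap" else differential(human, seq)
    print(name, shown)

# Column 4 (0-based) is the L that the SNP line marks with F.
print(column_counts(list(aligned.values()), 4))  # {'L': 4}
```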


I finally found a gene where the Chimera link worked but my computer initially 
had no idea how to open the chimerax file type. Buried in the help, it explains 
about the 85 meg download. I foresee all sorts of platform problems with this. 
For example, they are already dropping support for non-Intel Macs. This is 
making UCSC too dependent on an outside party over which we have no control. It 
is giving a lot of weight to one of many offsite tools for evaluating SNPs. We 
need to focus on providing the basics.

It is better to use in-house resources so UCSC retains control of quality, 
availability, and updating than to send visitors off-site to third-party links. 
proteinFasta does a far better job on certain aspects of SNP evaluation than 
LS-SNP and many others (more extensive comparative genomics, correct 
phylogenetic tree, regular updating for new genomes). But mainly, we should 
provide all the easy things before sending people off on difficult things 
(molecular dynamics calculation of perturbed structure).

Here is the first thing the visitor sees after installation when they open a 
UCSC link to Chimera -- yuch:

[screenshot not preserved]

Now Chimera happens to do a very nice job of displaying a larger interactive 
image in which the visitor can twist and rotate the image with their mouse. 
However, many other free tools do this as well. It persists in displaying the 
rs51456431 notation for SNPs whereas visitors want V23L notation. I didn't see 
how to enter a SNP of my own as in RasMol or SwissModel. A powerful tool for 
sure but also a way of life. 

Back on the description page, it is enough to list just 1-2 PDB entries -- 
the visitor does not need 39 structures for one small protein (eg PRNP entry). 
Nmr and xray structural determination often begins by clipping the protein down 
to something smaller (which can drop the SNP), so the numbering (eg 25-124 of 
1-237) should be shown. If it is not from human, the species should be 
specified and the percent identity.
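
Showing the numbering also lets the page say outright whether the structure 
still contains the SNP. A tiny sketch using the 25-124 of 1-237 example; the 
function name and the two residue positions are hypothetical.

```python
def construct_summary(start: int, end: int, full_length: int, snp_pos: int) -> str:
    """Describe the clipped construct and whether it covers a given residue."""
    status = "covered" if start <= snp_pos <= end else "not in structure"
    return f"{start}-{end} of 1-{full_length}; residue {snp_pos} {status}"


print(construct_summary(25, 124, 237, 80))   # 25-124 of 1-237; residue 80 covered
print(construct_summary(25, 124, 237, 200))  # 25-124 of 1-237; residue 200 not in structure
```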

While there is value in simple visual localization of the SNP within the 
secondary and tertiary and especially domain structure, these little thumbnails 
do not provide a working environment. I don't see any particular value in three 
small views from top, back, side. One is enough; it is just a toy. It would be 
more useful to have one graphic, a depiction (with SNP locator) of secondary 
structure (helix, sheet, coil and so forth), and a simple picture of domain 
structure. To get fancy, mousing over an rs would locate it in the latter two 
pictures.


Some links to Chimera are broken. Example: 
http://genome-test.cse.ucsc.edu/trash/lssnp/1E1U_genome_test_182e_d08f70.chimerax.
 

Oddly, the public browser still shows modBase for PRNP whereas it has switched 
to Chimera/LS-SNP for the gene LDRL. So I guess things are in flux.

_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
