Hi Thomas,

Thank you for your suggestions. We will add them to our list of things to consider.
Vanessa Kirkup Swing
UCSC Genome Bioinformatics Group

---------- Forwarded message ----------
From: thomas pringle <[email protected]>
Date: Tue, Dec 6, 2011 at 8:03 AM
Subject: [Genome] feedback on description page: Protein Domain and Structure Information
To: [email protected]
Cc: Donna Karolchik <[email protected]>

> Dec 1, 2011, at 12:02 PM, Donna Karolchik wrote:
> - when things are broken on A brief email to [email protected]

Ok, I am looking at the description page for human genes on genome-test, section Protein Domain and Structure Information. Hooray, it appears that the crummy modBase is being supplemented by display of experimental PDB files (rather than dubious or fragmentary modBase structural predictions) and then supplemented by links to two outside tools that predict the significance to function of each non-synonymous SNP.

I would not recommend featuring an LS-SNP link. It is just not ready for prime time and development looks abandoned to autopilot refreshes. The measure of conservation used is not up to literature standards -- no tree topology, just percent occurrence from a mashup of uncurated orthologs, paralogs, 3rd rate gnomon models and unannotated pseudogenes while omitting the readily available genomic data (by restricting itself to the pathetic NCBI nr). The tiny, static display imagery is 15 years behind the technology; the bulk aa properties approach dates from the 1950s. We can't start linking to sites just because someone previously worked at UCSC -- that starts a death spiral.

To see what serious structural evaluation of a disease SNP really entails, see the 4 paragraphs of analysis of L80F of DHFR beginning with "To explore the mechanism of loss..." in http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3035707/. I don't see how we could possibly offer this level of sophistication. Chimera is a much better start but it too does not provide it.
Crystallographers have a great many powerful graphics and computational tools but as a genome browser, I don't think we can go there.

Note SNP evaluation is a very different topic from protein domain and structure. Because of its central importance to visitors, it should be given its own section in the details page. Only a few of the many competing SNP evaluation tools utilize 3D structure (which is seldom available, exceedingly complex with nmr ensembles, and not always informative). Here UCSC already provides a phenomenal tool for SNP evaluation (the mis-named proteinFasta link on this same page) but you aren't utilizing it.

I would guess at most a few percent of the 20,000 human genes have structural experimental data derived from physical human protein. More typical is a partial match to another species with varying levels of homology and alignable fraction. In most cases, all significant Blastp matches at PDB are to non-human entries, and then often just to a sub-domain. These are nonetheless exceedingly useful in evaluating a given residue change. I am really sceptical that ab initio structure calculations like modBase have anything to offer to the average user for SNP interpretation at this time.

So it would be useful to provide a little Blastp output table with the first 3-4 matches, the species source, percent identity, and region of alignment (and, best of all, the alignment itself). This PDB Blastp should logically originate over at GeneSorter (which already carries blastp alignments but to human proteome rather than PDB) and be copied over.

Note GeneSorter carries the rs5656532 SNPs but links to useful but highly verbose NCBI description tables rather than Chimera. The description page SNP section should also link to NCBI as it has the mission-critical human population frequency data -- the very first item lab people want. Both GeneSorter and description page need to always provide the substitution at the amino acid level, eg T67D.
No one finds any meaning in the rs4567890 numbers -- it is merely an indexing field. A pity that they did not build something meaningful into the rs names. Now that actually is an opportunity for UCSC, to append something useful (like C to G, arg to trp at position 43 in the ref seq at 3% frequency at 1000 HGPj): .004303CGRW.rs4567890. Then the visitor can see at a glance what it is without a tedious chase. And the SNPs for a given gene sort nicely by position, frequency and type.

On a similar note, to avoid tedious manual edits, both the details page and GeneSorter should de-emphasize the unmemorable gene indexing number. A whole lot of pages on the browser have lost track of the gene name and are just providing the in-house gene identifier, eg

>uc003bll.1 (MIOX) length=285
MKVTVGPDPSLVYRPDVDPEVAK...

should read

>MIOX length=285 uc003bll.1
MKVTVGPDPSLVYRPDVDPEVAK...

Thus for universal SNP evaluation, it is better to begin with comparative genomics conservation data since that is always available for every SNP regardless of protein size or membrane association (eg GPCR receptors) and is exploding in quality from next gen sequencing (both number of species and population frequencies in human), unlike xray/nmr which are many decades away from coverage. UCSC is already precomputing this for other reasons so it is not an additional compute burden.

On the description page, you could display a teaser for the conservation environment of known SNPs to a fixed phylogenetic depth and point to the fuller proteinFasta with the SNPs displayed in a line above human. proteinFasta needs a differential display mode (shown below) and some rudimentary statistics gathering (saveable floating popup over each aa). Here everything you need is already available in an existing table. In the example (which I truncated to Euarchonta), you can already see that L80F is going to wreck the protein. Nature already experimented with F at this position and found it didn't work.
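A minimal sketch of how the descriptive rs label suggested above could be composed. The field layout (a leading dot, four-digit position, two-digit percent frequency, reference/variant nucleotide, reference/variant amino acid, then the rs number) is only one reading of the single example .004303CGRW.rs4567890; the function name, the field widths, the cap at 99 percent, and the second rs number are all hypothetical.

```python
def snp_label(rs_id, aa_pos, freq_percent, ref_nt, alt_nt, ref_aa, alt_aa):
    """Compose a sortable label such as .004303CGRW.rs4567890.

    Assumed layout (inferred from one example, not a UCSC format):
    '.' + 4-digit position + 2-digit percent frequency
        + ref/alt nucleotide + ref/alt amino acid + '.' + rs id
    """
    if not 0 < aa_pos <= 9999:
        raise ValueError("position must fit in four digits")
    if not 0 <= freq_percent <= 100:
        raise ValueError("frequency must be a percentage between 0 and 100")
    # Two digits only, so anything that rounds to 100 is capped at 99 (assumption).
    freq = min(round(freq_percent), 99)
    return f".{aa_pos:04d}{freq:02d}{ref_nt}{alt_nt}{ref_aa}{alt_aa}.{rs_id}"


labels = [
    snp_label("rs4567890", 43, 3, "C", "G", "R", "W"),
    snp_label("rs0000002", 7, 12, "A", "T", "K", "M"),  # hypothetical SNP
]

# Zero-padded fields mean a plain string sort orders by position first.
for label in sorted(labels):
    print(label)
```

Because every field has a fixed width, an ordinary alphabetical sort of the labels groups the SNPs of a gene by position and then by frequency, which is the at-a-glance ordering described above.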
For personal genomics, it doesn't really matter why ... the 3D structural explanation (available with a lot of hard extra work) is just frosting on the cake.

...........  ....F ....  ......................
DHFR_homSap  LSREL KEPP  Homo sapiens (human)
DHFR_panTro  ..... ....  Pan troglodytes (chimp)
DHFR_gorGor  ..... ....  Gorilla gorilla (gorilla)
DHFR_ponAbe  V.... .Q..  Pongo abelii (orangutan)
DHFR_nomLeu  V.... ....  Nomascus leucogenys (gibbon)
DHFR_macMul  ..... .Q..  Macaca mulatta (rhesus)
DHFR_papAnu  I.... ....  Papio anubis (baboon)
DHFR_calJac  ...D. ....  Callithrix jacchus (marmoset)
DHFR_tarSyr  ..... .V..  Tarsius syrichta (tarsier)
DHFR_otogar  ..K.. ..S.  Otolemur garnettii (bushbaby)
DHFR_micMur  ..K.. ..S.  Microcebus murinus (lemur)
DHFR_tupBel  ..K.. ....  Tupaia belangeri (treeshrew)

I finally found a gene where the chimera link worked but my computer initially had no idea how to open the chimerax file type. Buried in the help, it explains about the 85 meg download. I foresee all sorts of platform problems with this. For example they are already dropping support for non-Intel Macs.

This is making UCSC too dependent on an outside party over which we have no control. It is giving a lot of weight to one of many offsite tools for evaluating SNPs. We need to focus on providing the basics. It is better to use in-house resources so UCSC retains control of quality, availability, and updating than to send visitors off-site to third-party links. proteinFasta does a far better job on certain aspects of SNP evaluation than LS-SNP and many others (more extensive comparative genomics, correct phylogenetic tree, regular updating for new genomes). But mainly, we should provide all the easy things before sending people off on difficult things (molecular dynamics calculation of perturbed structure).
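The differential display used in the DHFR example above (residues identical to human shown as dots) can be sketched in a few lines. This is an illustration only, not how proteinFasta is implemented; the function name is hypothetical, and the non-human sequences below are simply the rows of the example expanded back to full residues against the human row.

```python
def differential_row(reference, aligned):
    """Replace residues identical to the reference with '.'.

    Both rows must already be aligned to the same length. Whitespace used
    to separate blocks is kept as-is so the columns still line up.
    """
    if len(reference) != len(aligned):
        raise ValueError("rows must be aligned to the same length")
    return "".join(
        a if a != r or a.isspace() else "."
        for r, a in zip(reference, aligned)
    )


human = "LSREL KEPP"  # DHFR_homSap row from the example
others = {
    "DHFR_panTro": "LSREL KEPP",
    "DHFR_ponAbe": "VSREL KQPP",
    "DHFR_otogar": "LSKEL KESP",
}

print(f"{'DHFR_homSap':<12} {human}")
for name, seq in others.items():
    print(f"{name:<12} {differential_row(human, seq)}")
```

Only the positions that differ from human remain visible, so a column with a single substitution in one lineage stands out immediately.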
Here is the first thing the visitor sees after installation when they open a UCSC link to Chimera -- yuch:

Now Chimera happens to do a very nice job of displaying a larger interactive image in which the visitor can twist and rotate the image with their mouse. However many other free tools do this as well. It persists in displaying the rs51456431 notation for SNPs whereas visitors want V23L notation. I didn't see how to enter a SNP of my own as in RasMol or SwissModel. A powerful tool for sure but also a way of life.

Back on the description page, it is enough to list just 1-2 PDB entries -- the visitor does not need 39 structures for one small protein (eg PRNP entry). Nmr and xray structural determination often begins by clipping the protein down to something smaller (which can drop the SNP), so the numbering (eg 25-124 of 1-237) should be shown. If it is not from human, the species and the percent identity should be specified.

While there is value in simple visual localization of the SNP within the secondary and tertiary and especially domain structure, these little thumbnails do not provide a working environment. I don't see any particular value in three small views from top, back, side. One is enough, it is just a toy. It would be more useful to have one graphic, a depiction (with SNP locator) of secondary structure (helix, sheet, coil and so forth), and a simple picture of domain structure. To get fancy, mousing over an rs would locate it in the latter two pictures.

Some links to Chimera are broken. Example: http://genome-test.cse.ucsc.edu/trash/lssnp/1E1U_genome_test_182e_d08f70.chimerax . Oddly, the public browser still shows modBase for PRNP whereas it has switched to Chimera-LS-SNP for the gene LDRL. So I guess things are in flux.
_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
