Re: [Genome] feedback on description page: Protein Domain and Structure Information

Vanessa Kirkup Swing Tue, 06 Dec 2011 10:53:39 -0800

Hi Thomas,

Thank you for your suggestions. We will add them to our list of things to
consider.

Vanessa Kirkup Swing
UCSC Genome Bioinformatics Group

---------- Forwarded message ----------
From: thomas pringle <[email protected]>
Date: Tue, Dec 6, 2011 at 8:03 AM
Subject: [Genome] feedback on description page: Protein Domain and
Structure Information
To: [email protected]
Cc: Donna Karolchik <[email protected]>

> Dec 1, 2011, at 12:02 PM, Donna Karolchik wrote: - when things are broken
on  A brief email to [email protected]

Ok, I am looking at the description page for human genes on genome-test,
section Protein Domain and Structure Information. Hooray, it appears that
the crummy modBase is being supplemented by display of experimental PDB
files (rather than dubious or fragmentary modBase structural predictions)
and then supplemented by links to two outside tools that predict the
significance to function of each non-synonymous SNP.

I would not recommend featuring a LS-SNP link. It is just not ready for
prime time and development looks abandoned to autopilot refreshes. The
measure of conservation used is not up to literature standards -- no tree
topology, just percent occurrence from a mashup of uncurated orthologs,
paralogs, 3rd rate gnomon models and unannotated pseudogenes while omitting
the readily available genomic data (by restricting itself to the pathetic
NCBI nr). The tiny, static display imagery is 15 years behind the
technology; using bulk aa properties is an approach from the 1950s. We can't
start linking to sites just because someone previously worked at UCSC --
that starts a death spiral.

To see what serious structural evaluation of a disease SNP really entails,
see the 4 paragraphs of analysis of L80F of DHFR beginning with "To explore
the mechanism of loss..." in
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3035707/. I don't see how we
could possibly offer this level of sophistication. Chimera is a much better
start but it too does not provide it. Crystallographers have a great many
powerful graphics and computational tools but as a genome browser, I don't
think we can go there.

Note SNP evaluation is a very different topic from protein domain and
structure. Because of its central importance to visitors, it should be
given its own section in the details page. Only a few of the many competing
SNP evaluation tools utilize 3D structure (which is seldom available,
exceedingly complex with nmr ensembles, and not always informative).

Here UCSC already provides a phenomenal tool for SNP evaluation (the
mis-named proteinFasta link on this same page) but you aren't utilizing it.

I would guess at most a few percent of the 20,000 human genes have structural
experimental data derived from physical human protein. More typical is a
partial match to another species with varying levels of homology and
alignable fraction. In most cases, all significant Blastp matches at PDB
are to non-human entries, and then often just to a sub-domain. These are
nonetheless exceedingly useful in evaluating a given residue change. I am
really sceptical that ab initio structure calculations like modBase have
anything to offer to the average user for SNP interpretation at this time.

So it would be useful to provide a little Blastp output table with the
first 3-4 matches, the species source, percent identity, and region of
alignment (and ideally the alignment itself). This PDB Blastp should logically
originate over at GeneSorter (which already carries blastp alignments but
to human proteome rather than PDB) and be copied over. Note GeneSorter carries
the rs5656532 SNPs but links to useful but highly verbose NCBI description
tables rather than Chimera.

The description page SNP section should also link to NCBI as it has the
mission-critical human population frequency data -- the very first item lab
people want. Both GeneSorter and description page need to always provide
the substitution at the amino acid level, eg T67D. No one finds any meaning
in the rs4567890 numbers -- it is merely an indexing field.

A pity that they did not build something meaningful into the rs names. Now
that actually is an opportunity for UCSC, to append something useful (like
C to G, arg to trp at position 43 in the ref seq at 3% frequency at 1000
HGP): .004303CGRW.rs4567890. Then the visitor can see at a glance what it
is without a tedious chase. And the SNPs for a given gene sort nicely by
position, frequency and type.
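
A minimal sketch of how such a label could be composed. The layout is only
inferred from the single example .004303CGRW.rs4567890 (4-digit residue
position, 2-digit percent frequency, the two nucleotides, the two one-letter
amino acids); the field widths and the function name are assumptions, not an
existing UCSC or dbSNP format.

```python
def annotated_rs_name(rs_id, position, percent_freq, ref_nt, alt_nt, ref_aa, alt_aa):
    """Build a sortable SNP label such as .004303CGRW.rs4567890.

    Assumed layout (inferred from the one example in the text):
    '.' + 4-digit position + 2-digit percent frequency
        + ref/alt nucleotide + ref/alt amino acid + '.' + rs id
    """
    if not 0 <= position <= 9999:
        raise ValueError("position does not fit the assumed 4-digit field")
    if not 0 <= percent_freq <= 99:
        raise ValueError("frequency does not fit the assumed 2-digit field")
    return f".{position:04d}{percent_freq:02d}{ref_nt}{alt_nt}{ref_aa}{alt_aa}.{rs_id}"


# The example from the text: C to G, arg (R) to trp (W), position 43, 3%.
label = annotated_rs_name("rs4567890", 43, 3, "C", "G", "R", "W")
print(label)

# Hypothetical second SNP, for illustration only: zero-padded positions
# make a plain string sort follow residue order along the protein.
labels = sorted([
    annotated_rs_name("rs0000001", 120, 1, "A", "T", "K", "M"),
    label,
])
```

With the position in front, an ordinary alphabetical sort of the labels for
one gene already orders the SNPs along the protein.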

On a similar note, to avoid tedious manual edits, both the details page and
GeneSorter should de-emphasize the unmemorable gene indexing number. A
whole lot of pages on the browser have lost track of the gene name and are
just providing the in-house gene identifier, eg

>uc003bll.1 (MIOX) length=285  should read >MIOX length=285 uc003bll.1
MKVTVGPDPSLVYRPDVDPEVAK...                 MKVTVGPDPSLVYRPDVDPEVAK...
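
A minimal sketch of that header rewrite, assuming headers always follow the
pattern shown above (identifier, gene symbol in parentheses, then the
remaining fields). The function name and the regular expression are
illustrative, not part of any browser code.

```python
import re

# Matches headers of the form shown above: >uc003bll.1 (MIOX) length=285
HEADER = re.compile(r"^>(?P<ucsc_id>\S+)\s+\((?P<symbol>[^)]+)\)\s+(?P<rest>.*)$")


def symbol_first(header):
    """Move the gene symbol to the front and the in-house id to the end."""
    m = HEADER.match(header)
    if m is None:
        return header  # leave headers that do not fit the pattern untouched
    return f">{m['symbol']} {m['rest'].rstrip()} {m['ucsc_id']}"


print(symbol_first(">uc003bll.1 (MIOX) length=285"))
```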

Thus for universal SNP evaluation, it is better to begin with comparative
genomics conservation data since that is always available for every SNP
regardless of protein size or membrane association (eg GPCR receptors) and
is exploding in quality from next gen sequencing (both number of species
and population frequencies in human), unlike xray/nmr which are many
decades away from coverage. UCSC is already precomputing this for other
reasons so it is not an additional compute burden.

On the description page, you could display a teaser for the conservation
environment of known SNPs to a fixed phylogenetic depth and point to the fuller
proteinFasta with the SNPs displayed in a line above human. proteinFasta
needs a differential display mode (shown below) and some rudimentary
statistics gathering (saveable floating popup over each aa). Here
everything you need is already available in an existing table. In the
example (which I truncated to Euarchonta), you can already see that L80F is
going to wreck the protein. Nature already experimented with F at this
position and found it didn't work. For personal genomics, it doesn't really
matter why ... the 3D structural explanation (available with a lot of hard
extra work) is just frosting on the cake.

........... ....F ....  ......................
DHFR_homSap LSREL KEPP  Homo  sapiens  (human)
DHFR_panTro ..... ....  Pan  troglodytes  (chimp)
DHFR_gorGor ..... ....  Gorilla  gorilla  (gorilla)
DHFR_ponAbe V.... .Q..  Pongo  abelii  (orangutan)
DHFR_nomLeu V.... ....  Nomascus  leucogenys  (gibbon)
DHFR_macMul ..... .Q..  Macaca  mulatta  (rhesus)
DHFR_papAnu I.... ....  Papio  anubis  (baboon)
DHFR_calJac ...D. ....  Callithrix  jacchus  (marmoset)
DHFR_tarSyr ..... .V..  Tarsius  syrichta  (tarsier)
DHFR_otogar ..K.. ..S.  Otolemur  garnettii  (bushbaby)
DHFR_micMur ..K.. ..S.  Microcebus  murinus  (lemur)
DHFR_tupBel ..K.. ....  Tupaia  belangeri  (treeshrew)
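
A minimal sketch of the differential display mode, assuming the sequences are
already aligned to equal length. The function name is hypothetical; the
orangutan and marmoset residues are simply read back from the table above
(with the column gap removed).

```python
def differential_display(reference, others):
    """Return rows where residues identical to the reference become '.'.

    reference: (name, aligned_sequence)
    others: list of (name, aligned_sequence), same length as the reference
    """
    ref_name, ref_seq = reference
    rows = [(ref_name, ref_seq)]
    for name, seq in others:
        if len(seq) != len(ref_seq):
            raise ValueError(f"{name} is not aligned to the reference length")
        # Keep only the residues that differ from the reference.
        masked = "".join("." if a == b else a for a, b in zip(seq, ref_seq))
        rows.append((name, masked))
    return rows


rows = differential_display(
    ("DHFR_homSap", "LSRELKEPP"),
    [("DHFR_ponAbe", "VSRELKQPP"), ("DHFR_calJac", "LSRDLKEPP")],
)
for name, seq in rows:
    print(name, seq)
```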

I finally found a gene where the Chimera link worked but my computer
initially had no idea how to open the chimerax file type. Buried in the
help, it explains about the 85 meg download.  I foresee all sorts of
platform problems with this. For example they are already dropping support
for non-Intel Macs. This is making UCSC too dependent on an outside party
over which we have no control. It is giving a lot of weight to one of many
offsite tools for evaluating SNPs. We need to focus on providing the basics.

It is better to use in-house resources so UCSC retains control of quality,
availability, and updating than to send visitors off-site to third-party
links. proteinFasta does a far better job on certain aspects of SNP
evaluation than LS-SNP and many others (more extensive comparative
genomics, correct phylogenetic tree, regular updating for new genomes). But
mainly, we should provide all the easy things before sending people off on
difficult things (molecular dynamics calculation of perturbed structure).

Here is the first thing the visitor sees after installation when they open
a UCSC link to Chimera -- yuch:

Now Chimera happens to do a very nice job of displaying a larger interactive
image in which the visitor can twist and rotate the image with their mouse.
However many other free tools do this as well.  It persists in displaying
the rs51456431 notation for SNPs whereas visitors want V23L notation. I
didn't see how to enter a SNP of my own as in RasMol or SwissModel. A
powerful tool for sure but also a way of life.

Back on the description page, it is enough to list just 1-2 PDB
entries -- the visitor does not need 39 structures for one small protein
(eg PRNP entry). Nmr and xray structural determination often begins by
clipping the protein down to something smaller (which can drop the SNP), so
the numbering (eg 25-124 of 1-237) should be shown. If it is not from
human, the species and the percent identity should be specified.
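
A minimal sketch of why the numbering matters, assuming substitutions are
written in the V23L style and the structure covers one contiguous residue
range. The function names are hypothetical.

```python
import re

# One-letter amino acid, residue number, one-letter amino acid (eg V23L).
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def residue_number(substitution):
    """Extract the residue position from a label such as V23L."""
    m = SUBSTITUTION.match(substitution)
    if m is None:
        raise ValueError(f"not a substitution label: {substitution!r}")
    return int(m.group(2))


def covered_by_structure(substitution, first, last):
    """True if the substituted residue lies inside the solved fragment."""
    return first <= residue_number(substitution) <= last


# A fragment spanning residues 25-124 of a 1-237 protein drops V23L
# but still contains T67D.
print(covered_by_structure("V23L", 25, 124))
print(covered_by_structure("T67D", 25, 124))
```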

While there is value in simple visual localization of the SNP within the
secondary and tertiary and especially domain structure, these little
thumbnails do not provide a working environment. I don't see any particular
value in three small views from top, back, side. One is enough, it is just
a toy. It would be more useful to have one graphic,  a depiction (with SNP
locator) of secondary structure (helix, sheet, coil and so forth), and a
simple picture of domain structure. To get fancy, mousing over an rs would
locate it in the latter two pictures.

Some links to Chimera are broken. Example:
http://genome-test.cse.ucsc.edu/trash/lssnp/1E1U_genome_test_182e_d08f70.chimerax
.

Oddly, the public browser still shows modBase for PRNP whereas it has
switched to Chimera/LS-SNP for the gene LDRL. So I guess things are in
flux.

_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome