Hi Tom,

Thank you for your input on the gene details pages. I have passed on
your recommendations to our engineers.

Vanessa Kirkup Swing
UCSC Genome Bioinformatics Group



---------- Forwarded message ----------
From: thomas pringle <[email protected]>
Date: Sun, Jan 22, 2012 at 9:03 AM
Subject: [Genome] 9 bugs on gene description page
To: [email protected]
Cc: David Haussler <[email protected]>, Donna Karolchik
<[email protected]>


I may have sent in some version of 7-9 previously, included here since
on the same gene details page.

1. The 'Orthologous Genes' table has lumped C. elegans and S. cerevisiae
into a single column. The code may just need another <td> in each row.
However, yesterday zebrafish and D. melanogaster were also lumped into
a single table cell, so more may be going on (see attached graphic).

2. The 'Orthologous Genes' table is really questionable when it comes
to fly, worm and yeast. Orthology has a long-standing fixed
definition. That definition is not best-reciprocal Blastp -- a sloppy
proxy that we use because of computational convenience. Orthology is
already difficult to establish computationally between human and mouse
(see below) and seldom reliable outside of mammals without extensive
manual curation.

With fly, worm and yeast, there is almost never syntenic support. As
with human, these species have experienced large numbers of gene gains
and losses; yeast had an old whole genome duplication. This gives rise
to sets of paralogs with highly variable rates of evolution and
dramatic changes in function, making them very difficult to compare.

The table should say 'best reciprocal Blastp' (BRBp for brevity) in
the table cells if that's all it is, not 'ortholog'. If there is
nothing to put in the cell, '---' is perhaps better than text clutter.
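As a rough illustration of what a 'BRBp' cell would mean operationally,
here is a minimal Python sketch of a reciprocal best hit test. The
function names, the dictionary layout, and the use of bit score as the
ranking criterion are all assumptions made for illustration; this is not
the code behind the table.

```python
def best_hit(hits):
    """Return the subject with the highest bit score, or None if there are no hits."""
    if not hits:
        return None
    return max(hits, key=hits.get)


def reciprocal_best(query, forward, reverse):
    """Return the reciprocal best hit of `query`, or None if there is none.

    forward: {query_protein: {subject_protein: bit_score}}  (species A vs B)
    reverse: {subject_protein: {query_protein: bit_score}}  (species B vs A)
    Both layouts are hypothetical, chosen only for this sketch.
    """
    subject = best_hit(forward.get(query, {}))
    if subject is None:
        return None
    back = best_hit(reverse.get(subject, {}))
    return subject if back == query else None


def table_cell(query, forward, reverse):
    """Label the cell as BRBp rather than 'ortholog'; use '---' when empty."""
    match = reciprocal_best(query, forward, reverse)
    return f"BRBp: {match}" if match else "---"
```

A cell is filled only when the best hit in each direction points back at
the other protein; anything else gets '---' rather than explanatory text.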

3. Also in the 'Orthologous Genes' table, it says "Orthologies between
human, mouse, and rat are computed by taking the best BLASTP hit..."
Here the target protein database has to be specified. For example, I'm
not seeing a rat ortholog entry for ERI3 despite 98% identity between
rat, mouse and human. It has been represented since 2008 at GenBank as
"AAI67080  unknown protein [Rattus norvegicus]".

Rat annotation at NCBI appears to be a total joke. Even slamdunks like
ERI3 have not been given a RefSeq in the 7 years since the genome was
released. Obviously NCBI has no intention of annotating rat. However
it is correctly annotated in the browser by RGD Genes 2. ERI3 is not
listed at rat GeneSorter, suggesting RefSeq was used there too but not
RGD Genes 2. Since there is no entry in the 'Orthologous Genes' table,
we seem to have used only RefSeq and not consulted RGD Genes 2 or
GenBank. That should be stated because otherwise visitors will assume
we looked around for rat gene annotation.

So here we are, providing visitors with an obsolete, misleading product
that will never be fixed by updates. I recommend either doing rat
right or dropping it from this table since its orthologs including
ERI3 are already done correctly at Protein Fasta.

What I am really writing about here is the slow accrual of legacy
crud. We have to be very careful about hosting orphaned products that
people have lost interest in. Automatic updating will accomplish
nothing here since NCBI is not updating rat RefSeq. There is no
trajectory converging on correctness.

4. Also in the 'Orthologous Genes' table, not ok: "Note that the
absence of an ortholog in the table below may reflect incomplete
annotations in the other species rather than a true absence of the
orthologous gene." The previous sentence just said we used reciprocal
best-Blastp. That suffices. If one of the species lacks a gene model,
reciprocal best-Blastp obviously could not have been conducted.

Worm, fly and yeast are small genomes that were exhaustively annotated
many years ago. It is impossible that any protein with significant
homology to human has been overlooked. That's because they repeatedly
do Blastx of their entire genome to all proteins in GenBank.
Meanwhile, human has been over-annotated: not just the genuine coding
genes but thousands of pseudogenes and junk transcripts. Human may
have lost a few hundred genes but these will still show up as good
worm/fly/yeast matches in the other vertebrates.

This sentence should be replaced with "The absence of ortholog or best
reciprocal Blastp entry in the table below may reflect a genuine lack
of candidates, multiple paralogous matches of indistinguishable low
quality, sub-threshold percent identity (<25% or whatever was used),
less than full-length matches (<80% or whatever was used) with
chimeric domain proteins, unannotated pseudogene debris, or gene loss
in the clade representative used." (An example of the latter would be
URAH, pseudogene debris in human but with good orthologs between
gorilla and mouse). These tables are only useful to visitors if they
know how they were made.
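To make the proposed wording concrete, the following Python sketch shows
how an empty cell could be tagged with the reason it is empty. The tuple
layout, the function name, and the tie margin are hypothetical; the 25%
and 80% defaults are only the placeholder figures quoted above ("or
whatever was used"), not the thresholds the browser actually applies.
Reasons that cannot be read off a hit list, such as pseudogene debris or
gene loss in the clade representative, are not covered.

```python
def empty_cell_reason(hits, min_identity=25.0, min_coverage=80.0, tie_margin=1.0):
    """Explain why no entry is reported, or return None if the top hit passes.

    hits: list of (subject, percent_identity, percent_query_coverage, bit_score)
    All thresholds are placeholders, not values taken from the browser.
    """
    if not hits:
        return "no candidates"
    # Rank candidates by bit score, best first.
    ranked = sorted(hits, key=lambda h: h[3], reverse=True)
    best = ranked[0]
    if best[1] < min_identity:
        return "sub-threshold percent identity"
    if best[2] < min_coverage:
        return "less than full-length match"
    # Two top candidates with nearly equal scores cannot be told apart.
    if len(ranked) > 1 and best[3] - ranked[1][3] <= tie_margin:
        return "multiple paralogous matches of indistinguishable quality"
    return None
```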

5. Zebrafish has to be treated differently from the others in the
'Orthologous Genes' table. First, it has no GeneSorter. One solution
is to drop it entirely: the browser annotation is awful; the genome
assembly has dragged on for a decade, never attaining high quality.
Protein divergence to human is generally high; lineage-specific gene
family expansion is rampant; syntenic retention is rare. It is
redundant in this table because we already have a whole genome
alignment best-guess at Protein Fasta. And everything is a model
organism today.

The biggest problem is that whole genome duplication makes a
meaningful choice of ortholog systematically impractical. What visitors
want here, in view of the whole genome duplication, is whether both
copies were retained. If only one copy was retained, orthologous
correspondence to human is clear. If both copies were retained, the
correspondence becomes very murky (co-orthology).  We need to double
up on the zebrafish column. Best Blastp does not work in this
situation. What we are doing now -- picking one, not mentioning the
second -- is a disservice to the visitor.

6. The worst bug of all is in the 'Orthologous Genes' table. We did
not really filter out non-syntenic hits in mouse and rat. "Filtering
out of non-syntenic hits" should be changed to what we actually did
here operationally. Visitors cannot use our material without an
explanation of methods.

I provide below a counter-example, PRDM9, that proves syntenic
filtering was not done (see attached graphic). The attached mouse and
human browser screenshots show human and mouse 'PRDM9' are not
remotely syntenic. The flanking genes do not correspond. This is
actually the mouse ortholog of human PRDM7.

Synteny ('same thread') refers to conserved adjacency of potentially
rearrangeable genes or other features on a chromosome. The minimum
unit is two genes. There is a great risk of cross-matching paralogs
and larger segmental duplications.

Synteny does not refer to parts of a single gene such as exons,
introns, and promoter regions because these are not units that can be
routinely shuffled by chromosomal rearrangements with retention of
function.

Synteny does not refer to best whole-genome-alignment of two species,
restricted to a single gene and its internal nucleotides. That is
called best-Blastn.

I'm guessing the procedure used took the comparative genomics track in
the human browser, i.e. the one that shows the mouse chrs, then
intersected it with the mouse gene table. That won't work; the algorithm
is faulty. There have to be multiple genes in the contiguous patch of
syntenic chr, each normally best-blastp.
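The multi-gene requirement can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: gene order is given as
a plain list per chromosome, best_blastp is a hypothetical human-to-mouse
mapping of best-blastp partners, and the window size and minimum number
of shared neighbours are arbitrary defaults.

```python
def flanking(genes, name, window=2):
    """Return up to `window` neighbours on each side of `name` in an ordered gene list."""
    i = genes.index(name)
    return genes[max(0, i - window):i] + genes[i + 1:i + 1 + window]


def has_syntenic_support(human_gene, mouse_gene, human_order, mouse_order,
                         best_blastp, window=2, min_shared=1):
    """True if at least `min_shared` human neighbours have their best-blastp
    partner among the mouse neighbours of the candidate.

    Neighbours are compared as a set, so a local inversion does not matter.
    """
    human_flank = flanking(human_order, human_gene, window)
    mouse_flank = set(flanking(mouse_order, mouse_gene, window))
    shared = [g for g in human_flank if best_blastp.get(g) in mouse_flank]
    return len(shared) >= min_shared
```

With min_shared=1 the candidate plus one corresponding neighbour makes up
the two-gene minimum; a stricter filter would raise that value. A pair
whose flanking genes do not correspond, as in the PRDM9 screenshots,
would fail this test.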

7. When a visitor has landed by whatever route on a gene description
page, a natural thing to do next is visit GeneSorter (banner menu).
Here GeneSorter should default to the gene on the description page. It
does not, it defaults only to the GeneSorter gateway page. The visitor
then has to go back, scrape off the gene name, forward to the gateway
page, move the mouse to the text box, paste the gene name, hit return.
This is inconsistent with our overall 'smart' interface that carries
database fields along with page clicks and inserts them appropriately.

In the 'Orthologous Genes' table on this same page, clicking on
GeneSorter in the mouse ortholog column already does the right thing.
It's just human that is broken.

8. Protein Fasta desperately needs to be renamed. Its current name
does not describe what it is. Visitors do not have time to explore
cryptic links. This is causing one of the most important pages on the
entire browser to be greatly under-utilized. Please change to Aligned
Orthologs or another appropriately descriptive name (check with Brian
Rainey).

9. When a visitor has landed on a gene description page, another
natural thing to do next is visit Protein Fasta. Here the browser
should display the gene name as well as the uc index number. For
example, "Human Gene RHOT1 (uc002hgw.3) Description and Page Index"
should go over to "Protein Alignments for Human Gene RHOT1
(uc002hgw.3)" on the Protein Fasta page. Right now, it goes over to
just "Protein Alignments for knownGene uc002hgw.3".

Here the visitor harvests the fasta sequences, then has to go back a
window, scrape off the gene name, search and replace the uc name with
the gene name in the harvested sequences which should have emphasized
the gene name to begin with. (The uc name is just a redundant in-house
indexing system; gene names mean something to biomedical researchers.)
The current set-up is again inconsistent with our overall 'smart'
interface.
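For reference, the manual search-and-replace step described above amounts
to something like the following Python sketch. The header layout (a '>'
line containing the uc identifier) is an assumption about the harvested
output, and the function name is hypothetical.

```python
def rename_headers(fasta_text, uc_id, gene_name):
    """Replace the uc identifier with the gene name on FASTA header lines only."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Touch only the first occurrence on the header line;
            # sequence lines are passed through unchanged.
            line = line.replace(uc_id, gene_name, 1)
        out.append(line)
    return "\n".join(out)
```

For example, rename_headers(text, "uc002hgw.3", "RHOT1") would relabel
every header in the harvested block in one pass.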






_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
