Hi Tom,

Thank you for your input on the gene details pages. I have passed on your recommendations to our engineers.

Vanessa Kirkup Swing
UCSC Genome Bioinformatics Group

---------- Forwarded message ----------
From: thomas pringle <[email protected]>
Date: Sun, Jan 22, 2012 at 9:03 AM
Subject: [Genome] 9 bugs on gene description page
To: [email protected]
Cc: David Haussler <[email protected]>, Donna Karolchik <[email protected]>

I may have sent in some version of 7-9 previously; they are included here since they are on the same gene details page.

1. The 'Orthologous Genes' table has lumped C. elegans and S. cerevisiae into a single column. The code may just need another <td> in each row. However, yesterday zebrafish and D. melanogaster were also lumped into a single table cell, so more may be going on (see attached graphic).

2. The 'Orthologous Genes' table is really questionable when it comes to fly, worm and yeast. Orthology has a long-standing fixed definition. That definition is not best-reciprocal Blastp -- a sloppy proxy that we use because of computational convenience. Orthology is already difficult to establish computationally between human and mouse (see below) and is seldom reliable outside of mammals without extensive manual curation. With fly, worm and yeast, there is almost never syntenic support. As with human, these species have experienced large numbers of gene gains and losses; yeast had an old whole genome duplication. This gives rise to sets of paralogs with highly variable rates of evolution and dramatic changes in function, making them very difficult to compare. The table should say 'best reciprocal Blastp' (BRBp for brevity) in the table cells if that's all it is, not 'ortholog'. If there is nothing to put in the cell, '---' is perhaps better than text clutter.

3. Also in the 'Orthologous Genes' table, it says "Orthologies between human, mouse, and rat are computed by taking the best BLASTP hit..." Here the target protein database has to be specified. For example, I'm not seeing a rat ortholog entry for ERI3 despite 98% identity between rat, mouse and human.
It has been represented since 2008 at GenBank as "AAI67080 unknown protein [Rattus norvegicus]". Rat annotation at NCBI appears to be a total joke. Even slam dunks like ERI3 have not been given a RefSeq in the 7 years since the genome was released. Obviously NCBI has no intention of annotating rat. However, it is correctly annotated in the browser by RGD Genes 2. ERI3 is not listed at rat GeneSorter, suggesting RefSeq was used there too but not RGD Genes 2. Since there is no entry in the 'Orthologous Genes' table, we seem to have used only RefSeq and not consulted RGD Genes 2 or GenBank. That should be stated, because otherwise visitors will assume we looked around for rat gene annotation. So here we are, providing visitors with an obsolete, misleading product that will never be fixed by updates. I recommend either doing rat right or dropping it from this table, since its orthologs, including ERI3, are already done correctly at Protein Fasta.

What I am really writing about here is the slow accrual of legacy crud. We have to be very careful about hosting orphaned products that people have lost interest in. Automatic updating will accomplish nothing here since NCBI is not updating rat RefSeq. There is no trajectory converging on correctness.

4. Also in the 'Orthologous Genes' table, not ok: "Note that the absence of an ortholog in the table below may reflect incomplete annotations in the other species rather than a true absence of the orthologous gene." The previous sentence just said we used reciprocal best-Blastp. That suffices. If one of the species lacks a gene model, reciprocal best-Blastp obviously could not have been conducted. Worm, fly and yeast are small genomes that were exhaustively annotated many years ago. It is impossible that any protein with significant homology to human has been overlooked. That's because they repeatedly do Blastx of their entire genome to all proteins in GenBank.
Meanwhile, human has been over-annotated: not just the genuine coding genes but thousands of pseudogenes and junk transcripts. Human may have lost a few hundred genes, but these will still show up as good worm/fly/yeast matches in the other vertebrates. This sentence should be replaced with "The absence of an ortholog or best reciprocal Blastp entry in the table below may reflect a genuine lack of candidates, multiple paralogous matches of indistinguishable low quality, sub-threshold percent identity (<25% or whatever was used), less than full-length matches (<80% or whatever was used) with chimeric domain proteins, unannotated pseudogene debris, or gene loss in the clade representative used." (An example of the latter would be URAH, pseudogene debris in human but with good orthologs between gorilla and mouse.) These tables are only useful to visitors if they know how they were made.

5. Zebrafish has to be treated differently from the others in the 'Orthologous Genes' table. First, it has no GeneSorter. One solution is to drop it entirely: the browser annotation is awful; the genome assembly has dragged on for a decade, never attaining high quality. Protein divergence to human is generally high; lineage-specific gene family expansion is rampant; syntenic retention is rare. It is redundant in this table because we already have a whole genome alignment best-guess at Protein Fasta. And everything is a model organism today.

The biggest problem is that whole genome duplication makes a meaningful choice of ortholog systemically impractical. What visitors want here, in view of the whole genome duplication, is whether both copies were retained. If only one copy was retained, orthologous correspondence to human is clear. If both copies were retained, the correspondence becomes very murky (co-orthology). We need to double up on the zebrafish column. Best Blastp does not work in this situation.
What we are doing now -- picking one, not mentioning the second -- is a disservice to the visitor.

6. The worst bug of all is in the 'Orthologous Genes' table. We did not really filter out non-syntenic hits in mouse and rat. "Filtering out of non-syntenic hits" should be changed to what we actually did here operationally. Visitors cannot use our material without an explanation of methods. I provide below a counter-example, PRDM9, that proves syntenic filtering was not done (see attached graphic). The attached mouse and human browser screenshots show human and mouse 'PRDM9' are not remotely syntenic. The flanking genes do not correspond. This is actually the mouse ortholog of human PRDM7.

Synteny ('same thread') refers to conserved adjacency of potentially rearrangeable genes or other features on a chromosome. The minimum unit is two genes. There is a great risk of cross-matching paralogs and larger segmental duplications. Synteny does not refer to parts of a single gene such as exons, introns, and promoter regions, because these are not units that can be routinely shuffled by chromosomal rearrangements with retention of function. Synteny does not refer to the best whole-genome alignment of two species, restricted to a single gene and its internal nucleotides. That is called best-Blastn.

I'm guessing the procedure used took the comparative genomics track in the human browser, i.e. the one that shows the mouse chromosomes, then intersected it with the mouse gene table. That won't work; it is a faulty algorithm. There have to be multiple genes in the contiguous patch of syntenic chromosome, each normally best-Blastp.

7. When a visitor has landed by whatever route on a gene description page, a natural thing to do next is visit GeneSorter (banner menu). Here GeneSorter should default to the gene on the description page. It does not; it defaults only to the GeneSorter gateway page.
The visitor then has to go back, scrape off the gene name, go forward to the gateway page, move the mouse to the text box, paste the gene name, and hit return. This is inconsistent with our overall 'smart' interface that carries database fields along with page clicks and inserts them appropriately. In the 'Orthologous Genes' table on this same page, clicking on GeneSorter in the mouse ortholog column already does the right thing. It's just human that is broken.

8. Protein Fasta desperately needs to be renamed. Its current name does not describe what it is. Visitors do not have time to explore cryptic links. This is causing one of the most important pages on the entire browser to be greatly under-utilized. Please change to Aligned Orthologs or another appropriately descriptive name (check with Brian Rainey).

9. When a visitor has landed on a gene description page, another natural thing to do next is visit Protein Fasta. Here the browser should display the gene name as well as the uc index number. For example, "Human Gene RHOT1 (uc002hgw.3) Description and Page Index" should go over to "Protein Alignments for Human Gene RHOT1 (uc002hgw.3)" on the Protein Fasta page. Right now, it goes over to just "Protein Alignments for knownGene uc002hgw.3". Here the visitor harvests the fasta sequences, then has to go back a window, scrape off the gene name, and search and replace the uc name with the gene name in the harvested sequences, which should have emphasized the gene name to begin with. (The uc name is just a redundant in-house indexing system; gene names mean something to biomedical researchers.) The current set-up is again inconsistent with our overall 'smart' interface.

_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
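
To make the labelling proposed in items 2 and 4 concrete, here is a minimal Python sketch of best reciprocal Blastp. It assumes the hits have already been parsed into (query, subject, bit score) tuples; the function names, the toy identifiers and the exact cell text are illustrative assumptions, not the browser's actual code or data.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bitscore) tuples.
    On a tie, the first hit seen is kept.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best(a_vs_b, b_vs_a):
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return {a: b for a, b in forward.items() if backward.get(b) == a}


def table_cell(gene, rbh):
    """Cell text: say what was actually computed, or '---' if nothing."""
    return f"BRBp: {rbh[gene]}" if gene in rbh else "---"


# Toy, made-up identifiers (hypothetical, not real genes).
human_vs_mouse = [("hsA", "mmA", 500), ("hsA", "mmB", 300), ("hsB", "mmB", 200)]
mouse_vs_human = [("mmA", "hsA", 480), ("mmB", "hsA", 310), ("mmB", "hsB", 190)]

rbh = reciprocal_best(human_vs_mouse, mouse_vs_human)
print(table_cell("hsA", rbh))  # BRBp: mmA
print(table_cell("hsB", rbh))  # ---
```

In this toy case hsB's best hit is mmB, but mmB's best hit is hsA, so the pair is not reciprocal and the cell is left as '---' rather than being called an ortholog.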

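The synteny requirement in item 6 (at least two genes, with flanking genes that correspond) can be sketched in the same way. This is only one possible reading of that requirement: the window size, the minimum number of shared neighbors, the function names and the toy gene orders are all assumptions chosen for illustration.

```python
def neighbors(order, gene, window=2):
    """Genes within `window` positions of `gene` along one chromosome."""
    i = order.index(gene)
    return set(order[max(0, i - window):i] + order[i + 1:i + 1 + window])


def has_syntenic_support(gene_a, gene_b, order_a, order_b, rbh,
                         window=2, min_shared=1):
    """True if enough flanking genes of gene_a map to flanking genes of gene_b.

    order_a, order_b: gene names in chromosomal order for each species.
    rbh: best-reciprocal-Blastp pairs, species A gene -> species B gene.
    """
    flank_a = neighbors(order_a, gene_a, window)
    flank_b = neighbors(order_b, gene_b, window)
    shared = [g for g in flank_a if rbh.get(g) in flank_b]
    return len(shared) >= min_shared


# Toy, made-up gene orders (hypothetical, not real loci).
human_order = ["h1", "h2", "h3", "h4", "h5"]
mouse_order = ["m1", "m2", "m3", "m7", "m8", "m5", "m9"]
pairs = {"h1": "m1", "h2": "m2", "h3": "m3", "h5": "m5"}

print(has_syntenic_support("h2", "m2", human_order, mouse_order, pairs))  # True
print(has_syntenic_support("h5", "m5", human_order, mouse_order, pairs))  # False
```

The second pair is a best reciprocal hit with no corresponding flanking genes, which is the situation described for PRDM9: a match on a single gene is not by itself evidence of synteny.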