Hi Dan,

After consulting with our scientists, we can confirm that we do not have a complete solution, but we do have a few suggestions for manipulating the data before and after transforming transcript offsets into genomic coordinates.

* Translate symbols to genomic coordinates using the Table Browser. To do this, navigate to the UCSC Genes track, paste or upload your gene symbols (or UniProt IDs, RefSeq NM_* accessions, etc.) as identifiers, and select GTF output, because it has one line per exon. The GTF will contain UCSC Genes uc* IDs. To translate those back to recognizable symbols, run a kgAlias query with the same pasted/uploaded gene symbols and use Galaxy to join the two results.
  http://genome.ucsc.edu/cgi-bin/hgTables
  http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html
* Then use the exon coordinates to translate the transcript coordinates into genomic coordinates, again translating the UCSC Genes transcript identifiers (uc*) back into gene symbols with kgAlias. Strand will complicate this a bit, and you will need to develop a tool for this part on your own. These instructions describe translating coordinates in the UCSC system:
  http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms
* Export the dbSNP primary table to Galaxy (perhaps filtering to keep only single-base SNPs with single-base mappings: class=single and locType=exact), and then perform an interval join between your genomic coordinates and the dbSNP genomic coordinates. You can filter one track against another in the Table Browser, but only rows from the main table are retained; a full interval join in Galaxy will retain all data, including IDs, from both datasets.

Some potential issues to consider:

* Transcript starts are often not the same in one publication vs. another. Even for the same transcript, the alignment may differ slightly due to the methods used (BLAST vs. BLAT or other).
* Multiple isoforms per gene are in the UCSC Genes track. The table knownIsoforms has a field called clusterID that can be interpreted as a GeneID, and the table knownCanonical names the transcript that we consider to be "representative" for any cluster/gene. However, you will need to determine which transcript the authors of the publication actually used.
  Ideally, one and only one transcript will match the transcript from the publication (by RefSeq ID, number of bases, or one of the other gene aliases).
* One possible way to test that the correct transcript is being used is to compare the observed column of the SNP table (observed alleles) to the published alleles, filter out any SNPs with mismatching observed alleles, and then resolve any remaining multiple mappings manually.

This should leave you with a complete transcript, named by gene symbol, along with any linked dbSNP rs# identifiers, all anchored by genome position.

The final step, creating a variant transcript FASTA sequence by swapping in the SNP allele, is something you will need to develop a tool to do. This transformation is the next logical step for data analysis, followed by translation of the new protein (if the SNP is within a coding region) and characterization of any non-synonymous changes. You might examine the tools at dbSNP or Galaxy to see whether any of them will do some of these steps.

Thanks,
Jennifer

---------------------------------
Jennifer Jackson
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu/

On 2/15/10 3:31 PM, Dan Rich wrote:
> Hi,
>
> I'm trying to figure out whether there's an existing semi-automated
> process/software that can translate many human gene mutations/polymorphisms
> into dbSNP entries (when they exist) presumably via resolving them first to
> protein and/or gene Refseq offsets (since the literature does not provide the
> flanking sequences needed to map with BLAST).
>
> Here's an example: "Amir et al. (1999) identified a 390C-T transition in the
> MECP2 gene, resulting in an arg106-to-trp (R106W) substitution." which maps
> to dbSNP rs28934907
> (http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=28934907).
> Is there existing
> software or a semi-automated process that can take something like 'R106W'
> substitution mutation in human MECP2 gene (with mutation format adjustments
> or translated into Entrez Gene ID as needed) and produce either the MECP2
> gene Refseq flanking sequence to map to dbSNP/genome or a dbSNP entry
> directly?
>
> Dan
>
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
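
As a rough illustration of the transcript-to-genome translation step described in the reply above, here is a minimal Python sketch. It is not a UCSC tool, and the function name transcript_to_genomic is hypothetical. It assumes the exon blocks are given as 0-based, half-open (start, end) genomic intervals, as in the UCSC tables (GTF is 1-based and inclusive, so subtract 1 from each GTF start), and that the offset is 0-based and counted from the 5' end of the transcript.

```python
def transcript_to_genomic(offset, exons, strand):
    """Map a 0-based transcript offset to a 0-based genomic position.

    offset: 0-based position counted from the transcript's 5' end
    exons:  list of (start, end) genomic intervals, 0-based half-open
            (assumed convention; convert GTF starts with start - 1)
    strand: "+" or "-"
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if offset < 0:
        raise ValueError("offset must be non-negative")

    # Walk the exons in transcription order: ascending genomic order on
    # the plus strand, descending on the minus strand.
    ordered = sorted(exons, reverse=(strand == "-"))

    remaining = offset
    for start, end in ordered:
        length = end - start
        if remaining < length:
            if strand == "+":
                return start + remaining
            # On the minus strand the first transcribed base of an exon
            # is its last genomic base (end - 1).
            return end - 1 - remaining
        remaining -= length

    raise ValueError("offset is beyond the end of the transcript")


# Toy example with two exons (made-up coordinates, not a real gene).
toy_exons = [(100, 110), (200, 205)]
print(transcript_to_genomic(10, toy_exons, "+"))  # first base of exon 2: 200
print(transcript_to_genomic(5, toy_exons, "-"))   # last genomic base of exon 1: 109
```

Published positions such as 390C-T are usually 1-based and may be numbered from a different start than the UCSC transcript (see the potential issues listed in the reply), so the offset would need to be adjusted before calling a function like this.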

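
The last steps mentioned in the reply (swapping the SNP allele into the transcript sequence, translating the affected codon, and checking for a non-synonymous change) could be sketched along the following lines. This is only an illustration under stated assumptions: the function names apply_snp and codon_change are hypothetical, the sequence is a made-up toy coding sequence rather than MECP2, positions are 0-based within a coding sequence that starts at the first base of the start codon, and the alleles are assumed to be given on the transcript's strand already.

```python
from itertools import product

# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def apply_snp(seq, pos, ref, alt):
    """Return seq with the base at 0-based pos changed from ref to alt.

    Assumes ref and alt are on the same strand as seq; alleles reported
    on the opposite strand would need to be complemented first.
    """
    if seq[pos] != ref:
        raise ValueError(f"expected {ref} at {pos}, found {seq[pos]}")
    return seq[:pos] + alt + seq[pos + 1:]


def codon_change(cds, pos, alt):
    """Return (ref_aa, 1-based codon number, alt_aa) for a substitution.

    cds is assumed to start at the first base of the start codon.
    """
    codon_start = pos - pos % 3
    ref_codon = cds[codon_start:codon_start + 3]
    if len(ref_codon) < 3:
        raise ValueError("position falls in an incomplete final codon")
    within = pos % 3
    alt_codon = ref_codon[:within] + alt + ref_codon[within + 1:]
    return CODON_TABLE[ref_codon], codon_start // 3 + 1, CODON_TABLE[alt_codon]


# Toy coding sequence: ATG GCC CGG TAA (Met Ala Arg stop).
toy_cds = "ATGGCCCGGTAA"
variant = apply_snp(toy_cds, 6, "C", "T")
ref_aa, codon_number, alt_aa = codon_change(toy_cds, 6, "T")
print(variant)                              # ATGGCCTGGTAA
print(f"{ref_aa}{codon_number}{alt_aa}")    # R3W
print("non-synonymous" if ref_aa != alt_aa else "synonymous")
```

The toy change mirrors the kind of arg-to-trp (CGG to TGG) substitution in the quoted example, but the position and sequence here are invented purely for illustration.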