Hi Dan,

After consulting with our scientists, I can confirm that we do not 
have a complete solution. There are, however, a few suggestions for data 
manipulation before and after the coordinate transform of transcript 
offsets to genomic coordinates.

* Translate symbols to genomic coordinates using the Table browser. To 
do this, navigate to the UCSC Genes track, paste/upload your identifiers 
(gene symbols, UniProt IDs, RefSeq NM_* accessions, etc.), and choose GTF 
output because it has one line per exon. The GTF will have UCSC Genes uc* 
IDs. To translate those back to recognizable symbols, do a kgAlias query 
with the same pasted/uploaded identifiers and use Galaxy to join the two 
results.
http://genome.ucsc.edu/cgi-bin/hgTables
http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html

* Then use the exon coordinates to translate the transcript coordinates 
into genomic coordinates (again translating the UCSC Genes transcript 
identifiers (uc*) back into gene symbols with kgAlias). Strand will 
complicate this a bit, and you will need to develop a tool to do this 
part on your own. Follow these instructions for translating coordinates 
in the UCSC system:
http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms

* Export the dbSNP primary table to Galaxy (perhaps filtering to keep 
only single-base SNPs with single-base mappings: class=single and 
locType=exact), and then perform an interval join between the genomic 
coordinates and the dbSNP genomic coordinates. You can filter one track 
against another in the Table browser, but only rows from the main table 
are retained; a full interval join in Galaxy will retain all data, 
including IDs, from both datasets.
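
A minimal sketch of the coordinate steps above, in Python. It is not an 
existing UCSC or Galaxy tool; the function names are hypothetical. It 
assumes exons are given as 0-based, half-open (start, end) pairs on one 
chromosome (GTF output is 1-based and inclusive, so it is converted 
first), that transcript offsets are 0-based, and that SNP rows are 
(chromStart, chromEnd, name) tuples in the same 0-based, half-open system.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # 0-based, half-open (start, end) on the genome


def gtf_to_half_open(start: int, end: int) -> Exon:
    """Convert a 1-based inclusive GTF interval to 0-based half-open."""
    return (start - 1, end)


def transcript_to_genomic(offset: int, exons: List[Exon], strand: str) -> int:
    """Map a 0-based transcript offset to a 0-based genomic position.

    On the minus strand the transcript starts at the end of the
    right-most exon and runs toward lower genomic coordinates.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    # Walk the exons in transcript order.
    ordered = sorted(exons, reverse=(strand == "-"))
    remaining = offset
    for start, end in ordered:
        length = end - start
        if remaining < length:
            if strand == "+":
                return start + remaining
            return end - 1 - remaining
        remaining -= length
    raise ValueError("offset lies beyond the end of the transcript")


def snps_at(position: int, snps: List[Tuple[int, int, str]]) -> List[str]:
    """Names of SNP rows whose interval covers a genomic position.

    Assumes every row is on the same chromosome as the position.
    """
    return [name for start, end, name in snps if start <= position < end]
```

For example, with exons [(100, 110), (200, 220)] on the minus strand, 
transcript offset 0 maps to genomic position 219 and offset 20 maps to 
109. A real interval join in Galaxy would also match on chromosome and 
keep all columns from both datasets.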

Some potential issues to consider:

* Transcript starts are often not the same in one publication vs. 
another. Even for the same transcript, the alignment may differ slightly 
due to the methods used (BLAST vs. BLAT or another aligner).

* Multiple isoforms per gene are in the UCSC Genes track. The table 
knownIsoforms has a field called clusterID that can be interpreted as a 
GeneID and the table knownCanonical names the transcript that we 
consider to be "representative" for any cluster/gene. However, which 
transcript was actually used by the authors of the publication will need 
to be examined. Ideally, one and only one transcript will match the 
transcript from the publication (using the RefSeq ID, or the number of 
bases, or one of the other gene aliases).

* One possible way to test that the correct transcript is being used 
would be to compare the observed column of the SNP table (observed 
alleles) to the published alleles, filter out any SNPs with mismatching 
observed alleles, and then resolve any remaining multiple-mappings manually.


This should leave you with a complete transcript, named by gene symbol, 
along with any linked dbSNP rs# identifiers, all anchored by genome 
position. The final step, creating a variant transcript FASTA sequence 
by swapping in the SNP allele, is something you will need to develop a 
tool for. This transformation is the next logical step for data 
analysis, followed by the translation of the new protein (if the SNP is 
within a coding region) and the characterization of any non-synonymous 
changes.
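
A rough Python sketch of that final step, assuming a coding sequence 
given 5' to 3' on the transcript strand, a 0-based offset into it, and 
the standard genetic code. The function names and the toy sequence in the 
usage note are made up for illustration; this is not an existing tool.

```python
from itertools import product
from typing import Tuple

# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def swap_allele(seq: str, offset: int, ref: str, alt: str) -> str:
    """Return seq with the base at offset changed from ref to alt."""
    if seq[offset] != ref:
        raise ValueError("reference allele does not match the sequence")
    return seq[:offset] + alt + seq[offset + 1:]


def translate(cds: str) -> str:
    """Translate complete codons; a trailing partial codon is ignored."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def describe_change(cds: str, offset: int, ref: str, alt: str) -> Tuple[str, bool]:
    """Return a label such as 'R2W' and whether the change is non-synonymous."""
    variant = swap_allele(cds, offset, ref, alt)
    codon_index = offset // 3
    before = translate(cds)[codon_index]
    after = translate(variant)[codon_index]
    return f"{before}{codon_index + 1}{after}", before != after
```

On the toy sequence "ATGCGGTAA", describe_change("ATGCGGTAA", 3, "C", "T") 
returns ("R2W", True), since the codon CGG (Arg) becomes TGG (Trp). The 
reference-allele check in swap_allele is also a cheap way to catch an 
offset computed against the wrong transcript or transcript start.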

You might examine the tools at dbSNP or Galaxy to see whether any of 
them will do some of these steps.

Thanks,
Jennifer


---------------------------------
Jennifer Jackson
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu/

On 2/15/10 3:31 PM, Dan Rich wrote:
> Hi,
>
> I'm trying to figure out whether there's an existing semi-automated 
> process/software that can translate many human gene mutations/polymorphisms 
> into dbSNP entries (when they exist) presumably via resolving them first to 
> protein and/or gene Refseq offsets (since the literature does not provide the 
> flanking sequences needed to map with BLAST).
>
> Here's an example: "Amir et al. (1999) identified a 390C-T transition in the 
> MECP2 gene, resulting in an arg106-to-trp (R106W) substitution." which maps 
> to dbSNP rs28934907 
> (http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=28934907). Is there existing 
> software or a semi-automated process that can take something like 'R106W' 
> substitution mutation in human MECP2 gene (with mutation format adjustments 
> or translated into Entrez Gene ID as needed) and produce either the MECP2 
> gene Refseq flanking sequence to map to dbSNP/genome or a dbSNP entry 
> directly?
>
> Dan
>
>
>
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome