Re: [Genome] Canonical RefSeq

Jennifer Jackson Fri, 02 Apr 2010 19:32:37 -0700

Hello Rathi,

Unfortunately, the RefSeq dataset does not contain a "canonical" 
transcript designation. Transcripts are, however, grouped by gene (the 
name2 field in the primary table).

There are a few choices, as you bring up:

1) Select a canonical transcript yourself (longest, most exons, reaching 
furthest 5', or similar).

2) Use UCSC Genes to cluster and use its canonical transcript for your 
analysis. All of RefSeq is included in UCSC Genes, but the canonical 
transcript may or may not be a RefSeq. So even this method would require 
some independent analysis.
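
For option 1, a minimal sketch of one possible rule (longest genomic 
span per gene, ties broken by exon count). The record layout and the 
field names below are assumptions for illustration, loosely modeled on 
the name, chrom, strand, txStart, txEnd, exonCount and name2 columns of 
the RefSeq table; the transcript and gene labels are made up.

```python
from collections import namedtuple

# Hypothetical record type; fields mirror a subset of the RefSeq table columns.
Transcript = namedtuple(
    "Transcript", "name chrom strand tx_start tx_end exon_count name2"
)

def pick_longest(transcripts):
    """Return one transcript per (name2, chrom, strand): the longest by span."""
    best = {}
    for t in transcripts:
        # Key on chrom and strand too, so a gene name that maps to more
        # than one location is not collapsed into a single entry.
        key = (t.name2, t.chrom, t.strand)
        # Rank by span, then exon count, then accession (deterministic ties).
        rank = (t.tx_end - t.tx_start, t.exon_count, t.name)
        if key not in best or rank > best[key][0]:
            best[key] = (rank, t)
    return {key: t for key, (rank, t) in best.items()}

if __name__ == "__main__":
    rows = [
        Transcript("tx1", "chr1", "+", 100, 500, 3, "GeneA"),
        Transcript("tx2", "chr1", "+", 100, 900, 2, "GeneA"),
        Transcript("tx3", "chr2", "-", 10, 20, 1, "GeneB"),
    ]
    for key, t in sorted(pick_longest(rows).items()):
        print(key, t.name)
```

Swapping the ranking tuple is enough to try a different rule, such as 
most exons first.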

For the merging question, perhaps use the RefSeq gene name and only 
compare RefSeqs assigned to the same gene when collapsing the redundancy 
in Galaxy. You could alternatively use the UCSC gene "cluster" name and 
the linked gene symbols (and discard the RefSeq assignments). If you try 
to merge the two clustering methods, it is almost certain that some 
regions will have more than one gene assigned.
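
As a rough illustration of collapsing redundancy per gene outside of 
Galaxy, the sketch below merges overlapping exons only within the same 
gene, so the gene name is carried through. The input tuple layout 
(chrom, start, end, strand, gene) and the function name are assumptions, 
and coordinates are assumed to be 0-based, half-open.

```python
from collections import defaultdict

def merge_exons_by_gene(exons):
    """Merge overlapping exons within each gene, keeping the gene name.

    exons: iterable of (chrom, start, end, strand, gene) tuples.
    """
    groups = defaultdict(list)
    for chrom, start, end, strand, gene in exons:
        groups[(gene, chrom, strand)].append((start, end))

    merged = []
    for (gene, chrom, strand), ivs in sorted(groups.items()):
        ivs.sort()
        cur_start, cur_end = ivs[0]
        for start, end in ivs[1:]:
            if start <= cur_end:
                # Overlapping or directly adjacent: extend the current block.
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end, strand, gene))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end, strand, gene))
    return merged

if __name__ == "__main__":
    exons = [
        ("chr1", 100, 200, "+", "GeneA"),
        ("chr1", 150, 300, "+", "GeneA"),
        ("chr1", 400, 500, "+", "GeneA"),
        ("chr1", 180, 250, "+", "GeneB"),
    ]
    for row in merge_exons_by_gene(exons):
        print(*row, sep="\t")
```

Note that GeneB's exon overlaps GeneA's merged block but stays separate, 
because merging never crosses gene names.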

This is obvious, but just to be sure it is considered: make sure 
strand is taken into account for any interval merges. This would not be 
an issue if you are already clustering by gene (all isoforms would 
*hopefully* be on the same strand already), but strand definitely 
matters if you cluster by reference genome position/footprint before 
doing the redundancy analysis.
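
To make the strand point concrete, here is a small sketch of a 
position-based merge that never joins intervals on opposite strands. 
The function name and the (chrom, start, end, strand) tuple layout are 
assumptions for illustration, with 0-based, half-open coordinates.

```python
from itertools import groupby

def merge_stranded(intervals):
    """Merge overlapping (chrom, start, end, strand) intervals per strand."""
    merged = []
    # Sorting by (chrom, strand, start, end) keeps each strand contiguous.
    ordered = sorted(intervals, key=lambda iv: (iv[0], iv[3], iv[1], iv[2]))
    for (chrom, strand), group in groupby(ordered, key=lambda iv: (iv[0], iv[3])):
        cur_start = cur_end = None
        for _, start, end, _ in group:
            if cur_end is not None and start <= cur_end:
                # Same chrom and strand, overlapping or adjacent: extend.
                cur_end = max(cur_end, end)
            else:
                if cur_end is not None:
                    merged.append((chrom, cur_start, cur_end, strand))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end, strand))
    return merged

if __name__ == "__main__":
    data = [
        ("chr1", 100, 300, "+"),
        ("chr1", 200, 400, "+"),
        ("chr1", 250, 350, "-"),
    ]
    for row in merge_stranded(data):
        print(*row, sep="\t")
```

The minus-strand interval overlaps the plus-strand block in position but 
is reported on its own.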

Some outliers should probably be expected for footprint-based analysis: 
genes within other genes, partially interleaved regions, that sort of 
thing. This is likely where your issue with the redundancy clustering 
producing more than one gene name came from. You will have to decide how 
to sort these out if you choose not to set up the analysis per-gene in 
the beginning. Even the UCSC Genes track has interleaved 
transcripts/genes present (on purpose).
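
One way to find these outliers up front is to cluster by footprint and 
list every cluster that ends up with more than one gene name, then 
decide case by case. This is only a sketch: the function name and the 
(chrom, start, end, strand, gene) layout are assumptions, and here only 
true overlap (not mere adjacency) joins two intervals.

```python
def multi_gene_clusters(intervals):
    """Return footprint clusters that contain more than one gene name.

    intervals: iterable of (chrom, start, end, strand, gene) tuples,
    0-based half-open. Clusters are built per chrom and strand.
    """
    flagged = []
    ordered = sorted(intervals, key=lambda iv: (iv[0], iv[3], iv[1], iv[2]))
    cur = None  # [chrom, start, end, strand, set of gene names]

    def flush(cluster):
        if cluster is not None and len(cluster[4]) > 1:
            chrom, start, end, strand, genes = cluster
            flagged.append((chrom, start, end, strand, sorted(genes)))

    for chrom, start, end, strand, gene in ordered:
        same_track = cur is not None and cur[0] == chrom and cur[3] == strand
        if same_track and start < cur[2]:
            # Overlaps the running cluster: extend it and record the gene.
            cur[2] = max(cur[2], end)
            cur[4].add(gene)
        else:
            flush(cur)
            cur = [chrom, start, end, strand, {gene}]
    flush(cur)
    return flagged

if __name__ == "__main__":
    data = [
        ("chr1", 100, 1000, "+", "GeneA"),
        ("chr1", 400, 600, "+", "GeneB"),   # nested inside GeneA
        ("chr1", 2000, 3000, "+", "GeneC"),
        ("chr1", 500, 700, "-", "GeneD"),   # opposite strand, not clustered
    ]
    for row in multi_gene_clusters(data):
        print(row)
```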

Hopefully this gives you some more information about the RefSeq track 
and some clustering/analysis ideas,

thanks
jen

---------------------------------
Jennifer Jackson
UCSC Genome Informatics Group
http://genome.ucsc.edu/

On 4/2/10 6:03 PM, Rathi Thiagarajan wrote:
> Hi there,
>
> Could you please advise me on the best way to obtain (mm9) RefSeq canonical
> transcripts genomic intervals? I see that there is a table for UCSC genes
> "knownCanonical", but I was wondering if there was something similar just
> for RefSeq? I could just filter for UCSC genes with linked RefSeq ID's but
> was wondering if there was a better way?
>
> Also is it possible to get a non-redundant set of RefSeq exons while still
> retaining the Gene Name information? I have tried to merge the exon
> genomic intervals within Galaxy, but it doesn't return the gene names.
> Basically, my goal is to get RefSeq-based locus information, either
> through non-redundant exons or non-redundant whole-gene coordinates.
>
> Thanking you in advance.
>
> Cheers,
> Rathi
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
