Hello Stuart,

If I understand your question correctly, you want a single transcript 
representing a gene that covers the intron regions of two or more 
transcripts that each separately represent different genes.

Your original query (comparing exons to introns) sounds like the correct 
way to do this. You did not mention if strand is an issue, but this is 
probably something that you are already including or excluding in your 
base query. To solve the multiple-results-per-gene problem, add one 
more data point, knownIsoforms.clusterId. This lets you group the 
transcript overlap data by gene (clusterId is how genes are designated 
in the UCSC Genes track). Within each gene, the knownGene.name 
identifiers then give you the set of candidate transcripts. From there, 
pick one transcript based on your own criteria or use the 
knownCanonical table's data (explained below).
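
As a rough illustration of the grouping step, here is a minimal Python 
sketch. It assumes you already have the exon/intron overlaps as pairs of 
transcript names and a transcript-to-clusterId lookup; the function 
name, the data layout, and the identifiers (txA1, etc.) are made up for 
the example and are not part of any UCSC tool.

```python
from collections import defaultdict

def cross_gene_overlaps(overlaps, cluster_of):
    """Keep exon/intron overlaps whose transcripts sit in different clusters.

    overlaps   : iterable of (exon_transcript, intron_transcript) pairs
    cluster_of : dict mapping transcript name -> clusterId
    Returns {(exon_cluster, intron_cluster): sorted list of transcript pairs}.
    """
    grouped = defaultdict(set)
    for exon_tx, intron_tx in overlaps:
        exon_cluster = cluster_of.get(exon_tx)
        intron_cluster = cluster_of.get(intron_tx)
        # Skip transcripts that have no cluster assignment in the lookup.
        if exon_cluster is None or intron_cluster is None:
            continue
        # Same cluster means alternative splicing within one gene: drop it.
        if exon_cluster != intron_cluster:
            grouped[(exon_cluster, intron_cluster)].add((exon_tx, intron_tx))
    return {key: sorted(pairs) for key, pairs in grouped.items()}

# Hypothetical data: txA1 and txA2 are isoforms of one gene, txB1 is another.
cluster_of = {"txA1": 1, "txA2": 1, "txB1": 2}
overlaps = [("txA1", "txA2"), ("txB1", "txA1"), ("txB1", "txA2")]
result = cross_gene_overlaps(overlaps, cluster_of)
print(result)
```

The txA1/txA2 pair is discarded because both transcripts share a 
cluster, which is exactly the alternative-splicing noise described in 
the original question.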

You can add the extra data point by choosing the output type "selected 
fields from primary and related tables" when pulling the UCSC Genes 
data from the Table browser. With this output type chosen, clicking 
"get output" brings up another form where linked tables can be added. 
Add the knownIsoforms table, submit, then check the clusterId field 
from that table, and finish with "get output" under the top primary 
table (knownGene).

You will need to download the data, then reload it as a custom track. 
BED format with knownGene.name and knownIsoforms.clusterId merged into 
the "name" field would be one way of incorporating the data.
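
One possible way to build such a merged "name" field, sketched in 
Python. The "|" separator, the helper names, and the example 
coordinates are assumptions for illustration; any separator without 
whitespace that does not occur in your identifiers should work.

```python
def bed_line(chrom, start, end, transcript, cluster_id, sep="|"):
    """Build a 4-column BED line whose name field is 'transcript|clusterId'."""
    name = f"{transcript}{sep}{cluster_id}"
    return "\t".join([chrom, str(start), str(end), name])

def split_name(name, sep="|"):
    """Recover (transcript, clusterId) from a merged name field."""
    transcript, cluster_id = name.rsplit(sep, 1)
    return transcript, cluster_id

# Hypothetical transcript and coordinates.
line = bed_line("chr2L", 100, 250, "txA1", 1)
print(line)
print(split_name(line.split("\t")[3]))
```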

Another option is to output the entire knownIsoforms table (it has only 
two columns, clusterId and transcript). Then link it with your current 
results by matching knownGene.name to knownIsoforms.transcript (with 
unix join or in Galaxy).
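
If you would rather do the join in a script, a minimal Python sketch 
follows. It assumes the knownIsoforms dump is tab-separated with 
clusterId first and transcript second, that any header line starts with 
"#", and that your result rows carry the transcript name in the first 
column. The function names and sample rows are hypothetical.

```python
import csv
import io

def load_cluster_map(handle):
    """Read a two-column, tab-separated dump: clusterId, transcript."""
    cluster_of = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and '#'-prefixed header lines.
        if not row or row[0].startswith("#"):
            continue
        cluster_id, transcript = row[0], row[1]
        cluster_of[transcript] = cluster_id
    return cluster_of

def annotate(rows, cluster_of, name_index=0):
    """Append the clusterId to each row; 'NA' marks an unmatched transcript."""
    return [row + [cluster_of.get(row[name_index], "NA")] for row in rows]

# Made-up table contents standing in for a downloaded file.
isoforms_text = "#clusterId\ttranscript\n1\ttxA1\n1\ttxA2\n2\ttxB1\n"
cluster_map = load_cluster_map(io.StringIO(isoforms_text))
annotated = annotate(
    [["txB1", "chrX", "100", "200"], ["txZ9", "chrX", "5", "9"]],
    cluster_map,
)
print(annotated)
```

With a real download, open the file and pass the handle to 
load_cluster_map in place of the io.StringIO object.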

To simplify the data further, the table knownCanonical can be used to 
select one transcript per cluster (gene) for your analysis. Either use 
this as a filter in your original query or filter the results 
afterward. Be aware that you may miss some of the genes you are looking 
for this way, since the "canonical" transcript for the gene may or may 
not be the one that spans two different genes according to your 
criteria.
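
To see how much the canonical filter costs you, something like the 
following Python sketch could be run on the results. It assumes you have 
loaded the knownCanonical transcript names into a set and that every 
transcript in your overlap pairs has an entry in the transcript-to-
clusterId lookup; all names and identifiers here are made up.

```python
def keep_canonical(pairs, canonical, cluster_of):
    """Filter overlap pairs to canonical transcripts and report lost clusters.

    pairs      : list of (exon_transcript, intron_transcript)
    canonical  : set of transcript names taken from knownCanonical
    cluster_of : dict mapping every transcript in pairs -> clusterId
    Returns (kept_pairs, clusters present before filtering but not after).
    """
    kept = [(e, i) for e, i in pairs if e in canonical and i in canonical]
    before = {cluster_of[tx] for pair in pairs for tx in pair}
    after = {cluster_of[tx] for pair in kept for tx in pair}
    return kept, before - after

# Hypothetical data: txA2 is a non-canonical isoform of cluster 1.
cluster_of = {"txA1": 1, "txA2": 1, "txB1": 2, "txC1": 3}
canonical = {"txA1", "txB1", "txC1"}
pairs = [("txB1", "txA1"), ("txC1", "txA2")]
kept, lost = keep_canonical(pairs, canonical, cluster_of)
print(kept, lost)
```

Here cluster 3 drops out because its only overlap involves a 
non-canonical isoform, which is the kind of miss described above.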

Please note that in the UCSC Genes track, gene symbols are assigned 
differently from clusterIds (genes). See the UCSC Genes track 
description for the clustering rules. Gene symbols for UCSC Genes are 
in the table kgAlias (warning: there can be more than one per 
transcript/cluster). The table kgXref also has alternative identifiers, 
with the same general warning about linkage (many-to-one, many-to-many, 
one-to-many). It would be best to stick with clusterId to start with if 
you are using UCSC Genes, then sort out gene identifiers after the 
smaller set of candidate transcripts/genes is found.

Also note that you could just use the RefSeq Genes track (refGene.name2 
is a gene name) to do the analysis, but it would likely limit the 
results you find. RefSeq Genes is included in UCSC Genes so you should 
not have to do both.

Hopefully one of these data sources will help you to complete your 
analysis. Using Galaxy and custom tracks may be able to help you to 
complete the entire analysis, as many of the unix tools that would be 
useful are available in Galaxy. You may also be able to set up a 
"work-flow" in Galaxy to do all the steps in one go (after sending the 
data from the UCSC Table browser), using custom-merged tables or all 
referenced tables as-is to merge there. See the Galaxy online help or 
contact their mailing list support if you need help with their tools.

Thanks!
Jennifer

---------------------------------
Jennifer Jackson
UCSC Genome Informatics Group
http://genome.ucsc.edu/

On 5/17/10 9:48 AM, Brown, Stuart wrote:
>
> I am trying to figure out a way to find a complete list of genes that are 
> located inside of the introns of other genes. I know that this is fairly 
> common for small non-coding RNAs, but we want to find all instances of this 
> sort of overlap among RefSeq genes (in Drosophila).  It is fairly easy to do 
> an intersection of intervals between all exons and all introns, but that 
> lists every annotated alternatively spliced exon (every annotated exon that 
> overlaps any region that is also annotated as an intron in some RefSeq gene 
> model).  What I want is to find exons that overlap introns of DIFFERENT 
> genes. Is there any way to construct such a query in the Table Browser - or 
> use a Galaxy tool, a trick in Excel, or anything else anyone can think of?
>
> Thanks for your thoughts.
>
> Stuart M. Brown, Ph.D.
> Associate Professor
> Center for Health Informatics and Bioinformatics
> NYU School of Medicine
> 550 First Ave, NY, NY 10016
> [email protected]
> (212)263-7689   FAX (212) 263-8139
>
>
> ------------------------------------------------------------
> This email message, including any attachments, is for the sole use of the 
> intended recipient(s) and may contain information that is proprietary, 
> confidential, and exempt from disclosure under applicable law. Any 
> unauthorized review, use, disclosure, or distribution is prohibited. If you 
> have received this email in error please notify the sender by return email 
> and delete the original message. Please note, the recipient should check this 
> email and any attachments for the presence of viruses. The organization 
> accepts no liability for any damage caused by any virus transmitted by this 
> email.
> =================================
>
>
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome