Hello Stuart,

If I understand your question correctly, you want a single transcript representing a gene that covers the intron regions of two or more transcripts that each separately represent different genes.
Your original query (comparing exons to introns) sounds like the correct way to do this. You did not mention whether strand is an issue, but you are probably already including or excluding it in your base query.

To solve the multiple-results-per-gene problem, add one more data point: knownIsoforms.clusterId. This lets you group the transcript overlap data by gene (clusterId is how genes are designated in the UCSC Genes track). Within each gene you can then narrow down to a set of candidate transcripts using the knownGene.name identifiers, and pick one transcript based on your own criteria or on the knownCanonical table (explained below).

You can add the extra data point by choosing the output format "selected fields from primary and related tables" when pulling the UCSC Genes data from the Table Browser. With this output type chosen, clicking "get output" brings up another form where linked tables can be added. Add the knownIsoforms table, submit, check the clusterId field from that table, then click the final "get output" under the top primary table (knownGene). You will need to download the data and then reload it as a custom track. BED format with knownGene.name and knownIsoforms.clusterId merged in the "name" field would be one way of incorporating the data.

Another option is to output the entire knownIsoforms table (it has only two columns, clusterId and transcript) and link it with your current results by matching knownGene.name to knownIsoforms.transcript (with unix join or in Galaxy).

To simplify the data further, the knownCanonical table can be used to select one transcript per cluster (gene) for your analysis. Either use it as a filter in your original query or filter the results afterward. Be aware that you may miss some of the genes you are looking for this way, since the "canonical" transcript for a gene may or may not be the one that spans two different genes according to your criteria.
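If you end up doing the final grouping step outside of the Table Browser or Galaxy, the logic can be sketched in a few lines of Python. This is only an illustration, not an official UCSC tool: it assumes you already have the exon/intron overlaps as pairs of transcript names and a two-column clusterId/transcript listing like knownIsoforms. The function names and the example identifiers are made up for the sketch.

```python
from collections import defaultdict


def load_cluster_map(lines):
    """Parse rows of 'clusterId<TAB>transcript' into {transcript: clusterId}."""
    cluster_of = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        cluster_id, transcript = line.split("\t")[:2]
        cluster_of[transcript] = cluster_id
    return cluster_of


def cross_gene_overlaps(pairs, cluster_of):
    """Keep (exon_tx, intron_tx) pairs whose transcripts are in different clusters.

    Returns {(exon_cluster, intron_cluster): sorted list of (exon_tx, intron_tx)}.
    Pairs with a transcript missing from the cluster map are skipped.
    """
    grouped = defaultdict(set)
    for exon_tx, intron_tx in pairs:
        exon_cluster = cluster_of.get(exon_tx)
        intron_cluster = cluster_of.get(intron_tx)
        if exon_cluster is None or intron_cluster is None:
            continue
        if exon_cluster != intron_cluster:
            grouped[(exon_cluster, intron_cluster)].add((exon_tx, intron_tx))
    return {key: sorted(value) for key, value in grouped.items()}


# Made-up example data: cluster 1 has two isoforms, cluster 2 has one.
isoform_rows = ["1\tuc001aaa.1", "1\tuc001aab.1", "2\tuc001bbb.1"]
overlap_pairs = [
    ("uc001aab.1", "uc001aaa.1"),  # same cluster: alternative splicing, dropped
    ("uc001bbb.1", "uc001aaa.1"),  # different clusters: kept
    ("uc001bbb.1", "uc001aab.1"),  # different clusters: kept
]

cluster_of = load_cluster_map(isoform_rows)
result = cross_gene_overlaps(overlap_pairs, cluster_of)
for (exon_cluster, intron_cluster), tx_pairs in result.items():
    print(exon_cluster, intron_cluster, tx_pairs)
```

Grouping by the pair of cluster IDs collapses the many transcript-level hits into one entry per gene pair, and the same-cluster check is what removes the alternative-splicing hits.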
Please note that gene symbols are assigned differently than clusterId (genes) in the UCSC Genes track. See the UCSC Genes track description for the clustering rules. Gene symbols for UCSC Genes are in the table kgAlias (warning: there can be more than one per transcript/cluster). The table kgXref also has alternative identifiers, with the same general warning about linkage (many-to-one, many-to-many, one-to-many). It would be best to stick with clusterId to start with if you are using UCSC Genes, then sort out gene identifiers after the smaller set of candidate transcripts/genes is found.

Also note that you could just use the RefSeq Genes track (refGene.name2 is a gene name) to do the analysis, but it would likely limit the results you find. RefSeq Genes is included in UCSC Genes, so you should not have to do both.

Hopefully one of these data sources will help you to complete your analysis. Galaxy and custom tracks may let you complete the entire analysis, as many of the unix tools that would be useful are available in Galaxy. You may also be able to set up a workflow in Galaxy to do all the steps in one go (after sending the data from the UCSC Table Browser), merging there either custom-merged tables or all of the referenced tables as-is. See the Galaxy online help or contact their mailing list support if you need help with their tools.

Thanks!
Jennifer

---------------------------------
Jennifer Jackson
UCSC Genome Informatics Group
http://genome.ucsc.edu/

On 5/17/10 9:48 AM, Brown, Stuart wrote:
>
> I am trying to figure out a way to find a complete list of genes that are
> located inside of the introns of other genes. I know that this is fairly
> common for small non-coding RNAs, but we want to find all instances of this
> sort of overlap among RefSeq genes (in Drosophila).
> It is fairly easy to do
> an intersection of intervals between all exons and all introns, but that
> lists every annotated alternatively spliced exon (every annotated exon that
> overlaps any region that is also annotated as an intron in some RefSeq gene
> model). What I want is to find exons that overlap introns of DIFFERENT
> genes. Is there any way to construct such a query in the Table Browser - or
> use a Galaxy tool, a trick in Excel, or anything else anyone can think of?
>
> Thanks for your thoughts.
>
> Stuart M. Brown, Ph.D.
> Associate Professor
> Center for Health Informatics and Bioinformatics
> NYU School of Medicine
> 550 First Ave, NY, NY 10016
> [email protected]
> (212)263-7689 FAX (212) 263-8139
>
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
