I've given up temporarily on biomart, and decided I should get my query working on martshell first.
My first question: why is it that when I list datasets, I see hsapiens_gene_ensembl, but not hsapiens_gene_ensembl_structure? Is it somehow a sub-dataset? How am I supposed to know it exists if list datasets doesn't show it?

I was excited to get results from a query, like this:

MartShell> using hsapiens_gene_ensembl get ensembl_transcript_id where hgnc_symbol in (BRCA1, BRCA2);
ENST00000357654
ENST00000380152
ENST00000267071

My second question: the BRCA1 gene (ENSG00000012048) has a ton of transcripts (http://www.ensembl.org/Homo_sapiens/geneview?gene=ENSG00000012048), but on the web page most of them have an NP number, and only one, ENST00000357654, is described as BRCA1. If I want all the exons for ANY transcript of this gene, do I need to first query the gene ID, then query all exons based on that gene ID? I thought that giving an HGNC symbol would return anything associated with the GENE that has that symbol.

When I tried to query based on the one transcript ID I had, it failed:

MartShell> use hsapiens_gene_ensembl_structure get exon_id where transcript_id in (ENST00000357654);
MartShell> use hsapiens_gene_ensembl_structure get exon_id where stable_transcript_id in (ENST00000357654);
MartShell> use hsapiens_gene_ensembl_structure get exon_id where str_transcript_id in (ENST00000357654);
MartShell>

Now, from ensembl.org, it's clear that there are 23 exons with this transcript ID. So my third question is: what am I doing wrong here?

While I'm at it, what's the difference between transcript_id, stable_transcript_id, and str_transcript_id (same question for gene IDs), and how do I know which filters in hsapiens_gene_ensembl_structure match up with attributes in hsapiens_gene_ensembl?

I'd better stop before I ask too many questions.

- Amir Karger
Computational Biology Group
Bauer Center for Genomics Research
Harvard University
617-496-0626
