On 11 May 2006, at 17:16, Amir Karger wrote:
I've given up temporarily on biomart, and decided I should get my query
working on martshell first.
My first question: why is it that when I list datasets, I see
hsapiens_gene_ensembl, but not hsapiens_gene_ensembl_structure? Is it
somehow a sub-dataset? How am I supposed to know it exists if list
datasets doesn't show it?
You can't query invisible datasets directly in MartShell (just as you
can't in MartView). hsapiens_gene_ensembl_structure is one of these,
which is why "list datasets" does not show it. The available datasets
are listed below:
MartShell> list datasets;
agambiae_gene_ensembl
amellifera_gene_ensembl
btaurus_gene_ensembl
celegans_gene_ensembl
cfamiliaris_gene_ensembl
cintestinalis_gene_ensembl
dmelanogaster_gene_ensembl
drerio_gene_ensembl
frubripes_gene_ensembl
ggallus_gene_ensembl
hsapiens_gene_ensembl
mdomestica_gene_ensembl
mmulatta_gene_ensembl
mmusculus_gene_ensembl
ptroglodytes_gene_ensembl
rnorvegicus_gene_ensembl
scerevisiae_gene_ensembl
tnigroviridis_gene_ensembl
xtropicalis_gene_ensembl
(only visible datasets are listed)
I was excited to get results from a query, like this:
MartShell> using hsapiens_gene_ensembl get ensembl_transcript_id where
hgnc_symbol in (BRCA1, BRCA2);
ENST00000357654
ENST00000380152
ENST00000267071
My second question: the BRCA1 gene (ENSG00000012048) has a ton of
transcripts
(http://www.ensembl.org/Homo_sapiens/geneview?gene=ENSG00000012048) but
on the web page, most of them have an NP number, and only one,
ENST00000357654, is described as BRCA1. If I want all the exons for ANY
transcript of this gene, do I need to first query the gene ID, then
query all exons based on that gene ID? I thought that giving an HGNC
symbol would return anything associated with the GENE that has that
symbol.
I agree that this is more intuitive, but Ensembl maps its entries per
transcript rather than per gene. If you want more details about this
mapping, you should contact the Ensembl helpdesk ([EMAIL PROTECTED]).
When I tried to query based on the one transcript ID I had, it failed:
MartShell> use hsapiens_gene_ensembl_structure get exon_id where
transcript_id in (ENST00000357654);
MartShell> use hsapiens_gene_ensembl_structure get exon_id where
stable_transcript_id in (ENST00000357654);
MartShell> use hsapiens_gene_ensembl_structure get exon_id where
str_transcript_id in (ENST00000357654);
MartShell>
Now, from ensembl.org, it's clear that there are 23 exons with this
transcript id. So my third question is, what am I doing wrong here?
You can't use the structure dataset because it is invisible (see above).
Also, I can't see an exon_id attribute. You can find all available
attributes with the "list attributes" command or, on Linux, by typing
"get <tab><tab>". BTW, the martj query library is a bit behind the Perl
library, and you can't really use placeholder attributes at the moment.
We are planning a martj upgrade soon.
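
The visibility check described above can be sketched in Python before a
query is ever sent to MartShell. This is only an illustration: the helper
name build_query and the VISIBLE_DATASETS set are hypothetical, the set is
copied from the "list datasets" output earlier in this thread, and the
query string follows the one form that is shown to work here
("using ... get ... where ... in (...);"). It does not check attribute or
filter names; "list attributes" is still needed for that.

```python
# Visible datasets, copied from the "list datasets" output in this thread.
# (Hypothetical constant; a real client would fetch this list from the mart.)
VISIBLE_DATASETS = {
    "agambiae_gene_ensembl", "amellifera_gene_ensembl",
    "btaurus_gene_ensembl", "celegans_gene_ensembl",
    "cfamiliaris_gene_ensembl", "cintestinalis_gene_ensembl",
    "dmelanogaster_gene_ensembl", "drerio_gene_ensembl",
    "frubripes_gene_ensembl", "ggallus_gene_ensembl",
    "hsapiens_gene_ensembl", "mdomestica_gene_ensembl",
    "mmulatta_gene_ensembl", "mmusculus_gene_ensembl",
    "ptroglodytes_gene_ensembl", "rnorvegicus_gene_ensembl",
    "scerevisiae_gene_ensembl", "tnigroviridis_gene_ensembl",
    "xtropicalis_gene_ensembl",
}


def build_query(dataset, attribute, filter_name, values):
    """Compose a MartShell query string for a single attribute and filter.

    Raises ValueError if the dataset is not in the visible list, since
    invisible datasets cannot be queried directly.
    """
    if dataset not in VISIBLE_DATASETS:
        raise ValueError(
            f"{dataset!r} is not listed by 'list datasets'; "
            "it may be an invisible dataset"
        )
    value_list = ", ".join(values)
    return f"using {dataset} get {attribute} where {filter_name} in ({value_list});"


if __name__ == "__main__":
    # Reproduces the query that returned results earlier in the thread.
    print(build_query("hsapiens_gene_ensembl", "ensembl_transcript_id",
                      "hgnc_symbol", ["BRCA1", "BRCA2"]))
```

Passing hsapiens_gene_ensembl_structure as the dataset raises ValueError
instead of producing a query that silently returns nothing.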
While I'm at it, what's the difference between transcript_id,
stable_transcript_id, and str_transcript_id (same question for gene
IDs)?
transcript_stable_id has the format "ENS....", while transcript_id is
the internal numeric database ID.
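
As a rough illustration of that distinction, the Python sketch below
sorts an identifier into "stable" or "internal". The function name
classify_id and the exact pattern are assumptions: the only facts taken
from this thread are that stable IDs start with "ENS" (for example
ENST00000357654 and ENSG00000012048) and that internal IDs are numeric.

```python
import re

# Assumed pattern: "ENS", optional capital letters, then digits.
# This is a guess based on the IDs quoted in this thread, not an
# official Ensembl specification.
STABLE_ID_PATTERN = re.compile(r"^ENS[A-Z]*\d+$")


def classify_id(value):
    """Return 'stable', 'internal', or 'unknown' for an identifier string."""
    if STABLE_ID_PATTERN.match(value):
        return "stable"      # e.g. ENST00000357654
    if value.isdigit():
        return "internal"    # plain numeric database ID
    return "unknown"         # e.g. a gene symbol such as BRCA1


if __name__ == "__main__":
    for candidate in ("ENST00000357654", "ENSG00000012048", "12345", "BRCA1"):
        print(candidate, classify_id(candidate))
```

A check like this can help pick the right filter: a value classified as
"stable" belongs with a stable-ID filter, not with the numeric
transcript_id.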
and how do I know which filters in hsapiens_gene_ensembl_structure match
up with attributes in hsapiens_gene_ensembl?
Not sure I understand this question :)
a.
I'd better stop before I ask too many questions.
- Amir Karger
Computational Biology Group
Bauer Center for Genomics Research
Harvard University
617-496-0626
------------------------------------------------------------------------
Arek Kasprzyk
EMBL-European Bioinformatics Institute.
Wellcome Trust Genome Campus, Hinxton,
Cambridge CB10 1SD, UK.
Tel: +44-(0)1223-494606
Fax: +44-(0)1223-494468
------------------------------------------------------------------------