Hi Bogdan
On Mon, 11 Jun 2007, Bogdan wrote:
Dear Damian,
thank you for your reply.
ok - we are improving the user warning and images for the forthcoming
release :-) Downstream flank refers to the sequence "downstream of the
gene". As it doesn't really make sense to join the upstream and downstream
flanks when just selecting flanks, we have disabled using them both
together - it just returns the upstream flank, as you experienced.
Apologies for the confusion.
For "flank-coding region", for both "gene" and "transcript", the image
does show both upstream and downstream flanks as those of the
gene/transcript, but for "flank" only the upstream sequence is
highlighted on the image - that was the source of confusion in my
case.
yes - we have changed the image for the upcoming release this week to have
both flanks.
> <Dataset name = "rnorvegicus_gene_ensembl" interface = "default" >
> <Attribute name = "gene_stable_id" />
> <Attribute name = "coding_gene_flank" />
> <Filter name = "upstream_flank" value = "1000"/>
> <Filter name = "transcript_status" value = "KNOWN"/>
> <Filter name = "ensembl_gene_id" value = "ENSRNOG00000006899"/>
> </Dataset>
this should give you your 1000bp upstream of the TSS - is it not doing
this? Or are you looking for something different? Let me know and I will
try to help.
It does give 1kbp upstream, but I'm looking for the 1kbp up from TSS
*plus* a stretch of sequence down from TSS to the translation start
site (i.e. 5'UTR). I can do this with the following sample query:
<Dataset name = "rnorvegicus_gene_ensembl" interface = "default" >
<Attribute name = "gene_stable_id" />
<Attribute name = "5utr" />
<Attribute name = "5utr_start" />
<Attribute name = "5utr_end" />
<Attribute name = "transcript_chrom_strand" />
<Filter name = "upstream_flank" value = "1000"/>
<Filter name = "transcript_status" value = "KNOWN"/>
<Filter name = "ensembl_gene_id" value = "ENSRNOG00000014029"/>
</Dataset>
but I have a problem interpreting the results for the genes with
multiple 5'UTRs defined (like the ENSRNOG00000014029 in the sample
query above).
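
For repeated experiments with different genes or attributes, the query fragment above could be assembled programmatically. The following is a minimal sketch only: `build_dataset_query` is a hypothetical helper name, not part of BioMart, and it builds just the `Dataset` element shown in this thread. Any enclosing wrapper or submission step is deliberately left out.

```python
import xml.etree.ElementTree as ET


def build_dataset_query(dataset, attributes, filters):
    """Build a Dataset fragment like the ones quoted in this thread.

    Hypothetical helper: names and layout mirror the sample query only.
    """
    ds = ET.Element("Dataset", {"name": dataset, "interface": "default"})
    # One Attribute element per requested output column.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    # One Filter element per (name, value) restriction.
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    return ET.tostring(ds, encoding="unicode")


query = build_dataset_query(
    "rnorvegicus_gene_ensembl",
    ["gene_stable_id", "5utr", "5utr_start", "5utr_end",
     "transcript_chrom_strand"],
    {
        "upstream_flank": "1000",
        "transcript_status": "KNOWN",
        "ensembl_gene_id": "ENSRNOG00000014029",
    },
)
print(query)
```

Changing the gene is then a matter of swapping the `ensembl_gene_id` value rather than editing the XML by hand.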
I was confused by that for a while but I think I see what is happening.
There is one gene, one transcript and one 5utr, as one would expect. You
are seeing three start and end positions for the 5utr start and end
attributes, but I believe this is because the UTR runs over more than one
exon. If you concatenate the sequence returned for these 3 pairs you end
up with the 5utr sequence returned by BioMart. I understand the confusion
- you would expect 5utr start and end to give a single pair of coordinates
per utr.
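
The concatenation described above can be sketched in a few lines. This is an illustration only, not BioMart's own code: `join_utr_segments` and the toy sequence are made up for the example, and it assumes 1-based inclusive coordinates on the forward strand, with the strand given as 1 or -1.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def join_utr_segments(chrom_seq, segments, strand):
    """Join per-exon UTR pieces into one spliced UTR sequence.

    segments: (start, end) pairs, assumed 1-based and inclusive.
    strand:   1 for forward, -1 for reverse (assumed encoding).
    """
    # Join the pieces in ascending genomic order ...
    ordered = sorted(segments)
    joined = "".join(chrom_seq[start - 1:end] for start, end in ordered)
    # ... then flip the whole thing for reverse-strand transcripts.
    return joined if strand == 1 else reverse_complement(joined)


toy_chrom = "AAACCCGGGTTTACGT"      # made-up sequence
toy_segments = [(7, 9), (1, 3)]     # two UTR pieces split by an "intron"

print(join_utr_segments(toy_chrom, toy_segments, 1))    # AAAGGG
print(join_utr_segments(toy_chrom, toy_segments, -1))   # CCCTTT
```

With three coordinate pairs, as in the result for this gene, the same function would return a single joined sequence to compare against the `5utr` attribute.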
Does that make sense now or am I missing something?
Thanks again for the feedback
Damian
I do not understand what multiple 5'UTRs should mean for a single
gene. Based on the query results, it appears that UTRs are linked to the
gene, and not to the gene's transcripts. Thus, multiple 5'UTRs shouldn't
mean the UTRs of different transcripts. Then what sequence do I get with
the following query, issued for the multiple-5'UTR gene?
<Dataset name = "rnorvegicus_gene_ensembl" interface = "default" >
<Attribute name = "gene_stable_id" />
<Attribute name = "5utr" />
<Attribute name = "transcript_chrom_strand" />
<Filter name = "transcript_status" value = "KNOWN"/>
<Filter name = "ensembl_gene_id" value = "ENSRNOG00000014029"/>
</Dataset>
I attempted aligning the sequence returned by this query to the
"Unspliced (Gene)" sequence from the same gene, and there were 379bp
of identities followed by some 122bp of non-identical sequence (full
length of 5'UTR returned is 501bp). Hence the question on _what
exactly_ is returned by the 5'utr-query?
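
The figures above (379bp + 122bp = 501bp) describe a shared prefix followed by a divergent tail. A small sketch for measuring that prefix is given here; `common_prefix_length` is a hypothetical helper, the sequences are toy data, and it assumes both strings start at the same position and are on the same strand. If the UTR spans more than one exon, the first mismatch would be expected where the first UTR piece ends and the unspliced sequence continues into an intron.

```python
def common_prefix_length(a, b):
    """Count leading positions at which two sequences agree (case-insensitive)."""
    n = 0
    for x, y in zip(a.upper(), b.upper()):
        if x != y:
            break
        n += 1
    return n


# Toy data: the "UTR" follows the "unspliced" sequence for 4 bases,
# then skips the GGGG stretch that only the unspliced sequence contains.
toy_unspliced = "ACGTGGGGCCCC"
toy_utr = "ACGTCCCC"

shared = common_prefix_length(toy_utr, toy_unspliced)
print(shared, len(toy_utr) - shared)   # 4 4
```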
Thank you for your answer,