Hi,

I can't seem to find a pointer to the manual/description of how the
flank[-coding] regions are defined and which conventions apply. I
also couldn't find anything (relevant enough) on the web or in the
mailing-list archive. If the questions I'm asking are documented
somewhere, please let me know.

The sample query I used was:
<Query  virtualSchemaName = "default" header = "0" count = ""
softwareVersion = "0.5" >
<Dataset name = "rnorvegicus_gene_ensembl" interface = "default" >
        <Attribute name = "gene_stable_id" />
        <Attribute name = "coding_gene_flank" />
        <Attribute name = "5utr_start" />
        <Attribute name = "5utr_end" />
        <Attribute name = "transcript_chrom_start" />
        <Attribute name = "transcript_chrom_end" />
        <Attribute name = "transcript_chrom_strand" />
        <Filter name = "upstream_flank" value = "1000"/>
        <Filter name = "transcript_status" value = "KNOWN"/>
        <Filter name = "ensembl_gene_id" value =
"ENSRNOG00000006899,ENSRNOG00000000164"/>
</Dataset></Query>
(with the variation "gene_flank" instead of "coding_gene_flank").


1. definitions first.
I would expect that
flank-coding region (gene) = flank (gene) + 5' UTR

So while getting upstream sequence as "flank (gene)" starts from the
TSS of the "leftmost transcript", "flank-coding region (gene)" should
start at the translation initiation site of the "leftmost transcript".
Am I right here?

Issuing two sample queries to biomart webservice, asking for 1 kbase
upstream of "flank (gene)" and "flank-coding region (gene)", I
expected that the resulting sequences would partially overlap (namely,
in the portion right upstream from TSS; "overlap length" = 1000 - "5'
UTR length"). This seems to be the case, when there is only one 5'UTR
region (as indicated by single 5UTR-start and 5UTR-end values, e.g. in
ENSRNOG00000000164).

However, if more than one 5'UTR is defined for the gene, then
"flank-coding" and "flank" overlap only at higher values of the
'upstream' filter (5 kbases or more, e.g. in ENSRNOG00000006899):
ENSRNOG00000006899|7748641;7744305|7748650;7744534|7744305|7759380|1

So it appears that in the case of multiple 5' UTRs (and "upstream"
checkbox set), the "flank-coding region (gene)" returns the sequence
starting from the "rightmost" 5' UTR of the "leftmost" transcript. Am
I right in this statement?
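
To make the arithmetic behind this guess concrete, here is a minimal Python sketch. It assumes (this is exactly what I am asking about, not documented behaviour) that "flank-coding" is anchored right after the rightmost 5' UTR segment, counted in genomic coordinates with introns included, and that the columns of the result row come back in the order requested in the query. The function names are mine.

```python
# Result row from the sample query. Assumed column order:
# gene_stable_id | 5utr_start | 5utr_end | transcript_chrom_start |
# transcript_chrom_end | transcript_chrom_strand
ROW = "ENSRNOG00000006899|7748641;7744305|7748650;7744534|7744305|7759380|1"


def utr_segments(starts, ends):
    """Pair up the ';'-separated start/end lists and sort by genomic start."""
    pairs = zip(starts.split(";"), ends.split(";"))
    return sorted((int(s), int(e)) for s, e in pairs)


def expected_overlap(row, upstream):
    """Return (offset, overlap) for a forward-strand row.

    offset  = genomic distance from the TSS to the end of the rightmost
              5' UTR segment (ASSUMED anchor of the flank-coding region)
    overlap = bases shared by 'flank' and 'flank-coding' for a given
              value of the upstream filter
    """
    gene, utr_starts, utr_ends, t_start, t_end, strand = row.split("|")
    if int(strand) != 1:
        raise ValueError("this sketch only handles forward-strand rows")
    segments = utr_segments(utr_starts, utr_ends)
    tss = int(t_start)              # leftmost coordinate on the + strand
    last_utr_end = segments[-1][1]  # end of the rightmost 5' UTR segment
    offset = last_utr_end - tss + 1
    return offset, max(0, upstream - offset)


print(expected_overlap(ROW, 1000))  # (4346, 0)
print(expected_overlap(ROW, 5000))  # (4346, 654)
```

Under that assumption the two sequences cannot overlap until the upstream filter exceeds 4346 bases, which would be consistent with seeing an overlap only at about 5 kbases.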



2. conventions.
based on some previous discussions (
http://listserver.ebi.ac.uk/mailing-lists-archives/ensembl-dev/msg01227.html
)
and one of the results I got:
ENSRNOG00000007949|ENSRNOT00000010984|13052235;13052765|13052250;13052846|13037171|13052846|-1
I still find the coordinates confusing to interpret.
Here, 5' UTRs appear to start at positions 13052235;13052765, and end
at 13052250;13052846. Transcript starts at 13037171, ends at 13052846.
Clearly, the position of the 5' UTRs is reversed for the negative
strand (and thus they appear at the "end" of the gene).
Is the earlier discussed "convention" still valid, i.e. do I have to
reverse-complement the upstream sequences I get for negative-strand
genes?
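
For reference, this is how I currently read such a row, as a minimal Python sketch. The column order is assumed to follow the attribute order of the query (with the transcript ID added), the helper names are hypothetical, and whether BioMart already returns negative-strand flanks reverse-complemented is the open question, so reverse_complement() only shows the step itself.

```python
# Assumed column order: gene | transcript | 5utr_start | 5utr_end |
# transcript_chrom_start | transcript_chrom_end | transcript_chrom_strand
ROW = ("ENSRNOG00000007949|ENSRNOT00000010984|13052235;13052765|"
       "13052250;13052846|13037171|13052846|-1")


def parse_row(row):
    """Split one pipe-delimited result row into typed fields."""
    gene, transcript, u_starts, u_ends, t_start, t_end, strand = row.split("|")
    starts = [int(x) for x in u_starts.split(";")]
    ends = [int(x) for x in u_ends.split(";")]
    return {
        "gene": gene,
        "transcript": transcript,
        "utr5": sorted(zip(starts, ends)),  # (start, end) genomic pairs
        "t_start": int(t_start),
        "t_end": int(t_end),
        "strand": int(strand),
    }


def tss(record):
    """Biological transcript start: lowest coordinate on the + strand,
    highest coordinate on the - strand."""
    return record["t_start"] if record["strand"] == 1 else record["t_end"]


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA string (A/C/G/T/N, case preserved)."""
    return seq.translate(_COMPLEMENT)[::-1]


record = parse_row(ROW)
print(tss(record))                 # 13052846
print(record["utr5"][-1][1])       # 13052846
print(reverse_complement("AACG"))  # CGTT
```

Read this way, the last 5' UTR segment ends exactly at transcript_chrom_end, so on the negative strand the coordinates are reported in genomic (forward) order and the TSS sits at the high end.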



3. the problem itself and best method
What I'm attempting to fetch is a fairly small "gene promoter" (less
than 1 kbase).

There are several different options available:
1. do a query like
<Dataset name = "rnorvegicus_gene_ensembl" interface = "default" >
        <Attribute name = "gene_stable_id" />
        <Attribute name = "gene_flank" />
        <Filter name = "downstream_flank" value = "200"/>
        <Filter name = "upstream_flank" value = "1000"/>
        <Filter name = "transcript_status" value = "KNOWN"/>
        <Filter name = "ensembl_gene_id" value = "ENSRNOG00000006899"/>
</Dataset>
but it only returns the 1 kbase of upstream sequence and doesn't
extend past the TSS, which I had expected it to do.
2. do a query like
<Dataset name = "rnorvegicus_gene_ensembl" interface = "default" >
        <Attribute name = "gene_stable_id" />
        <Attribute name = "coding_gene_flank" />
        <Filter name = "upstream_flank" value = "1000"/>
        <Filter name = "transcript_status" value = "KNOWN"/>
        <Filter name = "ensembl_gene_id" value = "ENSRNOG00000006899"/>
</Dataset>
but as shown earlier in this email, this way I may get too many
kilobases of sequence, which is not what I want.
3. issue two queries for each gene, like:
<Dataset name = "rnorvegicus_gene_ensembl" interface = "default" >
        <Attribute name = "gene_stable_id" />
        <Attribute name = "gene_flank" />
        <Filter name = "upstream_flank" value = "1000"/>
        <Filter name = "transcript_status" value = "KNOWN"/>
        <Filter name = "ensembl_gene_id" value = "ENSRNOG00000006899"/>
</Dataset>
and
<Dataset name = "rnorvegicus_gene_ensembl" interface = "default" >
        <Attribute name = "gene_stable_id" />
        <Attribute name = "5utr" />
        <Filter name = "transcript_status" value = "KNOWN"/>
        <Filter name = "ensembl_gene_id" value = "ENSRNOG00000006899"/>
</Dataset>
So far this third approach looks promising, but I haven't tried it yet.
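
A minimal Python sketch of how the two queries of the third option could be generated per gene. build_query is a hypothetical helper of my own; the Query attributes are copied from the sample query at the top of this email, and actually sending the XML to the webservice is left out.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, attributes, filters):
    """Serialize a BioMart-style query to an XML string.

    attributes: list of attribute names
    filters:    dict mapping filter name -> value
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "header": "0",
        "count": "",
        "softwareVersion": "0.5",
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    return ET.tostring(query, encoding="unicode")


DATASET = "rnorvegicus_gene_ensembl"
GENE = "ENSRNOG00000006899"
COMMON = {"transcript_status": "KNOWN", "ensembl_gene_id": GENE}

# Query 1: 1 kbase upstream of the gene.
flank_query = build_query(
    DATASET,
    ["gene_stable_id", "gene_flank"],
    {"upstream_flank": "1000", **COMMON},
)

# Query 2: the 5' UTR sequence itself, with no flank filter.
utr_query = build_query(DATASET, ["gene_stable_id", "5utr"], COMMON)

print(flank_query)
print(utr_query)
```

The two returned sequences would then have to be joined per gene on my side, with the strand handling depending on the answer to question 2.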

Is this last method the right way to do what I need? Or is there a
different (better) way?


Thanks in advance for your replies,

--
Sincerely yours,
Bogdan Tokovenko,
PhD student at the Laboratory of Protein Biosynthesis,
Department of Genetic Information Translation Mechanisms,
Institute of Molecular Biology and Genetics, Kyiv, Ukraine
http://bogdan.org.ua/
