Hi Keith,
Please see below for my response:
I'm using the mart API to download summaries of human exon data. In
tsv format everything is clear; one row per exon. I also want the exon
sequences, doing the same query, with an added gene_exon attribute,
selecting FASTA instead of TSV (aided by the martview website to
generate the script).
Intuitively, I would expect the same number of results in each (having
selected remove duplicate rows in both cases). However, I get
TSV rows: 532103
Fasta sequences: 141274
Firstly, I think your FASTA file has been truncated as you do not get
the total number of possible exon sequences (297956 for release 52).
I would suggest that you download again but select a compressed .gz
file and see if you get the correct count.
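As a quick check for truncation, a Python sketch along the following lines would count the records in the downloaded file. It assumes that each record begins with a standard ">" header line; the function names are illustrative and not part of the mart API.

```python
import gzip


def count_fasta_records(lines):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


def count_fasta_records_gz(path):
    """Count the records in a gzip-compressed FASTA file at `path`."""
    with gzip.open(path, "rt") as handle:
        return count_fasta_records(handle)
```

Comparing the result against the 297956 exon sequences expected for release 52 would show whether the download completed.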
The reason for the difference in numbers is that the TSV file contains
the same exons over and over if they are in multiple transcripts. For
each row of the table, there are such a large number of combinations
of attributes that each row will be unique even if the exon ID is in
multiple rows.
In the sequence search, you are requesting just the unique exon
sequence, so you will only get the sequence (and corresponding header)
once for each exon ID.
I hope that makes sense, but if I am not being clear please get back
to me.
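To see this effect in the TSV output, a minimal Python sketch such as the one below compares the number of rows with the number of distinct exon IDs. The column name "ensembl_exon_id" is an assumption (the real header line may use a display name instead), and the sample rows are made up for illustration.

```python
import csv
import io


def count_rows_and_unique_exons(tsv_text, exon_column="ensembl_exon_id"):
    """Return (number of data rows, number of distinct exon IDs)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = 0
    exon_ids = set()
    for row in reader:
        rows += 1
        exon_ids.add(row[exon_column])
    return rows, len(exon_ids)


# Made-up rows: EXON_A appears in two transcripts, EXON_B in one.
sample = (
    "ensembl_transcript_id\tensembl_exon_id\trank\n"
    "TRANSCRIPT_1\tEXON_A\t7\n"
    "TRANSCRIPT_2\tEXON_A\t8\n"
    "TRANSCRIPT_1\tEXON_B\t8\n"
)
print(count_rows_and_unique_exons(sample))
```

The row count is the figure that corresponds to the TSV total, while the distinct exon ID count is the figure to compare with the number of FASTA sequences.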
Secondly, I believed that I could use the Fasta data alone because the
header contains much of the exon metadata. I'm not so sure after
looking more closely. The header seems to be ambiguous when an exon is
shared between transcripts.
My dataset "hsapiens_gene_ensembl";
My query attributes (the same for both TSV and Fasta queries)
qw(chromosome_name
ensembl_gene_id ensembl_transcript_id
start_position end_position
transcript_start transcript_end strand transcript_count
ensembl_exon_id exon_chrom_start exon_chrom_end
rank phase)
In addition I use gene_exon to obtain an exon sequence in the Fasta
query.
An example Fasta record:
2|ENSG00000163328|ENST00000295500;ENST00000392552;ENST00000392551|
175004621|175060068|175004621;175007126|175060057;175060068|-1|3|
ENSE00001073363|175038759|175038876|8;7
TCTATTGTCTGTGCTGGAATGATGATATGGAATTTTGTTAAAGAAAAAAATTTTGTTGGA
This exon is shared between transcripts
ENST00000295500;ENST00000392552;ENST00000392551
However, the transcript starts/ends are reported twice
175004621;175007126|175060057;175060068
Presumably this is because two of them share a start/end? But which
coordinates belong to which?
These results show that there are two different starts and ends for
these three transcripts (and, as you mention, two transcripts have the
same start and end):
Start1;Start2|End1;End2 (with Start1 and End1 being the start/end of
one or more transcripts, and Start2 and End2 being the start/end of
the other transcript(s))
I agree that this is confusing and this issue with the ordering of
selected attributes and results has been mentioned by users.
We will try to address these issues over the coming months. For the
moment, I would suggest that you take the start/end results from the
TSV file, where they are more meaningful.
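One way to do this is sketched below in Python: index the TSV rows by exon ID so that each transcript's start/end can be looked up unambiguously. The column names are assumed to match the attribute names in the query (the actual header line may differ), the function name is hypothetical, and the IDs and coordinates in the sample are made up.

```python
import csv
import io
from collections import defaultdict


def transcript_coords_by_exon(
    tsv_text,
    exon_col="ensembl_exon_id",
    transcript_col="ensembl_transcript_id",
    start_col="transcript_start",
    end_col="transcript_end",
):
    """Map exon ID -> {transcript ID: (transcript_start, transcript_end)}."""
    coords = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        coords[row[exon_col]][row[transcript_col]] = (
            int(row[start_col]),
            int(row[end_col]),
        )
    return dict(coords)


# Made-up rows: one exon shared by three transcripts, two of which
# have identical start/end coordinates.
sample_tsv = (
    "ensembl_transcript_id\ttranscript_start\ttranscript_end\tensembl_exon_id\n"
    "TRANSCRIPT_1\t100\t900\tEXON_A\n"
    "TRANSCRIPT_2\t150\t950\tEXON_A\n"
    "TRANSCRIPT_3\t100\t900\tEXON_A\n"
)
for transcript_id, (start, end) in transcript_coords_by_exon(sample_tsv)["EXON_A"].items():
    print(transcript_id, start, end)
```

The sequence from the FASTA file can then be attached to these records by exon ID, since each exon ID appears only once there.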
Likewise with other fields e.g. the exon ranks are reported
8;7
This is because this exon can be ranked as exon 7 in some of the
transcripts for this gene and ranked as exon 8 in other transcripts
for this gene.
In fact the exon is in the 8th position in ENST00000392551 which is
listed last in the transcript stable ID triple, so these values aren't
even in the same order with respect to each other.
The 8 in the first position means that this is the 8th exon in one of
the transcripts. The ;7 means that it is ranked 7th in another
transcript of the same gene (i.e. ENST00000392552).
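To illustrate why the multi-valued header fields cannot be paired up by position, the following Python sketch splits the example header on "|" and ";". It assumes the fields appear in the same order as the query attributes. The quoted header carries 13 values for 14 attributes, so the last attribute (phase) is simply absent from the result. The function name is illustrative.

```python
# Attribute names in the order given in the query.
ATTRIBUTES = [
    "chromosome_name", "ensembl_gene_id", "ensembl_transcript_id",
    "start_position", "end_position",
    "transcript_start", "transcript_end", "strand", "transcript_count",
    "ensembl_exon_id", "exon_chrom_start", "exon_chrom_end",
    "rank", "phase",
]

# The example header from this thread, joined onto one line.
HEADER = (
    "2|ENSG00000163328|ENST00000295500;ENST00000392552;ENST00000392551|"
    "175004621|175060068|175004621;175007126|175060057;175060068|-1|3|"
    "ENSE00001073363|175038759|175038876|8;7"
)


def parse_header(header, attributes=ATTRIBUTES):
    """Split a header into {attribute: [values]}; ';' separates multiple values."""
    fields = header.lstrip(">").strip().split("|")
    # zip stops at the shorter sequence, so missing trailing fields are skipped.
    return {name: value.split(";") for name, value in zip(attributes, fields)}


parsed = parse_header(HEADER)
for name in ("ensembl_transcript_id", "transcript_start", "transcript_end", "rank"):
    print(name, len(parsed[name]), parsed[name])
```

The transcript ID field holds three values while the start, end and rank fields hold two each, so the values cannot be matched to transcripts from the header alone.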
I hope that helps,
Regards,
Rhoda
Is this a bug or am I misusing the API? I've looked in the manual and
the mailing list. I read some threads relating to the Fasta header,
but didn't spot anything on this issue specifically.
thanks,
Keith
--
- Keith James - Wellcome Trust Sanger Institute, UK -
Rhoda Kinsella Ph.D.
Ensembl Bioinformatician,
European Bioinformatics Institute (EMBL-EBI),
Wellcome Trust Genome Campus,
Hinxton
Cambridge CB10 1SD,
UK.