Re: [mart-dev] Interpreting exons wrt transcripts

Rhoda Kinsella Wed, 04 Mar 2009 07:32:04 -0800

Hi Keith,

Please see below for my response:


I'm using the mart API to download summaries of human exon data. In
tsv format everything is clear; one row per exon. I also want the exon
sequences, doing the same query, with an added gene_exon attribute,
selecting FASTA instead of TSV (aided by the martview website to
generate the script).

Intuitively, I would expect the same number of results in each (having
selected remove duplicate rows in both cases). However, I get

TSV rows: 532103
Fasta sequences: 141274

Firstly, I think your FASTA file has been truncated, as you do not get
the total number of possible exon sequences (297956 for release 52).
I would suggest that you download again but select a compressed .gz
file and see if you get the correct count.

The reason for the difference in numbers is that the TSV file contains
the same exons over and over if they are in multiple transcripts. For
each row of the table, there are such a large number of combinations of
attributes that each row will be unique even if the exon ID is in
multiple rows.

In the sequence search, you are requesting just the unique exon
sequence, so you will only get the sequence (and corresponding header)
once for each exon ID. I hope that makes sense, but if I am not being
clear please get back to me.
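To illustrate the difference between the two counts, here is a minimal
sketch in Python (not the Perl mart API used in this thread). It assumes
a headerless, tab-separated file; the two-column layout and the function
name count_rows_and_unique_exons are hypothetical. The sample rows use
only the shared exon and the three transcripts from the example record
in this thread.

```python
import csv
import io

# Illustrative rows only: transcript ID, exon ID (tab-separated, no header).
# The three transcripts share one exon, as in the example in this thread.
SAMPLE_TSV = "\n".join([
    "ENST00000295500\tENSE00001073363",
    "ENST00000392552\tENSE00001073363",
    "ENST00000392551\tENSE00001073363",
])


def count_rows_and_unique_exons(tsv_text, exon_column):
    """Return (number of rows, number of distinct exon IDs) for TSV text."""
    rows = [row for row in csv.reader(io.StringIO(tsv_text), delimiter="\t") if row]
    exon_ids = {row[exon_column] for row in rows}
    return len(rows), len(exon_ids)


total_rows, unique_exons = count_rows_and_unique_exons(SAMPLE_TSV, exon_column=1)
print(total_rows, unique_exons)
```

The first number corresponds to the TSV row count (one row per
exon/transcript combination) and the second to the number of FASTA
records you would expect (one per exon ID).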

Secondly, I believed that I could use the Fasta data alone because the
header contains much of the exon metadata. I'm not so sure after
looking more closely. The header seems to be ambiguous when an exon is
shared between transcripts.

My dataset "hsapiens_gene_ensembl";

My query attributes (the same for both TSV and Fasta queries)

qw(chromosome_name
  ensembl_gene_id ensembl_transcript_id
  start_position end_position
  transcript_start transcript_end strand transcript_count
  ensembl_exon_id exon_chrom_start exon_chrom_end
  rank phase)

In addition I use gene_exon to obtain an exon sequence in the Fasta
query.


An example Fasta record:

2|ENSG00000163328|ENST00000295500;ENST00000392552;ENST00000392551|175004621|175060068|175004621;175007126|175060057;175060068|-1|3|ENSE00001073363|175038759|175038876|8;7

TCTATTGTCTGTGCTGGAATGATGATATGGAATTTTGTTAAAGAAAAAAATTTTGTTGGA

This exon is shared between transcripts
ENST00000295500;ENST00000392552;ENST00000392551

However, the transcript starts/ends are reported twice
175004621;175007126|175060057;175060068

Presumably this is because two of them share a start/end? But which
coordinates belong to which?

These results show that there are two different starts and ends for
these three transcripts (and, as you mention, two transcripts have the
same start and end):

Start1;Start2|End1;End2

where Start1 and End1 are the start/end of one or more transcripts, and
Start2 and End2 are the start/end of the other transcript(s).

I agree that this is confusing, and this issue with the ordering of
selected attributes and results has been mentioned by users. We will
try to address these issues over the coming months. For the moment, I
would suggest that you take the start/end results from the TSV file,
where they are more meaningful.
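As a rough sketch of how the header splits up (in Python rather than the
Perl mart API), the snippet below parses the example record. It assumes
the header fields appear in the same order as the query attributes, which
matches the example but is not something the header itself guarantees.
The names parse_header and ambiguous_fields are hypothetical. Note that
the example header has 13 fields, so the last attribute (phase) is simply
absent from the result.

```python
# Attribute order taken from the query in this thread (assumed to be the
# order of the "|"-separated header fields).
FIELDS = [
    "chromosome_name", "ensembl_gene_id", "ensembl_transcript_id",
    "start_position", "end_position",
    "transcript_start", "transcript_end", "strand", "transcript_count",
    "ensembl_exon_id", "exon_chrom_start", "exon_chrom_end",
    "rank", "phase",
]

HEADER = (
    "2|ENSG00000163328|ENST00000295500;ENST00000392552;ENST00000392551"
    "|175004621|175060068|175004621;175007126|175060057;175060068"
    "|-1|3|ENSE00001073363|175038759|175038876|8;7"
)


def parse_header(header):
    """Split a header on '|' into fields, then each field on ';' into values."""
    values = header.lstrip(">").strip().split("|")
    # zip stops at the shorter sequence, so missing trailing fields are skipped.
    return {name: value.split(";") for name, value in zip(FIELDS, values)}


def ambiguous_fields(parsed):
    """Names of multi-valued fields whose value count differs from the
    number of transcripts, so values cannot be paired up by position."""
    n_transcripts = len(parsed["ensembl_transcript_id"])
    return sorted(
        name for name, vals in parsed.items()
        if len(vals) > 1 and len(vals) != n_transcripts
    )


parsed = parse_header(HEADER)
print(ambiguous_fields(parsed))
```

For this record the transcript start, transcript end and rank fields each
hold two values against three transcript IDs, which is why they cannot be
assigned to individual transcripts from the header alone.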



Likewise with other fields e.g. the exon ranks are reported
8;7

This is because this exon can be ranked as exon 7 in some of the
transcripts for this gene and ranked as exon 8 in other transcripts
for this gene.



In fact the exon is in the 8th position in ENST00000392551 which is
listed last in the transcript stable ID triple, so these values aren't
even in the same order with respect to each other.

The number 8 in the first position means that it is the 8th exon in
this transcript. The ;7 means it is ranked 7th in other transcripts
for the same gene (i.e. ENST00000392552)
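Since the TSV output has one row per exon/transcript combination, the
rank for a specific transcript can be read from there unambiguously. A
minimal sketch in Python (not the Perl mart API), assuming a headerless
tab-separated file whose column order you supply yourself; the function
name rank_lookup and the three-column layout are hypothetical, and the
two sample rows use only the ranks stated in this thread:

```python
import csv
import io

# Column order is an assumption: pass whatever order your query used.
COLUMNS = ["ensembl_transcript_id", "ensembl_exon_id", "rank"]

# Illustrative rows built from the ranks mentioned in this thread.
SAMPLE_TSV = "\n".join([
    "ENST00000392551\tENSE00001073363\t8",
    "ENST00000392552\tENSE00001073363\t7",
])


def rank_lookup(tsv_text, columns):
    """Map (exon ID, transcript ID) to the exon's rank in that transcript."""
    reader = csv.DictReader(io.StringIO(tsv_text), fieldnames=columns, delimiter="\t")
    return {
        (row["ensembl_exon_id"], row["ensembl_transcript_id"]): int(row["rank"])
        for row in reader
    }


ranks = rank_lookup(SAMPLE_TSV, COLUMNS)
print(ranks[("ENSE00001073363", "ENST00000392551")])
```

Keying on the exon/transcript pair avoids relying on the order of the
values in the FASTA header.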

I hope that helps,
Regards,
Rhoda



Is this a bug or am I misusing the API? I've looked in the manual and
the mailing list. I read some threads relating to the Fasta header,
but didn't spot anything on this issue specifically.

thanks,

Keith

--

- Keith James - Wellcome Trust Sanger Institute, UK -


--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.


Rhoda Kinsella Ph.D.
Ensembl Bioinformatician,
European Bioinformatics Institute (EMBL-EBI),
Wellcome Trust Genome Campus,
Hinxton
Cambridge CB10 1SD,
UK.
