I'm using the mart API to download summaries of human exon data. In
TSV format everything is clear: one row per exon. I also want the exon
sequences, so I run the same query with an added gene_exon attribute,
selecting FASTA instead of TSV (aided by the martview website to
generate the script).

Intuitively, I would expect the same number of results in each (having
selected "remove duplicate rows" in both cases). However, I get:

TSV rows: 532103
Fasta sequences: 141274
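
A minimal sketch of how the two counts could be compared, and how to check whether the FASTA output looks like one record per distinct exon rather than one per TSV row. The function names are hypothetical, and the column index assumes the TSV columns follow the attribute order listed below (ensembl_exon_id as the 10th column, no header row); neither is confirmed by the mart documentation.

```python
import csv

# Assumption: TSV columns follow the query attribute order, so
# ensembl_exon_id is the 10th column (index 9) and there is no header row.
EXON_ID_COLUMN = 9


def count_tsv(handle, exon_column=EXON_ID_COLUMN):
    """Return (number of rows, number of distinct exon IDs) in a TSV export."""
    rows = 0
    exon_ids = set()
    for record in csv.reader(handle, delimiter="\t"):
        if not record:
            continue  # skip blank lines
        rows += 1
        exon_ids.add(record[exon_column])
    return rows, len(exon_ids)


def count_fasta(handle):
    """Return the number of FASTA records, counted by their '>' header lines."""
    return sum(1 for line in handle if line.startswith(">"))
```

Both functions take an open text file handle, e.g. `count_tsv(open("exons.tsv", newline=""))`, where the file name is a placeholder. If the distinct exon ID count is close to the FASTA record count, that would suggest the FASTA rows are being merged per exon.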

Secondly, I believed that I could use the Fasta data alone because the
header contains much of the exon metadata. I'm not so sure after
looking more closely. The header seems to be ambiguous when an exon is
shared between transcripts.

My dataset is "hsapiens_gene_ensembl".

My query attributes (the same for both TSV and Fasta queries):

qw(chromosome_name
   ensembl_gene_id ensembl_transcript_id
   start_position end_position
   transcript_start transcript_end strand transcript_count
   ensembl_exon_id exon_chrom_start exon_chrom_end
   rank phase)

In addition I use gene_exon to obtain an exon sequence in the Fasta query.

An example Fasta record:

>2|ENSG00000163328|ENST00000295500;ENST00000392552;ENST00000392551|175004621|175060068|175004621;175007126|175060057;175060068|-1|3|ENSE00001073363|175038759|175038876|8;7
TCTATTGTCTGTGCTGGAATGATGATATGGAATTTTGTTAAAGAAAAAAATTTTGTTGGA

This exon is shared between transcripts
ENST00000295500;ENST00000392552;ENST00000392551

However, only two transcript starts and two transcript ends are
reported for the three transcripts:
175004621;175007126|175060057;175060068

Presumably this is because two of the transcripts share a start/end?
But which coordinates belong to which transcript?

Likewise with other fields, e.g. only two exon ranks are reported:
8;7

In fact the exon is in the 8th position in ENST00000392551, which is
listed last in the transcript stable ID triple, so these values aren't
even in the same order with respect to each other.
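
To make the ambiguity concrete, here is a rough sketch that splits the example header on "|" and ";" and reports which fields carry more than one value. The mapping of fields to attribute names is an assumption based on the attribute order of the query; the example header has 13 fields, so only the first 13 attribute names are mapped. The function names are hypothetical.

```python
# Assumption: header fields appear in the same order as the query attributes.
FIELD_NAMES = [
    "chromosome_name", "ensembl_gene_id", "ensembl_transcript_id",
    "start_position", "end_position",
    "transcript_start", "transcript_end", "strand", "transcript_count",
    "ensembl_exon_id", "exon_chrom_start", "exon_chrom_end",
    "rank",
]

EXAMPLE_HEADER = (
    ">2|ENSG00000163328|ENST00000295500;ENST00000392552;ENST00000392551"
    "|175004621|175060068|175004621;175007126|175060057;175060068"
    "|-1|3|ENSE00001073363|175038759|175038876|8;7"
)


def parse_header(header):
    """Map each attribute name to the list of ';'-separated values in its field."""
    fields = header.lstrip(">").strip().split("|")
    return {name: field.split(";") for name, field in zip(FIELD_NAMES, fields)}


def multi_valued(parsed):
    """Return {attribute name: value count} for fields holding several values."""
    return {name: len(values) for name, values in parsed.items() if len(values) > 1}


parsed = parse_header(EXAMPLE_HEADER)
print(multi_valued(parsed))
```

For the example record this lists three transcript IDs but only two values each for transcript_start, transcript_end and rank, so the values cannot be paired with transcripts by position alone.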

Is this a bug or am I misusing the API? I've looked in the manual and
the mailing list. I read some threads relating to the Fasta header,
but didn't spot anything on this issue specifically.

thanks,

Keith

-- 

- Keith James - Wellcome Trust Sanger Institute, UK -

