I'm using the mart API to download summaries of human exon data. In TSV format everything is clear: one row per exon. I also want the exon sequences, so I run the same query with an added gene_exon attribute, selecting FASTA instead of TSV (aided by the martview website to generate the script).
Intuitively, I would expect the same number of results in each (having selected remove duplicate rows in both cases). However, I get:

  TSV rows:        532103
  Fasta sequences: 141274

Secondly, I believed that I could use the Fasta data alone because the header contains much of the exon metadata. I'm not so sure after looking more closely. The header seems to be ambiguous when an exon is shared between transcripts.

My dataset: "hsapiens_gene_ensembl"

My query attributes (the same for both TSV and Fasta queries):

  qw(chromosome_name ensembl_gene_id ensembl_transcript_id start_position end_position transcript_start transcript_end strand transcript_count ensembl_exon_id exon_chrom_start exon_chrom_end rank phase)

In addition I use gene_exon to obtain an exon sequence in the Fasta query.

An example Fasta record:

  >2|ENSG00000163328|ENST00000295500;ENST00000392552;ENST00000392551|175004621|175060068|175004621;175007126|175060057;175060068|-1|3|ENSE00001073363|175038759|175038876|8;7
  TCTATTGTCTGTGCTGGAATGATGATATGGAATTTTGTTAAAGAAAAAAATTTTGTTGGA

This exon is shared between transcripts:

  ENST00000295500;ENST00000392552;ENST00000392551

However, the transcript starts/ends are reported twice:

  175004621;175007126|175060057;175060068

Presumably this is because two of them share a start/end? But which coordinates belong to which?

Likewise with other fields, e.g. the exon ranks are reported:

  8;7

In fact the exon is in the 8th position in ENST00000392551, which is listed last in the transcript stable ID triple, so these values aren't even in the same order with respect to each other.

Is this a bug or am I misusing the API? I've looked in the manual and the mailing list. I read some threads relating to the Fasta header, but didn't spot anything on this issue specifically.
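To make the ambiguity concrete, here is a minimal Python sketch, separate from the script martview generated. It splits the example header on "|" and ";" and counts the values in each multi-valued field. It assumes the header fields appear in the same order as the query attributes. The example header has 13 fields for the 14 attributes, so under that assumption the last attribute (phase) is simply absent. The names ATTRIBUTES, parse_header and multi_valued_counts are my own and not part of any mart API.

```python
# Attribute order as requested in the query. ASSUMPTION: the Fasta header
# lists its fields in this same order.
ATTRIBUTES = [
    "chromosome_name", "ensembl_gene_id", "ensembl_transcript_id",
    "start_position", "end_position", "transcript_start", "transcript_end",
    "strand", "transcript_count", "ensembl_exon_id",
    "exon_chrom_start", "exon_chrom_end", "rank", "phase",
]

# The example header from above, split across lines only for readability.
HEADER = (
    ">2|ENSG00000163328|ENST00000295500;ENST00000392552;ENST00000392551"
    "|175004621|175060068|175004621;175007126|175060057;175060068"
    "|-1|3|ENSE00001073363|175038759|175038876|8;7"
)


def parse_header(header):
    """Return {attribute: [values]} by pairing header fields with attributes by position."""
    fields = header.lstrip(">").split("|")
    # zip() stops at the shorter input, so an attribute with no
    # corresponding field (here: phase) is left out of the result.
    return {name: field.split(";") for name, field in zip(ATTRIBUTES, fields)}


def multi_valued_counts(parsed):
    """Return {attribute: number of values} for fields holding more than one value."""
    return {name: len(values) for name, values in parsed.items() if len(values) > 1}


parsed = parse_header(HEADER)
counts = multi_valued_counts(parsed)
print(counts)
# {'ensembl_transcript_id': 3, 'transcript_start': 2, 'transcript_end': 2, 'rank': 2}
```

Three transcript IDs against two starts, two ends and two ranks: the multi-valued fields cannot be paired up by position, which is why I can't recover the per-transcript values from the header alone.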
Thanks,
Keith

-- 
Keith James, Wellcome Trust Sanger Institute, UK
