Hello Manisha

I can duplicate your result:

>hg19_ensGene_ENST00000455845 range=chr2:82210396-82232824 5'pad=0 3'pad=0 strand=- repeatMasking=none
CCATCTGGATGTATACCTGCAGGTCACAGGGGATATGATGGCTTAGCTTG
GGCTCAGAGGCTGGACAGGTTTATTCAAAATTATGAGGGATTCATTTAAT
ATCTGAATTGTTCAAGATAATGATTGGTGTGACAATGAGTGACCAAATTT
CTCAAAGTAAGATGAATTTTCCTGCAAGGATGAAAGCTTGGAATATGGCT
TCTGACCAAATACTGTGTCTAGAATTGGTGGGTTCTTGGTCTCCCTGACT
TAAAGAATGAAGCCGCAGACCCTCACAGTATTACAGTTCTTAAAGATGGT
GTGTCCCGAGTCTGTTCCTTCAGATGTTCAGATGTATCCACAGTTTCTTC
TTTCTGGTGGGTTCGTGGTCTTGCTGACTTCAGGAGTGAAGCTGCAGACC
TTCGTGTTTCACTTGTTGATGCCCCTCACAGCAACTGCCACATTGTAAAT
ATGGAATCTGATGTATGTTTGACTTCAGTCAAAAATGT

The range is the footprint of the aligned 3' UTR sequence along the
reference genome (the overall stop<-start span for a minus strand
alignment). This range includes both exons and introns.
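
As a rough sketch of how the footprint can be read off the header, the
snippet below parses the range= field and reports the span it covers. The
helper name footprint_length is made up for illustration and is not a UCSC
tool. It assumes the header coordinates are 1-based and inclusive; if you
treat them as 0-based, half-open, pass inclusive=False and the span comes
out one base shorter.

```python
import re

def footprint_length(header, inclusive=True):
    """Return the genomic span covered by the range= field of a FASTA header."""
    match = re.search(r"range=([^:\s]+):(\d+)-(\d+)", header)
    if match is None:
        raise ValueError("no range= field found in header")
    start, end = int(match.group(2)), int(match.group(3))
    # Assumption: coordinates are 1-based and inclusive, so both ends count.
    return end - start + (1 if inclusive else 0)

header = (">hg19_ensGene_ENST00000455845 range=chr2:82210396-82232824 "
          "5'pad=0 3'pad=0 strand=- repeatMasking=none")
print(footprint_length(header))                   # 22429
print(footprint_length(header, inclusive=False))  # 22428
```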

The fasta sequence represents the 3' UTR sequence only. It can be thought
of as the coverage on the reference genome by the 3' UTR exons only.

These two values are not expected to be the same length unless the
entire 3' UTR lies within a single exon.
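
To illustrate why the two lengths differ, here is a small sketch that
compares the summed exon length with the genomic footprint. The exon
coordinates are hypothetical values chosen only for the example (they are
not taken from ENST00000455845), the function name utr_lengths is my own,
and the blocks are assumed to be 1-based, inclusive and on one chromosome.

```python
def utr_lengths(exon_blocks):
    """Return (exonic_length, footprint_length) for a list of (start, end) blocks."""
    # Exonic length: bases covered by the 3' UTR exons only.
    exonic = sum(end - start + 1 for start, end in exon_blocks)
    # Footprint: lowest start to highest end, so introns are included.
    lowest = min(start for start, _ in exon_blocks)
    highest = max(end for _, end in exon_blocks)
    return exonic, highest - lowest + 1

# Hypothetical coordinates, for illustration only.
two_exons = [(1000, 1099), (5000, 5199)]
one_exon = [(1000, 1299)]

print(utr_lengths(two_exons))  # (300, 4200)
print(utr_lengths(one_exon))   # (300, 300)
```

With two exons the footprint is much longer than the sequence because the
intron is counted; with a single exon the two numbers agree.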

Hopefully this helps to explain the data, but please let us know if you 
need more assistance,
Jennifer


---------------------------------
Jennifer Jackson
UCSC Genome Informatics Group
http://genome.ucsc.edu/

On 4/6/10 11:49 AM, Manisha Brahmachary wrote:
> Hi,
>
> I have a query about the 3'UTR ensembl sequences (hg19).
>
> It appears that the header information does not match the actual length
> of the corresponding sequence.
>
> See the example below:
>
> SeqLength AdvertisedLength name advertisedLocation
> 488 22428 ENST00000455845 chr2:82210396-82232824
>
>
> From the whole 3'UTR file I have 15416 sequences that show this discrepancy.
> I have attached a thread of emails below that you can refer to as to how I
> extracted the 3'UTR sequences from UCSC table.
>
> Thanks
>
> Manisha
>
>
>
>
>
> -----Original Message-----
> From: Jennifer Jackson [mailto:[email protected]]
> Sent: Tuesday, February 16, 2010 7:21 PM
> To: Manisha Brahmachary
> Cc: [email protected]
> Subject: Re: [Genome] Query about downloading 3'UTR sequence for ENSEMBL
>
> Hello,
>
> Yes, you are extracting the data from the Table browser correctly. It
> appears from examining the data at ENSEMBL that the transcript data
> source UniProtKB/Swiss-Prot P27144 (KAD4_HUMAN) was very recently updated.
>
> Last modified February 9, 2010
>
> The new version of the transcript has a much longer 3' UTR. When I take
> the revised sequence and run a simple web BLAT, it aligns easily with
> 100% identity covering the same 3' UTR region as the currently existing
> transcript plus the extra data.
>
> Comparing to other datasets, RefSeq does not have this new variant. The
> human mRNA track has a single read that represents a portion of the
> extended UTR. Examining EST data, spliced ESTs do not confirm the
> region, but unspliced ESTs do, with significant, overlapping tiling.
> However, unspliced ESTs cannot be assigned a strand without a splice
> site, so keep in mind that there may be another gene present on the
> minus strand, maybe even a pseudogene that lacks introns; tiling this
> complete is a bit suspicious. Sequence data from other species (other
> RefSeq, mRNA, EST) suggest that there is evidence for some type of
> transcription in this region, and it is often connected to the positive
> strand with splice sites. None of these are intron-free, unlike the UTR
> reported in the ENSEMBL transcript. From examination of the
> Conservation data, the genomic region is syntenically conserved at the
> genome level from chimp to mouse, and in most mammals evolutionarily in
> between.
>
> The UCSC Genes track was last revised on 2009-10-08, so the extended
> ENSEMBL transcript was not considered. The extended 3' UTR does seem
> very likely to be a transcribed region of the genome, perhaps the 3'
> UTR of this gene, perhaps extended through 2-3 exons. A solid,
> contiguous block of this length is possible, but does not quite fit with
> the other data. However, we are looking at sequence evidence only; there
> may be more evidence based on laboratory results that is not apparent
> from this analysis perspective. Perhaps a review of the other evidence
> at ENSEMBL, and keeping an eye on other datasets (in particular RefSeq)
> as this data is reviewed by other teams, will help to determine/confirm
> what exactly this region represents.
>
> In summary, the new data may be a legitimate extended 3' UTR, perhaps
> multi-exon, or it may represent confusion with a non-coding,
> unspliced, transcribed gene/pseudogene on the minus strand. If you were
> able to correlate expression information with the region (using ENCODE
> or other microarray data), that may also provide some clues. I will leave
> that part of the analysis for you to explore.
>
> Hopefully this helps a bit,
>
> Jennifer
>
> ---------------------------------
> Jennifer Jackson
> UCSC Genome Bioinformatics Group
> http://genome.ucsc.edu/
>
> On 2/16/10 2:17 PM, Manisha Brahmachary wrote:
>> Hello,
>>
>>
>>
>> I have a query regarding downloading 3'UTR for ensembl genes for Homo
>> sapiens.
>>
>>
>>
>> I am trying to download 3'UTR for all genes of ensembl (hg19) for Human
>>
>>
>>
>> From the UCSC table I do the following:
>>
>>
>>
>> Clade: mammal, Genome: human, Assembly: GRCh37
>>
>> Group: Genes and Gene Prediction Tracks, Track: Ensembl Genes
>>
>> Table: ensGene
>>
>> Region:genome
>>
>> Output format: sequence
>>
>>
>>
>> From Ensembl Genes Genomic Sequence browser
>>
>> Sequence Retrieval Region Options
>>
>> I choose: 3'UTR exons
>>
>> One FASTA record per gene
>>
>>
>>
>> When I download the sequence and compare one FASTA sequence for gene
>> ENST00000327299 with the 3'UTR sequence of the same gene downloaded from
>> Ensembl, I see the lengths are different. The UCSC sequence appears to be
>> a subset of the Ensembl-downloaded 3'UTR sequence. (See below the two
>> sequences)
>>
>>
>>
>> QUESTION:  1. Am I doing the steps right to download the entire 3'UTR
>> sequence from UCSC table or am I just downloading a part of the 3'UTR
>> region?
>>
>>
>>
>>
>>
>>
>>
>> See below:
>>
>>
>>
>> FROM UCSC:
>>
>> >hg19_ensGene_ENST00000327299 range=chr1:65691861-65693173 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>>
>> CCCTGCCCAATGGAAGAACCAGGAAGATGTGGTCATTCATTCAATAGTGT
>>
>> GTGTAGTATTGGTGCTGTGTCCAAATTAGAAGCTAGCTGAGGTAGCTTGC
>>
>> AGCATCTTTTCTAGTTGAAATGGTGAACTGATAGGAAAACAAATGAGTAG
>>
>> AAAGAGTTCATGAAGAGGCCCTCCTCTGCCTTTCAAAAGGCTGGTCACCT
>>
>> ACACATGTTTAAGGTGTCTCTGCACATGTCTCAAGCCCATCACAAGAAAG
>>
>> CAAGTACAGTGTGGATTTCAAATGGTGTGTAACTTCAGCTCCAGCTGGTT
>>
>> TTTGACAGCTGTTGCTGTGGTAATATTTTTGACATGTGATGGTGATAGTC
>>
>> TCTGGTTCTCCCCATCCCCACAAAGGCTGTTGAACCACAGCACCAGGAAG
>>
>> CCTGAGAATGAATCCTGAGGGCTCTAGCCCAGGCTTTGTCCCAGGCTTTC
>>
>> TGGTGTGTGCCCTCCTGGTAACAGTGAAATTGAAGCTACTTACTCATAGT
>>
>> GGTTGTTTCTCTGGTCTTGAGTGACTGTGTCCACAGTTCATTTTTTTCCG
>>
>> GTAGGAATAACTCCTTTTCTACATCCACGCTCCATAGAGTCTCTCCTTTT
>>
>> CAGACATCCTGGGATGAAAGAATTTGGCTTTTTTTTTTCTTTTTTTTTTT
>>
>> GGACATCTGTTTTCACTCTTAGGCTTTTAAACAATAGTTATTGCTTTTAT
>>
>> CCCTCTCAGATTCTAATAACTGAGAGCGATGGGGCTATATTGAATCTCTG
>>
>> TATGCACTGAGAACTGAGCTATGAAGAGGATCTTATTAAACTGCTGGTCT
>>
>> GACTTTATGGATTGACACTGTTCCTTTCTTTTATTGTGAAAAAAAAAAAA
>>
>> AACCCTGAAAGTCTTGGGAACCCCCTAAAGTCTTTTGGGAATCCTCAAAA
>>
>> AGCATGGGAAGTTAAGTATTTAGCTACATAAATGTTGTAAGATCATATCT
>>
>> TATGTATAGAAGTAATAAGACCATTTGGAATTACTGGACTAATTGAATAG
>>
>> TTAAGGTTTCTATTCGGGACAATAAAATGTATTTTGAAAGTGCTGCTAAC
>>
>> TATTGATGCTGACAGTGTTTCACTCCTATGAGTGACCCAAACATATTATA
>>
>> AATATGTGGTAAAGGGAATGGAGCCTGTGGGGTTGAGCAGAATGTTGTAC
>>
>> TAGCTGTGCCTGGACTGAGTATAACAGCTTTATGATTATGAGAAAACAAA
>>
>> TTCTTTATTTTTTTTTTCTGTTCCAAAGATTCATCCTATGGGGTGGCCAT
>>
>> AAAGTCTAGAATTAGATACTAATATTTTGTCATTCATTATAACATATCAA
>>
>> TAAACCATTTGTT
>>
>>
>>
>> FROM ENSEMBL
>>
>>
>>
>> >ENSG00000162433|ENST00000327299
>>
>> CCCTGCCCAATGGAAGAACCAGGAAGATGTGGTCATTCATTCAATAGTGTGTGTAGTATT
>>
>> GGTGCTGTGTCCAAATTAGAAGCTAGCTGAGGTAGCTTGCAGCATCTTTTCTAGTTGAAA
>>
>> TGGTGAACTGATAGGAAAACAAATGAGTAGAAAGAGTTCATGAAGAGGCCCTCCTCTGCC
>>
>> TTTCAAAAGGCTGGTCACCTACACATGTTTAAGGTGTCTCTGCACATGTCTCAAGCCCAT
>>
>> CACAAGAAAGCAAGTACAGTGTGGATTTCAAATGGTGTGTAACTTCAGCTCCAGCTGGTT
>>
>> TTTGACAGCTGTTGCTGTGGTAATATTTTTGACATGTGATGGTGATAGTCTCTGGTTCTC
>>
>> CCCATCCCCACAAAGGCTGTTGAACCACAGCACCAGGAAGCCTGAGAATGAATCCTGAGG
>>
>> GCTCTAGCCCAGGCTTTGTCCCAGGCTTTCTGGTGTGTGCCCTCCTGGTAACAGTGAAAT
>>
>> TGAAGCTACTTACTCATAGTGGTTGTTTCTCTGGTCTTGAGTGACTGTGTCCACAGTTCA
>>
>> TTTTTTTCCGGTAGGAATAACTCCTTTTCTACATCCACGCTCCATAGAGTCTCTCCTTTT
>>
>> CAGACATCCTGGGATGAAAGAATTTGGCTTTTTTTTTTCTTTTTTTTTTTGGACATCTGT
>>
>> TTTCACTCTTAGGCTTTTAAACAATAGTTATTGCTTTTATCCCTCTCAGATTCTAATAAC
>>
>> TGAGAGCGATGGGGCTATATTGAATCTCTGTATGCACTGAGAACTGAGCTATGAAGAGGA
>>
>> TCTTATTAAACTGCTGGTCTGACTTTATGGATTGACACTGTTCCTTTCTTTTATTGTGAA
>>
>> AAAAAAAAAAAACCCTGAAAGTCTTGGGAACCCCCTAAAGTCTTTTGGGAATCCTCAAAA
>>
>> AGCATGGGAAGTTAAGTATTTAGCTACATAAATGTTGTAAGATCATATCTTATGTATAGA
>>
>> AGTAATAAGACCATTTGGAATTACTGGACTAATTGAATAGTTAAGGTTTCTATTCGGGAC
>>
>> AATAAAATGTATTTTGAAAGTGCTGCTAACTATTGATGCTGACAGTGTTTCACTCCTATG
>>
>> AGTGACCCAAACATATTATAAATATGTGGTAAAGGGAATGGAGCCTGTGGGGTTGAGCAG
>>
>> AATGTTGTACTAGCTGTGCCTGGACTGAGTATAACAGCTTTATGATTATGAGAAAACAAA
>>
>> TTCTTTATTTTTTTTTTCTGTTCCAAAGATTCATCCTATGGGGTGGCCATAAAGTCTAGA
>>
>> ATTAGATACTAATATTTTGTCATTCATTATAACATATCAATAAACCATTTGTTAAAAGAT
>>
>> TTGCCTGGTTTCCAGACTTGGTGGCCACCTTGAATAATTCTTGCTGTCTTCTGGGAAGGA
>>
>> TGATGAAATTTATTCCTGCTGCCTTAAAAATATGTATCCCTTCTTCACCCATCATGACTG
>>
>> TCCCCAGTGAGTGTCCTTTACTATTCTTGGGAGTGACTCCTGTCTAACTTTTCATACTGG
>>
>> CGAGAAGAAAAGAAGCCTATTTTAACACTTTAGTGGTGTTGAAACACATTACTTACTTTC
>>
>> TGAAGATGTCCCAGTGAATCCTCTGTCAATTCACTGCCATATGTAATCTATATGATAAGG
>>
>> AATGCATCTTCCTTCTAAGTACTGCCCAAACTCTTGCCAGCTCCTCTCCCATTGTCCCTT
>>
>> CATGTGAATATTTCTTGGCTACCTTAGTGGAAATATAGATCAGTTTTCTCCCCATCCATC
>>
>> CTCTCAAACATAATGAGATTGTTTACTTTTTAGATTTATGCAGTGAAAATGCCCAGTCAG
>>
>> GTCTGAATCGTCAGTGCATTATATTGACTCTGAGCACTTTAGAATTTAGAGTTGCAATTG
>>
>> AATGCCAGCTGTGGAGATGGGGTGCATATCAGATATATAAATAAAGCTCAGGTTTGCTAG
>>
>> GGAACCAGGTATAGAGAAAAATAAGTCTGATATGAGGAAAATTGCACAATTTAGAGTAGT
>>
>> TATGCCGTAGAGAAAATTTCCACAAACTAGGAAATGTAGAGAGTTATTCTATAGAATACT
>>
>> CAAAAGAGGAAAGTATGTGATTTTTGGAAACAGGAAAATCTTCAAACTTCTTTCTTCACT
>>
>> TCCCTTTGTGTTTAGCTGACCCTCCAATGTGATCATTGCCTTTGGAGTTTGGGAGAGGTA
>>
>> CGGGAAGTGGCCTGATCCCTGCTTCCATACTTCACTCCTCCATCCATCCTTCCCTCCCTC
>>
>> TTCCCCTCCAGCTAAATGGACAATTCTAGCCAACATTGAGTCACTCAATAAGTCTCAACA
>>
>> GTGGGTGTGTTTGCTGAGATTGTCCAGCGGTTGAGCAGTTTGGTCTCACCTCCCTCGCTA
>>
>> GTTGAGACCAAAAAGAGACAAATAACTTTTTCATGGTCTTTGAAACATAATGCTTATTTC
>>
>> GTGGTCAATGGCTTTAAAAAAATCTGTTTCTTGTTTTCTTCAACAAACTCACTAGTTTTC
>>
>> CCTTAAATGATATTGTAAAAATTAAAGTAATCTTGAAAATGTTTTGACAAAAGTAAAATT
>>
>> AAAGGGACATCTTTTCTTGTTTTGTTTTTTTTTTTTCTATTGCCACACATGACCGTTCCT
>>
>> TCACCTTTAAGCAAAGAGAGTGGTTCAGATGGTTTCTAAGATGCCAACCTGACCTCGCAT
>>
>> TCTGTCATTCTACCCAGCTCTTAATTCAATTTGCTTCCATTATCCTAACAGGCTTCTTTC
>>
>> TTACTTAGAACTTGGAAAGGCTGCTGTATTTAATACCCTCCAACACTAACGCAGACTTAA
>>
>> GATAGGTACTGTTTATTGAAAACCTACTGAGTGAAATGTGCGGTTTTAGGACCTTCATAA
>>
>> ACATCTCATTTAATCTTTCTAGCATCCTGTGAAACAGCCATGATTTCACGTTGATAAACA
>>
>> AAGAAGACAGGGGTCCCAGGGATGTGAAGCATCTTGCCCAGGCTTCTGCTGCTGGTGACC
>>
>> AGTGTAGCCAGGACTCCAGCCCAGGTTTTCCTGACTCAGAAGACTGAGCTTTTTCCTGGA
>>
>> TGTTATTAATAGCTAATTGTGTCCAAGCAACCAAGGGCCTTGAGTCTGCTTGGTTCTGCT
>>
>> TATGGCCTCACATCAAGAAATGGAGCTAGTCCATGTCTGTAGTCCCAATGCTTTGGGAAG
>>
>> CCATGATGGGAAGGTTGTCGGAGGCCAAAAGTTCAAGACCAGGCTGGGCAATATCACAAG
>>
>> ACTCCATCTCTACGGAAAAGTAAAAAATTAGCCAGTCATGGTGGTGTACACTTATGGTCC
>>
>> TAGTTACTCAGGAGACTTAGGCAGGAGGATTGCTTGATCCTAGGAATTCGAGGCTGCAGT
>>
>> GAGCTATGATTGCACCTCTGCACCCAAGCCTGGGCGACACAGCGAGACCCTCTCTCTTAA
>>
>> AAAAAAAAAATAGCAGAGCTCACCAAAGTGATGTTCACCTTTTTATGACATTCCTTTTTC
>>
>> TTAGCTTAAGAAAAGAAAGCTGCTAGATGAGAGTCTTAGTTTTCCTGCATAAGACCTCCT
>>
>> TTATGAATAGAATAAAAGACTGTCAAAGTAGGCTGGGCTTGGGCCCAGGCTAATCTATGA
>>
>> AGGAAGCAAGCTCGTGTTCCTTACCTATCCTTTTGGTGTCCATTGGATTGTGCCCCGAAG
>>
>> TGGCCTTTACCCTTGAGCCGTCCCCAGCCATGGTGCTCACACATAGGCTTTTGAGCTCCT
>>
>> TGGAGCTATCCAGATCCTGCTCACTTTTCCTTCCTGAGATCAGAACAAATCACCCCCTTA
>>
>> CTCCCACTCCAAACAAGGCCTTGATGATAAACTAATCCTTCCTAAAATGCTGGTAGGTAA
>>
>> ACAAGCAATGATGAAGCATTGAACACAGGTTAACTCCTGACTTTTGTACCATTGTCTATT
>>
>> CCATTACACATTAACATGACTCTGAATGCCAGATCCAAACCTTTGCCCACCATCTGCTTG
>>
>> TCGTGCAACAGTTGAGGCAGTAACCAGGGGAGATTCACTTCCTGTCTTGTCCTTCCCCAG
>>
>> GGATCACCCCCCTGCTGCCCTCTAGCAGCCAAACTCAGATGAGTTCCATTGTTACCCTAG
>>
>> GTGTGCCCATCTCTTTGGTAGGGAAGGAGAAAGGTAAGAATAGCCATCAGTGAGGAAGGA
>>
>> TTCTTGGAGCGAGGAGCCACTGTGGTTTTTCCTGCTATTTAAGATGTTGAGACCGGATAA
>>
>> CTTTAGAAAGATACCTGCACAAACCCATAAATAGTGCTTTTATAAAGTTTAGTTCACCGG
>>
>> AACCTGAGTTCAGTATTTGACATTAGCTTTTTGTCCAAAGAGTTGAAGCCTGCTGGAGGT
>>
>> CTTTGCTCAAATAATAAATACCACATATTTCCAAGTGTGTTCAGGTATAGGCACTAGGTA
>>
>> CTGTCTGTTTACTTCATGTTAGGCACATTACATGCATTGGCTAATCAAATCCTCATCAAT
>>
>> TACATATGTAATAATCTAAACTTGCCTCCTTGTATTATAAATGGAAATAATCCTGTTTAT
>>
>> TTAAACGGGTTTTCATGTACCTGTAGGGATTAGGAAACTCAAATGGCCTTTTTAATACCT
>>
>> TTCCCTAGTTTGAGCTCCCTGTTCTCTTTAACAGATAAAACAACATATTTGCTTCAGCCT
>>
>> GGAATCTGTTTTTGGTGCTTTGGTGCAGAGACAGGAAATGGGCACTCAGAGTCACACTGG
>>
>> TAGTTGCACACTGTATCTACAGAGGGCGTGTCTCATCTGTACTCTGCTGGGTTACAGGAT
>>
>> TTCAGTAGGTATTTGTGTCCACCTGAGAATTCTGTTTATTACCTTTCATTTGACAGTGTC
>>
>> TTTCCTTTCTGCAGTTGATTTTGCTAGAGAGGCAATTCATAAGGTGAGGTCCTGTTCATA
>>
>> GTATGACTTGCTTTCTCAATATCTCCTTCAATTTTTAGTAACTCTTGGTCTATTTGGTGT
>>
>> CTTTAAAAAAAATAACCTAGTAATAAAGACTTCTTTTAATGTGGAAATGTGGTCTGGTAG
>>
>> TAAGTTATTTCTTTCCACATGTAACTGACCCAATCTGGTTTCCAAATGAGAAGTGTGCAG
>>
>> GCCCCAGAGGTTGAGAAGCCATATTTCAACTGTGAAAAAAATCTGCTTCCTGCATCTGTT
>>
>> GAAATATAGTTGTTCATACTTGCCATCCCTTATCTTTCTTGTAACAATTTGCACAGTTCT
>>
>> TGCCAGAATAAATGCCATTATCTGTATGTTTCAGGGAGTTCCCCAATTTGATCATTTTTG
>>
>> TGTGTGTGTGGTGTGTGTGTGAGAGAGAGAGATACTGCAGTAAAACATTTCTAAAGGATG
>>
>> AAAGCTCTTGTATGGCATAGATATGAATTCCTTCCTCTGGTAATAATTAGGTTATTCCCA
>>
>> GAAGCACAGTGTCATTCTTTAAATAAAAGCTTTCCTGTTTAAAGCTTTTCAAAGGAGCAG
>>
>> ACCACCTTGAAGATTCCCCCTAGGGTTGATATGTGTCTAATTCATTTTATAAAAATTATT
>>
>> CTTGTCTTCATTTTAAAGCTTTGGCTATATAGTCAGAAATGTCCTAAATAACAAACTATT
>>
>> TTGTATTTAATTTAGGGAAGACTAAAGGGAAGAAAAATGAAAACTCAGTCTTTATGTAAG
>>
>> CTCCAAGGATATTAGGGCTTAAAGGGCTTTTCTAGTTTTATGAGAATTTGTACTACTGAT
>>
>> TTTTATATATTCCTGTTTTTGAGATGAACAGATCTCTGGGGAAATTGTTGAGTTACAATG
>>
>> GCATTTCACTGTGATCCCTCTCAAGCTCAGATCAGTTCTATAACCCAATGACAACCTGTC
>>
>> TCTTTGGTTTACTGTCCTGTGAAATGTCAGCTCAAGTTTCCCAGAAGTCGTGTGTTTATG
>>
>> ATGAGTCAGAGTGCTTTTCCTCGGTGGGACAGTTGCTGGCCCTCTTAATTTTGGTGTATG
>>
>> TGCTTCCAAGTATCTAAACCTCCAGTCTGATCTGTATATGCTATCCTAACTGTTAATTGT
>>
>> ATTATTGATTATGTTGATTATCTTGCTTGAAGGTTCATACTTTTCAATTTGATAGAAATA
>>
>> AAGTTTTTTTCTGCTTATA
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> Genome maillist  -  [email protected]
>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
