Re: [Genome] Differences in UCSC DB and chr?.fa files

Brooke Rhead Tue, 08 Feb 2011 12:14:48 -0800

Hi Marten,

So, for each known gene, you want to generate a sequence that consists 
of only the exons, correct?  There is not enough information to do it 
with knownGene.txt, as you pointed out, because the coordinates listed 
are only for the genome, and tell you nothing about the coordinates of 
the mRNA.


Instead you could use kgTargetAli.  It gives information about the 
alignment of the mRNA to the genome, and it is in psl format:
http://genome.ucsc.edu/FAQ/FAQformat.html#format2

You could use the qStart and qEnd fields to get the start and end 
positions of the parts of each mRNA that aligned.
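
As a rough illustration of the qStart/qEnd idea, here is a minimal Python
sketch. It assumes a tab-separated file with the 21 standard psl fields,
possibly preceded by a "bin" column as in many UCSC database dumps (I have
not checked whether kgTargetAli.txt has one). The function names and the
example record are made up for illustration, and psl coordinates are taken
to be 0-based with an exclusive end.

```python
PSL_FIELD_COUNT = 21  # number of columns in a standard psl record


def parse_psl_line(line):
    """Parse one tab-separated psl line into a dict of the fields used here."""
    fields = line.rstrip("\n").split("\t")
    # Assumption: a 22-column line carries a leading "bin" column; drop it.
    if len(fields) == PSL_FIELD_COUNT + 1:
        fields = fields[1:]
    if len(fields) != PSL_FIELD_COUNT:
        raise ValueError("expected 21 psl fields, got %d" % len(fields))
    return {
        "strand": fields[8],
        "qName": fields[9],
        "qSize": int(fields[10]),
        "qStart": int(fields[11]),  # 0-based start of aligned part of mRNA
        "qEnd": int(fields[12]),    # exclusive end of aligned part of mRNA
        "tName": fields[13],
    }


def aligned_mrna(psl, mrna_seq):
    """Return the part of the mRNA sequence that aligned to the genome."""
    return mrna_seq[psl["qStart"]:psl["qEnd"]]


# Made-up record: a 26-base mRNA whose first 16 bases align, leaving an
# unaligned polyA tail of 10 bases.
example_line = "\t".join([
    "16", "0", "0", "0", "0", "0", "0", "0", "+",
    "uc_example.1", "26", "0", "16",
    "chr5", "1000000", "100", "116",
    "1", "16,", "0,", "100,",
])
example_mrna = "gatttttttaaccata" + "a" * 10

record = parse_psl_line(example_line)
print(record["qName"], aligned_mrna(record, example_mrna))
```

This only trims the unaligned ends of each mRNA. For per-exon pieces you
would need the blockSizes and qStarts fields as well, and for minus-strand
alignments those block coordinates need extra care (see the psl format page
linked above).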

--
Brooke Rhead
UCSC Genome Bioinformatics Group



On 02/08/11 03:47, Marten Jäger wrote:
> Hi.
> 
> Thanks Brooke for your answer and illustrations. With the given links I 
> now understand the problem I ran into.
> 
> My intention was to reduce data redundancy and run the motif search 
> genome wide only on the exons and assemble the data afterwards for each 
> known gene, transcript, ...
> As far as I now understand, this is not possible. On the other hand, it's 
> not possible to reproduce the exons from knownGeneMrna.txt, since in 1/4 
> to 1/5 of the data the exon start / end indices (--> length) from 
> knownGene.txt do not match, or SNPs could not be considered. Any 
> suggestions? Maybe I should abandon the idea of data reduction.
> 
> Thanks.
> 
> Marten
> 
>> Hi Marten,
>>
>> The differences you are seeing are definitely expected.
>>
>> The sequence found at 
>> ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/... is the 
>> mouse reference genome sequence, and it came from sequencing mouse 
>> DNA.  The sequence in knownGeneMrna.txt is based on mRNA and protein 
>> sequence from several sources (click on the blue "UCSC Genes" link on 
>> http://genome.ucsc.edu/cgi-bin/hgTracks to read more about how this 
>> file was created).  The knownGeneMrna sequence is aligned to the 
>> genomic sequence using BLAT.  The single base differences are SNPs, 
>> and the different exon start/end positions are a result of mRNA 
>> sequence not aligning to the genome, for instance, when there is a 
>> polyA tail on the mRNA.
>>
>> If you need mRNA sequence, I suggest using the knownGeneMrna.txt 
>> sequence rather than the genomic sequence.
>>
>> I hope this is helpful.  If you have further questions, please feel 
>> free to contact us again at [email protected].
>>
>> -- 
>> Brooke Rhead
>> UCSC Genome Bioinformatics Group
>>
>>
>>
>>
>> On 02/07/11 05:00, Marten Jäger wrote:
>>> Hi,
>>>
>>> I downloaded the chromosomal sequences 
>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/...) and 
>>> the Database files 
>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/) for 
>>> knownGene.txt and knownGeneMrna.txt from UCSC. Using the chromosomal 
>>> locations of the exons from knownGene.txt, I extracted the mRNA 
>>> sequences for the known genes and compared them to the sequences in 
>>> knownGeneMrna.txt. Unfortunately, about 1/4 of the sequences differ, 
>>> either in single nucleotide substitutions
>>>
>>> substitution: uc008wki.1
>>>
>>> ...cctcctAtactggagct...
>>> ...cctcctGtactggagct...
>>>
>>> or different exon start/end positions:
>>>
>>> start: uc008wjb.1
>>>
>>> cggcgtgggactgggagtccgtcc...
>>>    gcgtgggactgggagtccgtccgg...
>>>
>>> end: uc008wkk.1
>>>
>>> ...gatttttttaaccataaaaaaaaaaaaaaaaaaaaaaaaaa
>>> ...gatttttttaaccata
>>>
>>>
>>> Can anyone please explain these differences and/or give me a hint 
>>> which data to use (I'm looking for motifs in the processed mRNA).
>>>
>>> Many Thanks.
>>>
>>> Marten
>>>
>>>
> 
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
