Re: [Genome] Differences in UCSC DB and chr?.fa files

Marten Jäger Tue, 08 Feb 2011 03:49:06 -0800

Hi.

Thanks, Brooke, for your answer and illustrations. With the links you gave I 
now understand the problem I ran into.


My intention was to reduce data redundancy by running the motif search 
genome-wide on the exons only and assembling the results afterwards for 
each known gene, transcript, ...
As far as I now understand, this is not possible. On the other hand, it is 
also not possible to reproduce the exons from knownGeneMrna.txt, since in 
1/4 to 1/5 of the data the exon start/end indices (and hence the lengths) 
from knownGene.txt do not match, or SNPs cannot be taken into account. Any 
suggestions? Maybe I should abandon the idea of data reduction.
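
For concreteness, here is a minimal Python sketch of the kind of extraction 
and comparison discussed in this thread. It assumes that exonStarts and 
exonEnds in knownGene.txt are comma-separated, 0-based, half-open 
coordinates, and that the chromosome sequence is already loaded as a plain 
string. The function names and the toy data are illustrative only and are 
not part of any UCSC tool.

```python
# Sketch under stated assumptions: UCSC-style 0-based, half-open exon
# coordinates; chromosome sequence held in memory as a Python string.

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def parse_coords(field):
    """Parse a comma-separated coordinate list such as '2,10,' into ints."""
    return [int(x) for x in field.split(",") if x]


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def spliced_sequence(chrom_seq, exon_starts, exon_ends, strand):
    """Concatenate the exon slices and orient them to the transcript strand."""
    seq = "".join(chrom_seq[s:e] for s, e in zip(exon_starts, exon_ends))
    return reverse_complement(seq) if strand == "-" else seq


def compare(genomic, mrna):
    """Classify the difference between genome-derived and mRNA sequence."""
    g, m = genomic.upper(), mrna.upper()
    if g == m:
        return "identical"
    if len(g) == len(m):
        n = sum(a != b for a, b in zip(g, m))
        return f"{n} substitution(s)"
    return f"length differs by {len(m) - len(g)}"


if __name__ == "__main__":
    chrom = "NNACGTNNNNGGCCNN"          # toy chromosome, not real data
    starts = parse_coords("2,10,")
    ends = parse_coords("6,14,")
    genomic = spliced_sequence(chrom, starts, ends, "+")
    print(genomic)
    print(compare(genomic, "ACGAGGCC"))      # single-base difference
    print(compare(genomic, "ACGTGGCCAAAA"))  # e.g. a polyA tail on the mRNA
```

A per-transcript report from compare() separates single-base differences 
from length differences such as an unaligned polyA tail, which are the two 
kinds of mismatch described in the quoted messages below.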

Thanks.

Marten

> Hi Marten,
>
> The differences you are seeing are definitely expected.
>
> The sequence found at 
> ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/... is the 
> mouse reference genome sequence, and it came from sequencing mouse 
> DNA.  The sequence in knownGeneMrna.txt is based on mRNA and protein 
> sequence from several sources (click on the blue "UCSC Genes" link on 
> http://genome.ucsc.edu/cgi-bin/hgTracks to read more about how this 
> file was created).  The knownGeneMrna sequence is aligned to the 
> genomic sequence using BLAT.  The single base differences are SNPs, 
> and the different exon start/end positions are a result of mRNA 
> sequence not aligning to the genome, for instance, when there is a 
> polyA tail on the mRNA.
>
> If you need mRNA sequence, I suggest using the knownGeneMrna.txt 
> sequence rather than the genomic sequence.
>
> I hope this is helpful.  If you have further questions, please feel 
> free to contact us again at [email protected].
>
> -- 
> Brooke Rhead
> UCSC Genome Bioinformatics Group
>
>
>
>
> On 02/07/11 05:00, Marten Jäger wrote:
>> Hi,
>>
>> I downloaded the chromosomal sequences 
>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/...) and 
>> the Database files 
>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/) for 
>> knownGene.txt and knownGeneMrna.txt from UCSC. Using the chromosomal 
>> locations of the exons from knownGene.txt, I extracted the mRNA 
>> sequences for the knownGenes and compared them to the sequences in 
>> knownGeneMrna.txt. Unfortunately about 1/4 of the sequences differ in 
>> single nucleotide mutations
>>
>> substitution: uc008wki.1
>>
>> ...cctcctAtactggagct...
>> ...cctcctGtactggagct...
>>
>> or different exon start/end positions:
>>
>> start: uc008wjb.1
>>
>> cggcgtgggactgggagtccgtcc...
>>    gcgtgggactgggagtccgtccgg...
>>
>> end: uc008wkk.1
>>
>> ...gatttttttaaccataaaaaaaaaaaaaaaaaaaaaaaaaa
>> ...gatttttttaaccata
>>
>>
>> Can anyone please explain these differences and/or give me a hint 
>> which data to use (I'm looking for motifs in the processed mRNA).
>>
>> Many Thanks.
>>
>> Marten
>>
>>

-- 
Marten Jäger, Msc Bioinformatik
Charité - Universitätsmedizin Berlin
Campus Virchow Klinikum
Institut für Medizinische Genetik und Humangenetik
Augustenburger Platz 1
13353 Berlin
Germany
phone:  +49/30/450 569135
email:  [email protected]
http://genetik.charite.de/institut/
http://compbio.charite.de

_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
