Hi. Thanks, Brooke, for your answer and illustrations. With the links you gave, I now understand the problem I ran into.
My intention was to reduce data redundancy: run the motif search genome-wide on the exons only, and afterwards assemble the results for each known gene, transcript, and so on. As far as I now understand, this is not possible. On the other hand, it is also not possible to reproduce the exons from knownGeneMrna.txt, because in 1/4 to 1/5 of the data the exon start/end indices (and hence the lengths) from knownGene.txt do not match, or the SNPs could not be taken into account. Any suggestions? Maybe I should abandon the idea of data reduction.

Thanks.
Marten

> Hi Marten,
>
> The differences you are seeing are definitely expected.
>
> The sequence found at
> ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/... is the
> mouse reference genome sequence, and it came from sequencing mouse
> DNA. The sequence in knownGeneMrna.txt is based on mRNA and protein
> sequence from several sources (click on the blue "UCSC Genes" link on
> http://genome.ucsc.edu/cgi-bin/hgTracks to read more about how this
> file was created). The knownGeneMrna sequence is aligned to the
> genomic sequence using BLAT. The single base differences are SNPs,
> and the different exon start/end positions are a result of mRNA
> sequence not aligning to the genome, for instance, when there is a
> polyA tail on the mRNA.
>
> If you need mRNA sequence, I suggest using the knownGeneMrna.txt
> sequence rather than the genomic sequence.
>
> I hope this is helpful. If you have further questions, please feel
> free to contact us again at [email protected].
>
> --
> Brooke Rhead
> UCSC Genome Bioinformatics Group
>
>
>
>
> On 02/07/11 05:00, Marten Jäger wrote:
>> Hi,
>>
>> I downloaded the chromosomal sequences
>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/...) and
>> the Database files
>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/) for
>> knownGene.txt and knownGeneMrna.txt from UCSC.
>> Using the chromosomal
>> locations for the exons using knownGene.txt I extracted the mRNA
>> Sequences for the knownGenes and compared them to the sequences in
>> knownGeneMrna.txt. Unfortunately about 1/4 of the sequences differ in
>> single nucleotide mutations
>>
>> substitution: uc008wki.1
>>
>> ...cctcctAtactggagct...
>> ...cctcctGtactggagct...
>>
>> or different exon start/end positions:
>>
>> start: uc008wjb.1
>>
>> cggcgtgggactgggagtccgtcc...
>> gcgtgggactgggagtccgtccgg...
>>
>> end: uc008wkk.1
>>
>> ...gatttttttaaccataaaaaaaaaaaaaaaaaaaaaaaaaa
>> ...gatttttttaaccata
>>
>>
>> Can anyone please explain these differences and/or give me a hint
>> which data to use (I'm looking for motifs in the processed mRNA).
>>
>> Many Thanks.
>>
>> Marten
>>
>>

--
Marten Jäger, Msc Bioinformatik
Charité - Universitätsmedizin Berlin
Campus Virchow Klinikum
Institut für Medizinische Genetik und Humangenetik
Augustenburger Platz 1
13353 Berlin
Germany
phone: +49/30/450 569135
email: [email protected]
http://genetik.charite.de/institut/
http://compbio.charite.de
_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
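
A minimal sketch of the comparison discussed in this thread, in case it helps anyone reproducing it. It is not the script used by either poster. It assumes the usual knownGene.txt layout (tab-separated; name, chrom, strand in columns 1-3; exonStarts and exonEnds as comma-separated lists in columns 9 and 10; 0-based starts, exclusive ends) and that the chromosome sequence is already loaded as a string. All function names are hypothetical.

```python
# Sketch only: rebuild a spliced transcript from knownGene.txt exon
# coordinates and describe how it differs from the knownGeneMrna.txt sequence.

_COMPLEMENT = str.maketrans("acgtnACGTN", "tgcanTGCAN")


def parse_known_gene_line(line):
    """Parse one knownGene.txt row (assumed column layout, see lead-in)."""
    fields = line.rstrip("\n").split("\t")
    # exonStarts/exonEnds end with a trailing comma, so skip empty items.
    starts = [int(x) for x in fields[8].split(",") if x]
    ends = [int(x) for x in fields[9].split(",") if x]
    return {
        "name": fields[0],
        "chrom": fields[1],
        "strand": fields[2],
        "exons": list(zip(starts, ends)),
    }


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def spliced_sequence(chrom_seq, exons, strand):
    """Concatenate exon slices (0-based, end-exclusive) in transcript order."""
    seq = "".join(chrom_seq[start:end] for start, end in exons)
    return reverse_complement(seq) if strand == "-" else seq


def classify_difference(genomic, mrna):
    """Roughly label the kind of mismatch between the two sequences."""
    g, m = genomic.lower(), mrna.lower()
    if g == m:
        return "identical"
    if len(g) == len(m):
        mismatches = sum(a != b for a, b in zip(g, m))
        return f"{mismatches} substitution(s)"
    if m.startswith(g) and set(m[len(g):]) == {"a"}:
        return "polyA tail on mRNA"
    return "length/boundary difference"
```

Usage would be along the lines of `classify_difference(spliced_sequence(chrom_seq, gene["exons"], gene["strand"]), mrna_seq)` for each transcript. The labels are deliberately coarse: equal-length sequences that are merely shifted against each other would be reported as many substitutions, so a real comparison would need an alignment step.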
