Hi Marten,

So, for each known gene, you want to generate a sequence that consists of only the exons, correct? There is not enough information to do it with knownGene.txt, as you pointed out, because the coordinates listed are only for the genome, and tell you nothing about the coordinates of the mRNA.
Instead you could use kgTargetAli. It gives information about the alignment of the mRNA to the genome, and it is in psl format:

http://genome.ucsc.edu/FAQ/FAQformat.html#format2

You could use the qStart and qEnd fields to get the start and end positions of the parts of each mRNA that aligned.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

On 02/08/11 03:47, Marten Jäger wrote:
> Hi.
>
> Thanks Brooke for your answer and illustrations. With the given links I
> now understand the problem I ran into.
>
> My intention was to reduce data redundancy and run the motif search
> genome wide only on the exons and assemble the data afterwards for each
> known gene, transcript, ...
> As far as I now understand this is not possible. On the other hand it's not
> possible to reproduce the exons from knownGeneMrna.txt since the exon
> start / end indices (--> length) from knownGene.txt in 1/4-1/5 of the
> data do not match or SNPs could not be considered. Any suggestions? Maybe I
> should abandon the idea of data reduction.
>
> Thanks.
>
> Marten
>
>> Hi Marten,
>>
>> The differences you are seeing are definitely expected.
>>
>> The sequence found at
>> ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/... is the
>> mouse reference genome sequence, and it came from sequencing mouse
>> DNA. The sequence in knownGeneMrna.txt is based on mRNA and protein
>> sequence from several sources (click on the blue "UCSC Genes" link on
>> http://genome.ucsc.edu/cgi-bin/hgTracks to read more about how this
>> file was created). The knownGeneMrna sequence is aligned to the
>> genomic sequence using BLAT. The single base differences are SNPs,
>> and the different exon start/end positions are a result of mRNA
>> sequence not aligning to the genome, for instance, when there is a
>> polyA tail on the mRNA.
>>
>> If you need mRNA sequence, I suggest using the knownGeneMrna.txt
>> sequence rather than the genomic sequence.
>>
>> I hope this is helpful. If you have further questions, please feel
>> free to contact us again at [email protected].
>>
>> --
>> Brooke Rhead
>> UCSC Genome Bioinformatics Group
>>
>>
>> On 02/07/11 05:00, Marten Jäger wrote:
>>> Hi,
>>>
>>> I downloaded the chromosomal sequences
>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/...) and
>>> the Database files
>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/) for
>>> knownGene.txt and knownGeneMrna.txt from UCSC. Using the chromosomal
>>> locations for the exons using knownGene.txt I extracted the mRNA
>>> Sequences for the knownGenes and compared them to the sequences in
>>> knownGeneMrna.txt. Unfortunately about 1/4 of the sequences differ in
>>> single nucleotide mutations
>>>
>>> substitution: uc008wki.1
>>>
>>> ...cctcctAtactggagct...
>>> ...cctcctGtactggagct...
>>>
>>> or different exon start/end positions:
>>>
>>> start: uc008wjb.1
>>>
>>> cggcgtgggactgggagtccgtcc...
>>> gcgtgggactgggagtccgtccgg...
>>>
>>> end: uc008wkk.1
>>>
>>> ...gatttttttaaccataaaaaaaaaaaaaaaaaaaaaaaaaa
>>> ...gatttttttaaccata
>>>
>>>
>>> Can anyone please explain these differences and/or give me a hint
>>> which data to use (I'm looking for motifs in the processed mRNA).
>>>
>>> Many Thanks.
>>>
>>> Marten
>>>
>>>
> _______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
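Following up on the kgTargetAli suggestion at the top of this thread, here is a rough Python sketch of how the aligned parts of each mRNA could be read out of a psl line. It is not an official UCSC tool, and it rests on a few assumptions. The column order is the standard 21-column psl layout from the FAQ page linked above. A row with 22 columns is assumed to carry an extra leading "bin" column, as database dumps often do. The example row and the name uc_example.1 are invented for illustration and do not come from kgTargetAli. qStart and qEnd give the overall aligned span of the mRNA. The per-block positions come from blockSizes and qStarts, and for minus-strand alignments qStarts is counted on the reverse-complemented query, so the sketch converts it back.

```python
from typing import List, NamedTuple

PSL_FIELDS = 21  # standard psl column count


class Block(NamedTuple):
    q_start: int  # 0-based start in the mRNA (query), forward orientation
    q_end: int    # end in the mRNA, exclusive
    t_start: int  # 0-based start in the genome (target)
    t_end: int    # end in the genome, exclusive


def parse_int_list(field: str) -> List[int]:
    """Parse a psl comma-separated list such as '30,50,'."""
    return [int(x) for x in field.split(",") if x]


def psl_blocks(line: str) -> List[Block]:
    """Return the aligned blocks of one psl row as mRNA/genome intervals."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) == PSL_FIELDS + 1:
        cols = cols[1:]  # assumption: the extra leading column is "bin"
    if len(cols) != PSL_FIELDS:
        raise ValueError("expected %d psl columns, got %d" % (PSL_FIELDS, len(cols)))

    strand = cols[8]
    q_size = int(cols[10])
    sizes = parse_int_list(cols[18])     # blockSizes
    q_starts = parse_int_list(cols[19])  # qStarts
    t_starts = parse_int_list(cols[20])  # tStarts

    blocks = []
    for size, qs, ts in zip(sizes, q_starts, t_starts):
        if strand[0] == "-":
            # minus strand: qStarts is relative to the reverse-complemented query
            qs = q_size - (qs + size)
        blocks.append(Block(qs, qs + size, ts, ts + size))
    return blocks


# Invented example row: a 100-base mRNA with two aligned blocks of 30 and 50 bases.
EXAMPLE = "\t".join([
    "80", "0", "0", "0", "0", "0", "1", "970", "+",
    "uc_example.1", "100", "0", "80", "chr1", "5000", "1000", "2050",
    "2", "30,50,", "0,30,", "1000,2000,",
])

if __name__ == "__main__":
    for block in psl_blocks(EXAMPLE):
        print(block)
```

In the invented row, bases 80 to 100 of the mRNA are not covered by any block. That is how an unaligned stretch such as a polyA tail would appear.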
