Hi Brooke,

> Hi Marten,
>
> So, for each known gene, you want to generate a sequence that consists
> of only the exons, correct?

That's correct, I need the mRNA sequence.

> There is not enough information to do it with knownGene.txt, as you
> pointed out, because the coordinates listed are only for the genome,
> and tell you nothing about the coordinates of the mRNA.

Why not? I can use the strand information and the exonStarts/exonEnds chromosomal coordinates to get the exon sequences from chr?.fa for each known gene.

> Instead you could use kgTargetAli. It gives information about the
> alignment of the mRNA to the genome, and it is in psl format:
> http://genome.ucsc.edu/FAQ/FAQformat.html#format2

I think I can completely reconstruct the data by using knownGene.txt:

bin         - of no interest
matches     - the sum of knownGene: exonEnds-exonStarts
misMatches  - this is always '0', at least for mm9 and hg19
repMatches  - ''
nCount      - ''
qNumInsert  - ''
qBaseInsert - ''
tNumInsert  - number of introns between the exons (number of knownGene: exonStarts/exonEnds entries - 1)
tBaseInsert - total length of the tNumInsert introns: the sum of the differences between knownGene: exonStarts(n+1) and exonEnds(n)
strand      - knownGene: strand
qName       - knownGene: name
qSize       - same as matches
qStart      - this is always '0', at least for mm9 and hg19
qEnd        - same as matches
tName       - knownGene: chrom
tSize       - of no interest
tStart      - knownGene: txStart
tEnd        - knownGene: txEnd
blockCount  - knownGene: exonCount
blockSizes  - knownGene: exonEnds-exonStarts
qStarts     - 0, sum(exonEnds(i)-exonStarts(i)) from i = 1 : n-1
tStarts     - knownGene: exonStarts

So you see there is no more information (apart from tSize) stored in the kgTargetAli file than in knownGene.

> You could use the qStart and qEnd fields to get the start and end
> positions of the parts of each mRNA that aligned.

As mentioned above, this is the same information I can reconstruct from knownGene. I still have the problem that I can't reconstruct the exact sequence as stored in the knownGeneMrna file.
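
The field mapping listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a UCSC tool: the helper name `psl_fields_from_known_gene` is made up, the values misMatches = 0 and qStart = 0 are taken from the observation above for mm9/hg19, and the exon end coordinates are back-calculated from the tStarts and blockSizes of the uc008wkk.1 kgTargetAli row quoted in this mail rather than read from knownGene.txt.

```python
def psl_fields_from_known_gene(name, chrom, strand, tx_start, tx_end,
                               exon_starts, exon_ends):
    """Hypothetical helper: derive PSL-style fields from knownGene columns."""
    # blockSizes: length of each exon
    block_sizes = [end - start for start, end in zip(exon_starts, exon_ends)]
    matches = sum(block_sizes)

    # qStarts: running sum of the preceding exon lengths, starting at 0
    q_starts, offset = [], 0
    for size in block_sizes:
        q_starts.append(offset)
        offset += size

    # introns: gaps between exonEnds(n) and exonStarts(n+1)
    intron_lengths = [exon_starts[i + 1] - exon_ends[i]
                      for i in range(len(exon_starts) - 1)]

    return {
        "matches": matches,
        "misMatches": 0,           # assumption: always 0 (as observed above)
        "tNumInsert": len(intron_lengths),
        "tBaseInsert": sum(intron_lengths),
        "strand": strand,
        "qName": name,
        "qSize": matches,
        "qStart": 0,               # assumption: always 0 (as observed above)
        "qEnd": matches,
        "tName": chrom,
        "tStart": tx_start,
        "tEnd": tx_end,
        "blockCount": len(block_sizes),
        "blockSizes": block_sizes,
        "qStarts": q_starts,
        "tStarts": list(exon_starts),
    }


# Exon coordinates for uc008wkk.1, back-calculated from the kgTargetAli row
# (tStarts and blockSizes); assumed to equal knownGene exonStarts/exonEnds.
T_STARTS = [8490335, 8494783, 8520858, 8528605, 8548235,
            8559411, 8569494, 8579024, 8603869, 8622487]
BLOCK_SIZES = [2254, 122, 158, 169, 81, 90, 86, 134, 116, 465]
EXON_ENDS = [s + b for s, b in zip(T_STARTS, BLOCK_SIZES)]

psl = psl_fields_from_known_gene("uc008wkk.1", "chr5", "-",
                                 8490335, 8622952, T_STARTS, EXON_ENDS)
print(psl["matches"], psl["tNumInsert"], psl["tBaseInsert"])
# prints: 3675 9 128942
```

The printed values agree with the matches, tNumInsert and tBaseInsert columns of the kgTargetAli row, which is the point of the argument: these fields follow from the exon coordinates alone, and none of them says anything about the extra 3' bases in knownGeneMrna.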

Coming back to my example 'uc008wkk.1'. The entry in kgTargetAli is:

81 3675 0 0 0 0 0 9 128942 - uc008wkk.1 3675 0 3675 chr5 152537259 8490335 8622952 10 2254,122,158,169,81,90,86,134,116,465, 0,2254,2376,2534,2703,2784,2874,2960,3094,3210, 8490335,8494783,8520858,8528605,8548235,8559411,8569494,8579024,8603869,8622487,

Using knownGene I can generate an mRNA sequence of 3675 bases. The sequence in knownGeneMrna, on the other hand, has 3700 bases (because of the poly-A tail).

Do you know where I can find the additional information needed to generate the exact sequences as in knownGeneMrna, or is it not stored anywhere in the UCSC database?

Thanks a lot.

Marten

>
> --
> Brooke Rhead
> UCSC Genome Bioinformatics Group
>
> On 02/08/11 03:47, Marten Jäger wrote:
>> Hi.
>>
>> Thanks Brooke for your answer and illustrations. With the given links
>> I now understand the problem I ran into.
>>
>> My intention was to reduce data redundancy and run the motif search
>> genome wide only on the exons and assemble the data afterwards for
>> each known gene, transcript, ...
>> As far as I now understand, this is not possible. On the other hand it's
>> not possible to reproduce the exons from knownGeneMrna.txt, since the
>> exon start / end indices (--> length) from knownGene.txt do not match
>> in 1/4-1/5 of the data, or SNPs could not be considered. Any
>> suggestions? Maybe I should abandon the idea of data reduction.
>>
>> Thanks.
>>
>> Marten
>>
>>> Hi Marten,
>>>
>>> The differences you are seeing are definitely expected.
>>>
>>> The sequence found at
>>> ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/... is the
>>> mouse reference genome sequence, and it came from sequencing mouse
>>> DNA. The sequence in knownGeneMrna.txt is based on mRNA and protein
>>> sequence from several sources (click on the blue "UCSC Genes" link
>>> on http://genome.ucsc.edu/cgi-bin/hgTracks to read more about how
>>> this file was created). The knownGeneMrna sequence is aligned to
>>> the genomic sequence using BLAT. The single base differences are
>>> SNPs, and the different exon start/end positions are a result of
>>> mRNA sequence not aligning to the genome, for instance, when there
>>> is a polyA tail on the mRNA.
>>>
>>> If you need mRNA sequence, I suggest using the knownGeneMrna.txt
>>> sequence rather than the genomic sequence.
>>>
>>> I hope this is helpful. If you have further questions, please feel
>>> free to contact us again at [email protected].
>>>
>>> --
>>> Brooke Rhead
>>> UCSC Genome Bioinformatics Group
>>>
>>> On 02/07/11 05:00, Marten Jäger wrote:
>>>> Hi,
>>>>
>>>> I downloaded the chromosomal sequences
>>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/...) and
>>>> the Database files
>>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/) for
>>>> knownGene.txt and knownGeneMrna.txt from UCSC. Using the
>>>> chromosomal locations of the exons from knownGene.txt, I extracted
>>>> the mRNA sequences for the knownGenes and compared them to the
>>>> sequences in knownGeneMrna.txt. Unfortunately about 1/4 of the
>>>> sequences differ, either in single nucleotide mutations
>>>>
>>>> substitution: uc008wki.1
>>>>
>>>> ...cctcctAtactggagct...
>>>> ...cctcctGtactggagct...
>>>>
>>>> or different exon start/end positions:
>>>>
>>>> start: uc008wjb.1
>>>>
>>>> cggcgtgggactgggagtccgtcc...
>>>> gcgtgggactgggagtccgtccgg...
>>>>
>>>> end: uc008wkk.1
>>>>
>>>> ...gatttttttaaccataaaaaaaaaaaaaaaaaaaaaaaaaa
>>>> ...gatttttttaaccata
>>>>
>>>> Can anyone please explain these differences and/or give me a hint
>>>> which data to use (I'm looking for motifs in the processed mRNA).
>>>>
>>>> Many Thanks.
>>>>
>>>> Marten
>>>>
>>

--
Marten Jäger, Msc Bioinformatik
Charité - Universitätsmedizin Berlin
Campus Virchow Klinikum
Institut für Medizinische Genetik und Humangenetik
Augustenburger Platz 1
13353 Berlin
Germany
phone: +49/30/450 569135
email: [email protected]
http://genetik.charite.de/institut/
http://compbio.charite.de
_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
