Hi. I am told that the given example was a bad choice (since the poly-A tail is not encoded in the chromosomal sequence). Nonetheless there are better examples:
uc008wii.1 - kgTargetAli & knownGene assembled exon length: 4509
             knownGeneMrna sequence length: 4529
uc008wjb.1 - kgTargetAli & knownGene assembled exon length: 1208
             knownGeneMrna sequence length: 1210

For both examples there seem to be index errors in the exon start and/or stop coordinates...?

uc008whh.1 - there is a single 't' missing in the knownGeneMrna sequence (first exon) in comparison to the chromosomal sequence.

There are a lot of examples where the sequences differ only in SNPs or micro-indels.

Okay, the RefSeq and GenBank RNAs were aligned to the chromosomes, and I assume that the peptide sequences stored in knownGeneMrna are taken from RefSeq/GenBank. Is there a table where I can find the information from the BLAT alignment (mismatches, indels, ...)?

Marten

> Hi Brooke,
>
>> Hi Marten,
>>
>> So, for each known gene, you want to generate a sequence that
>> consists of only the exons, correct?
>
> That's correct, I need the mRNA sequence.
>
>> There is not enough information to do it with knownGene.txt, as you
>> pointed out, because the coordinates listed are only for the genome,
>> and tell you nothing about the coordinates of the mRNA.
>
> Why not? I can use the strand information and exonStarts/exonEnds
> chromosomal coordinates to get the exon sequences from chr?.fa for
> each known gene.
>
>> Instead you could use kgTargetAli. It gives information about the
>> alignment of the mRNA to the genome, and it is in psl format:
>> http://genome.ucsc.edu/FAQ/FAQformat.html#format2
>
> I think I can completely reconstruct the data by using the knownGene.txt.
>
> bin - of no interest
> matches - this is the sum of knownGene: exonEnds-exonStarts
> misMatches - this is always '0', at least for mm9, hg19
> repMatches - ''
> nCount - ''
> qNumInsert - ''
> qBaseInsert - ''
> tNumInsert - number of introns in between the exons (number of
>   knownGene: exonEnds/exonStarts - 1)
> tBaseInsert - length of the introns (tNumInsert) - difference
>   between knownGene: exonEnds(n) & exonStarts(n+1)
> strand - knownGene: strand
> qName - knownGene: name
> qSize - same as matches
> qStart - this is always '0', at least for mm9, hg19
> qEnd - same as matches
> tName - knownGene: chrom
> tSize - of no interest
> tStart - knownGene: txStart
> tEnd - knownGene: txEnd
> blockCount - knownGene: exonCount
> blockSizes - knownGene: exonEnds-exonStarts
> qStarts - 0, sum(exonEnds(i)-exonStarts(i)) from i = 1 : n-1
> tStarts - knownGene: exonStarts
>
> So you see there is no more information (w/o tSize) stored in the
> kgTargetAli file than in knownGene.
>
>> You could use the qStart and qEnd fields to get the start and end
>> positions of the parts of each mRNA that aligned.
>
> As mentioned above, this is the same information I can reconstruct from
> knownGene. I still have the problem that I can't reconstruct the exact
> sequence as stored in the knownGeneMrna file.
>
> Coming back to my example 'uc008wkk.1'
>
> The entry in kgTargetAli is:
> 81 3675 0 0 0 0 0 9 128942 -
> uc008wkk.1 3675 0 3675 chr5 152537259 8490335
> 8622952 10 2254,122,158,169,81,90,86,134,116,465,
> 0,2254,2376,2534,2703,2784,2874,2960,3094,3210,
> 8490335,8494783,8520858,8528605,8548235,8559411,8569494,8579024,8603869,8622487,
>
> I can generate the mRNA sequence using knownGene with a size of 3675
> bases. On the other hand, the sequence in knownGeneMrna has 3700 bases
> (the poly-A tail).
>
> So maybe you know where I can find the additional information to
> generate the exact sequences as in knownGeneMrna, or are they not
> stored anywhere in the UCSC database?
>
> Thanks a lot.
>
> Marten
>
>>
>> --
>> Brooke Rhead
>> UCSC Genome Bioinformatics Group
>>
>> On 02/08/11 03:47, Marten Jäger wrote:
>>> Hi.
>>>
>>> Thanks Brooke for your answer and illustrations. With the given
>>> links I now understand the problem I ran into.
>>>
>>> My intention was to reduce data redundancy and run the motif search
>>> genome-wide only on the exons and assemble the data afterwards for
>>> each known gene, transcript, ...
>>> As far as I now understand, this is not possible. On the other hand,
>>> it's not possible to reproduce the exons from knownGeneMrna.txt, since
>>> the exon start/end indices (--> length) from knownGene.txt do not
>>> match in 1/4-1/5 of the data, or SNPs could not be considered. Any
>>> suggestions? Maybe I should abandon the idea of data reduction.
>>>
>>> Thanks.
>>>
>>> Marten
>>>
>>>> Hi Marten,
>>>>
>>>> The differences you are seeing are definitely expected.
>>>>
>>>> The sequence found at
>>>> ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/... is the
>>>> mouse reference genome sequence, and it came from sequencing mouse
>>>> DNA. The sequence in knownGeneMrna.txt is based on mRNA and protein
>>>> sequence from several sources (click on the blue "UCSC Genes" link
>>>> on http://genome.ucsc.edu/cgi-bin/hgTracks to read more about how
>>>> this file was created). The knownGeneMrna sequence is aligned to
>>>> the genomic sequence using BLAT. The single base differences are
>>>> SNPs, and the different exon start/end positions are a result of
>>>> mRNA sequence not aligning to the genome, for instance, when there
>>>> is a polyA tail on the mRNA.
>>>>
>>>> If you need mRNA sequence, I suggest using the knownGeneMrna.txt
>>>> sequence rather than the genomic sequence.
>>>>
>>>> I hope this is helpful. If you have further questions, please feel
>>>> free to contact us again at [email protected].
>>>>
>>>> --
>>>> Brooke Rhead
>>>> UCSC Genome Bioinformatics Group
>>>>
>>>> On 02/07/11 05:00, Marten Jäger wrote:
>>>>> Hi,
>>>>>
>>>>> I downloaded the chromosomal sequences
>>>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/...) and
>>>>> the database files
>>>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/) for
>>>>> knownGene.txt and knownGeneMrna.txt from UCSC. Using the
>>>>> chromosomal locations of the exons from knownGene.txt, I
>>>>> extracted the mRNA sequences for the knownGenes and compared them
>>>>> to the sequences in knownGeneMrna.txt. Unfortunately, about 1/4 of
>>>>> the sequences differ in single nucleotide mutations
>>>>>
>>>>> substitution: uc008wki.1
>>>>>
>>>>> ...cctcctAtactggagct...
>>>>> ...cctcctGtactggagct...
>>>>>
>>>>> or different exon start/end positions:
>>>>>
>>>>> start: uc008wjb.1
>>>>>
>>>>> cggcgtgggactgggagtccgtcc...
>>>>> gcgtgggactgggagtccgtccgg...
>>>>>
>>>>> end: uc008wkk.1
>>>>>
>>>>> ...gatttttttaaccataaaaaaaaaaaaaaaaaaaaaaaaaa
>>>>> ...gatttttttaaccata
>>>>>
>>>>> Can anyone please explain these differences and/or give me a hint
>>>>> which data to use (I'm looking for motifs in the processed mRNA).
>>>>>
>>>>> Many Thanks.
>>>>>
>>>>> Marten

--
Marten Jäger, Msc Bioinformatik
Charité - Universitätsmedizin Berlin
Campus Virchow Klinikum
Institut für Medizinische Genetik und Humangenetik
Augustenburger Platz 1
13353 Berlin
Germany
phone: +49/30/450 569135
email: [email protected]
http://genetik.charite.de/institut/
http://compbio.charite.de

_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
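
The mapping between knownGene exon coordinates and the kgTargetAli (psl) block fields listed in this thread can be checked with a short Python sketch. This is only an illustration under stated assumptions, not UCSC code: the helper name psl_fields_from_exons is hypothetical, exon intervals are assumed to be 0-based and half-open and sorted by chromosomal position, and the alignment is assumed to have no mismatches or query-side gaps (as reported above for mm9/hg19). The exon coordinates are not taken from knownGene.txt itself; they are derived from the tStarts and blockSizes of the uc008wkk.1 entry quoted above, assuming tStarts = exonStarts and blockSizes = exonEnds - exonStarts.

```python
def psl_fields_from_exons(exon_starts, exon_ends):
    """Derive psl-style block fields from knownGene-style exon coordinates.

    Hypothetical helper. Assumes 0-based, half-open exon intervals sorted
    by chromosomal position, and a gap-free, mismatch-free query.
    """
    if len(exon_starts) != len(exon_ends) or not exon_starts:
        raise ValueError("need equally long, non-empty start/end lists")

    # blockSizes: one block per exon.
    block_sizes = [end - start for start, end in zip(exon_starts, exon_ends)]

    # qStarts: running sum of the preceding block sizes, starting at 0.
    q_starts = []
    offset = 0
    for size in block_sizes:
        q_starts.append(offset)
        offset += size

    # Intron lengths: distance from one exon end to the next exon start.
    introns = [nxt - end for end, nxt in zip(exon_ends, exon_starts[1:])]

    return {
        "matches": offset,
        "qSize": offset,
        "tNumInsert": sum(1 for length in introns if length > 0),
        "tBaseInsert": sum(introns),
        "tStart": exon_starts[0],
        "tEnd": exon_ends[-1],
        "blockCount": len(block_sizes),
        "blockSizes": block_sizes,
        "qStarts": q_starts,
        "tStarts": list(exon_starts),
    }


# Values copied from the uc008wkk.1 kgTargetAli entry quoted in the thread.
T_STARTS = [8490335, 8494783, 8520858, 8528605, 8548235,
            8559411, 8569494, 8579024, 8603869, 8622487]
BLOCK_SIZES = [2254, 122, 158, 169, 81, 90, 86, 134, 116, 465]

# Assumption: exonEnds = exonStarts + blockSizes.
EXON_ENDS = [start + size for start, size in zip(T_STARTS, BLOCK_SIZES)]

fields = psl_fields_from_exons(T_STARTS, EXON_ENDS)
print(fields["matches"], fields["tNumInsert"], fields["tBaseInsert"])
# prints: 3675 9 128942
```

The derived values agree with the quoted entry (matches 3675, tNumInsert 9, tBaseInsert 128942, tEnd 8622952), which supports the observation that these psl fields carry no information beyond the exon coordinates. It does not explain the 3700-base knownGeneMrna sequence, since unaligned bases such as the poly-A tail are not represented in these fields.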
