Hi Marten,

It turns out I didn't have the whole story! The problem is that knownGeneMrna and kgTargetAli are not what they appear to be.

The knownGeneMrna table contains, for each UCSC Gene, a representative mRNA sequence that most closely matches the gene prediction generated by the UCSC Genes pipeline. It is chosen after the fact, and we do not actually align these sequences to the genome.

The kgTargetAli table is in PSL format, but it is not the result of an actual alignment with BLAT. It is a "fake PSL" that is made from the predicted genomic exons. The "query sequences" (which would usually be in a fasta file that goes with the PSL) do not actually exist anywhere.

We could generate such a fasta file for you, though. It would consist of mRNA predictions based on the genomic sequence and the gene model. If you would like that, please let us know.

I apologize for not getting you the correct information about these tables sooner! They have apparently been a source of much confusion in the past, too, and this is something that we would like to deal with better in the next UCSC Genes build.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

On 02/10/11 05:26, Marten Jäger wrote:
> Hi.
>
> You're right. It seems to get more and more confusing.
>
>> Hi Marten,
>>
>> I think I've somehow made this more confusing than it should be! Let
>> me start by answering your most recent questions:
>>
>>> Okay, the RefSeq and GenBank RNAs were aligned to the chromosomes and
>>> I assume that the peptide sequences stored in the knownGeneMrna are
>>> taken from RefSeq/GenBank.
>>
>> Right. The whole process is described on the UCSC Genes track details
>> page. One way to see that is to go to the Table Browser
>> (http://genome.ucsc.edu/cgi-bin/hgTables), select the UCSC Genes
>> track, and hit the "describe table schema" button. You will also be
>> able to see a list of the tables related to the knownGene table.
>
> Okay, the Table Browser->"describe table schema" is the starting point
> for data research. I read the descriptions and had a look at the
> referenced tables.
>
>>
>>> Is there a table where I can find the information from the BLAT
>>> alignment (mismatches, indels, ...)?
>>
>> Yes, the kgTargetAli table, which is in PSL format (and PSL is the
>> alignment format that is output by BLAT).
>
> Maybe I am wrong or do not understand the format of this table, but
> to me it seems that the kgTargetAli table is incomplete or wrong. For
> hg19/mm9, every mismatch count in the alignments of the
> RefSeq and GenBank RNAs (is this correct? I assume from
> knownGenePipeline step 1.) to the chromosomes is '0', while
> the match count is exactly the length of the complete joined
> sequence of all exons.
>
>>
>> Maybe you can clarify again what it is you are trying to do. Do you
>> want chromosomal/genomic sequence for each UCSC Gene? Or are you
>> trying to get mRNA sequence?
>>
>> If it is the former, you can do it quite easily with the Table Browser
>> by selecting the UCSC Genes track, then "output format: sequence," and
>> then choose "genomic" on the next page. There are options to retrieve
>> sequence for only the exons. (There is no such option for the mRNA or
>> protein sequence.)
>>
>> Let us know what you are trying to accomplish and what your
>> outstanding questions are, and I or someone else on the team can try
>> to help.
>
> So my intention was to predict sequence motifs on mRNA sequences. To
> reduce redundancy I assumed it would be good to do this at the exon level,
> since selections of exons of one gene are reassembled into various
> transcripts by alternative splicing and I am especially interested in
> motifs spanning the exon-exon junction. Therefore I built a database
> which stores the exon sequences (and their links to the transcripts). To
> validate my scripts I assembled the transcript sequences, by translating
> the chromosomal sequence into mRNA, and compared them to those in
> knownGeneMrna.
> Here I ran into the problem that 1/4-1/5 of the assembled
> mRNA sequences do not match the sequences in knownGeneMrna. So I
> started to check manually where the differences are and ran into various
> examples (disregarding poly-A tails).
> I asked and you mentioned that the alignments can be found in the
> kgTargetAli file. Unfortunately I could not find information in the
> table to clarify these questions.
>
> To come back to my examples:
>
> deletion: uc008whh.1
>
> knownGeneMrna:   ...tttctgtttttttttttttttttttttt-aacctagaatct...
> assembled exons: ...tttctgttttttttttttttttttttttTaacctagaatct...
>
> I found this line
> 612 2520 0 0 0 0 0 4 5429 - uc008whh.1
> 2520 0 2520 chr5 152537259 3639968 3647917 5
> 1612,184,60,126,538, 0,1612,1796,1856,1982,
> 3639968,3643557,3644880,3646783,3647379,
>
> but would expect something like:
> 612 2520 0 0 0 1 1 4 5429 - uc008whh.1
> 2520 0 2520 chr5 152537259 3639968 3647917 5
> 1612,184,60,126,538, 0,1612,1796,1856,1982,
> 3639968,3643557,3644880,3646783,3647379,
>
>
> substitution: uc008wki.1
>
> knownGeneMrna:   ...cctcctAtactggagct...
> assembled exons: ...cctcctGtactggagct...
>
>
> kgTargetAli:
> 649 3707 0 0 0 0 0 12 33434 +
> uc008wki.1 3707 0 3707 chr5 ...
>
> expect:
> 649 3706 1 0 0 0 0 12 33434 +
> uc008wki.1 3707 0 3707 chr5 ...
>
>
> various: uc008wii.1
>
>
> kgTargetAli:
> 9 4509 0 0 0 0 0 14 571956 - uc008wii.1
> 4509 0 4509 chr5 ...
>
>
> expect:
> 9 4509 4 0 0 4 8 16 571958 - uc008wii.1
> 4509 0 4509 chr5 ... qStarts should also start with 13, ...
>
>
> alignment:
> >_ 4529 nt vs.
> >_ 4509 nt
> scoring matrix: , gap penalties: -12/-2
> 99.4% identity; Global alignment score: 17865
>
>                10        20        30        40        50        60
> 649550 AATTCGGCACGAGCGCCGTTGTCTGCGCTGCGCTGCGCTGCGCTGGACCAGTTTCGCGAA
>                     :::::::::::::::::::::::::::::::::::::::::::::::
> _      -------------CGCCGTTGTCTGCGCTGCGCTGCGCTGCGCTGGACCAGTTTCGCGAA
>                             10        20        30        40
>
> ...
>
>               730       740       750       760       770       780
> 649550 CGTGCACACTGATTTATGTCAGTACATGGAACAGCACCCTGGAGGACTCCATCCAGATAA
>        ::::::::::::::::::::::::::::::  ::::::::::::::::::::::::::::
> _      CGTGCACACTGATTTATGTCAGTACATGGACAAGCACCCTGGAGGACTCCATCCAGATAA
>        710       720       730       740       750       760
>
> ...
>
>              1750      1760      1770      1780      1790      1800
> 649550 AAGAACTACGTTACGATTAAGCTTTGCTTACTGCTACATGGCATGTATTCTTTTCCGTCT
>        ::::::::::: ::::::::::::::::::::::::::::::::::::::::::::::::
> _      AAGAACTACGTGACGATTAAGCTTTGCTTACTGCTACATGGCATGTATTCTTTTCCGTCT
>       1730      1740      1750      1760      1770      1780
>
> ...
>
>              1990      2000      2010      2020      2030      2040
> 649550 TGTTTTCCCTGAGAGCAGAGTGCATTCTGCAACCTCCAGGGAAGAACATTCTTTTTGCTA
>        :::::::::::::::::: ::::::::::::::::::::: :::::::::::::::::::
> _      TGTTTTCCCTGAGAGCAGGGTGCATTCTGCAACCTCCAGG-AAGAACATTCTTTTTGCTA
>       1970      1980      1990      2000       2010      2020
>
> ...
>
>              2470      2480      2490      2500      2510      2520
> 649550 GAAAAAAAAAAATCTGTCTGTCAGGGTAGGTCCTGAATGCAGCCTTGGCTGATTAAAGCT
>        ::::::::::: ::::::::::::::::::::::::::::::::::::::::::::::::
> _      GAAAAAAAAAA-TCTGTCTGTCAGGGTAGGTCCTGAATGCAGCCTTGGCTGATTAAAGCT
>        2450       2460      2470      2480      2490      2500
>
>              2530      2540      2550      2560      2570      2580
> 649550 TAGAAATCACATTTTATAATTATCCAGACTTTAAAATGTGCTTATTTACGACAAAGGACC
>        :::::::::::::::: :::::::::::::::::::::::::::::::::::::::::::
> _      TAGAAATCACATTTTAAAATTATCCAGACTTTAAAATGTGCTTATTTACGACAAAGGACC
>         2510      2520      2530      2540      2550      2560
>
>              2590      2600      2610      2620      2630      2640
> 649550 TTTGAATTTAATTCGATGTTCAGAAACATTCCAGGCCGTTCGGAAGGCATCACTGGGTAC
>        :::::     ::::::::::::::::::::::::::::::::::::::::::::::::::
> _      TTTGA-----ATTCGATGTTCAGAAACATTCCAGGCCGTTCGGAAGGCATCACTGGGTAC
>         2570           2580      2590      2600      2610      2620
>
> ...
>
>              3490       3500      3510      3520      3530
> 649550 GAAGATTATGTTTGT-TTTCACTAAGTAGAAGTCAGGAGCTCACAGGAATGCTGGGAGGG
>        ::::::::::::::: :::::::::::::::::::::::::::::::::::::::::::
> _      GAAGATTATGTTTGTATTTCACTAAGTAGAAGTCAGGAGCTCACAGGAATGCTGGGAGG-
>              3470      3480      3490      3500      3510
>
> ...
>
>     3720      3730      3740      3750      3760      3770
> 649550 TCTGCCCCCACCCCTCCACCCAACACAGTCCCCCTTCTCTGGCTTTTGCTCTCCTGGCCT
>        :::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::
> _      TCTGCCCCCACCCCTCCACCCAACACAGTCCCCCTTCTCTGGCTTT-GCTCTCCTGGCCT
>     3700      3710      3720      3730      3740       3750
>
> ...
>
>     4020      4030      4040       4050      4060      4070
> 649550 ATTAAATACAACATCCATGGGACAGGAAA-TGTGTTTGCTATAAAATTAGAGATATAAGG
>        ::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::
> _      ATTAAATACAACATCCATGGGACAGGAAAATGTGTTTGCTATAAAATTAGAGATATAAGG
>      4000      4010      4020      4030      4040      4050
>
> ...
>
>
>
> Is it correct that small indels and mismatches in the query are not
> reported by the PSL format?
>
>
> Maybe a workaround would be to use the knownGeneMrna sequences. However,
> this way I would need the start/end positions of the exons in the query
> sequences/mRNAs, which do not match those in knownGene or kgTargetAli
> (see uc008wii.1).
>
> Any suggestions?
>
>
> Thanks.
>
> Marten
>
>
>
>>
>> --
>> Brooke Rhead
>> UCSC Genome Bioinformatics Group
>>
>>
>>
>> On 02/09/11 02:39, Marten Jäger wrote:
>>> Hi.
>>>
>>> I am told that the given example was a bad choice (since the poly-A
>>> tail is not encoded in the chromosomal sequence). Nonetheless there
>>> are better examples:
>>>
>>> uc008wii.1 - kgTargetAli & knownGene assembled exon length: 4509
>>>              knownGeneMrna sequence length: 4529
>>>
>>>
>>> uc008wjb.1 - kgTargetAli & knownGene assembled exon length: 1208
>>>              knownGeneMrna sequence length: 1210
>>>
>>> For both examples there seem to be index errors for the exon start
>>> and/or stop coordinates...?
>>>
>>> uc008whh.1 - there is a single 't' missing in the knownGeneMrna
>>> sequence (1. exon) in comparison to the chromosomal sequence.
>>>
>>> There are a lot of examples where the sequences only differ in SNPs
>>> or micro indels.
>>>
>>> Okay, the RefSeq and GenBank RNAs were aligned to the chromosomes and
>>> I assume that the peptide sequences stored in the knownGeneMrna are
>>> taken from RefSeq/GenBank.
>>> Is there a table where I can find the information from the BLAT
>>> alignment (mismatches, indels, ...)?
>>>
>>>
>>> Marten
>>>
>>>
>>>> Hi Brooke,
>>>>
>>>>
>>>>> Hi Marten,
>>>>>
>>>>> So, for each known gene, you want to generate a sequence that
>>>>> consists of only the exons, correct?
>>>>
>>>> That's correct, I need the mRNA sequence.
>>>>
>>>>> There is not enough information to do it with knownGene.txt, as you
>>>>> pointed out, because the coordinates listed are only for the
>>>>> genome, and tell you nothing about the coordinates of the mRNA.
>>>>
>>>> Why not? I can use the strand information and exonStarts/exonEnds
>>>> chromosomal coordinates to get the exon sequences from chr?.fa for
>>>> each known gene.
>>>>
>>>>>
>>>>> Instead you could use kgTargetAli. It gives information about the
>>>>> alignment of the mRNA to the genome, and it is in psl format:
>>>>> http://genome.ucsc.edu/FAQ/FAQformat.html#format2
>>>>
>>>> I think I can completely reconstruct the data by using the
>>>> knownGene.txt.
>>>>
>>>> bin         - of no interest
>>>> matches     - this is the sum of knownGene: exonEnds-exonStarts
>>>> misMatches  - this is always '0' at least for mm9,hg19
>>>> repMatches  - ''
>>>> nCount      - ''
>>>> qNumInsert  - ''
>>>> qBaseInsert - ''
>>>> tNumInsert  - number of introns in between the exons (number of
>>>>               knownGene: exonEnds/exonStarts-1)
>>>> tBaseInsert - length of the introns (tNumInsert) - difference
>>>>               between knownGene: exonEnds(n) & exonStarts(n+1)
>>>> strand      - knownGene: strand
>>>> qName       - knownGene: name
>>>> qSize       - same as matches
>>>> qStart      - this is always '0' at least for mm9,hg19
>>>> qEnd        - same as matches
>>>> tName       - knownGene: chrom
>>>> tSize       - of no interest
>>>> tStart      - knownGene: txStart
>>>> tEnd        - knownGene: txEnd
>>>> blockCount  - knownGene: exonCount
>>>> blockSizes  - knownGene: exonEnds-exonStarts
>>>> qStarts     - 0, sum(exonEnds(i)-exonStarts(i)) from i= 1 : n-1
>>>> tStarts     - knownGene: exonStarts
>>>>
>>>>
>>>> So you see there is no more information (w/o tSize) stored in the
>>>> kgTargetAli file than in knownGene.
>>>>
>>>>>
>>>>> You could use the qStart and qEnd fields to get the start and end
>>>>> positions of the parts of each mRNA that aligned.
>>>>
>>>> As mentioned above, this is the same information I can reconstruct
>>>> from knownGene. I still have the problem that I can't reconstruct
>>>> the exact sequence as stored in the knownGeneMrna file.
>>>>
>>>> Coming back to my example 'uc008wkk.1'
>>>>
>>>> The entry in kgTargetAli is:
>>>> 81 3675 0 0 0 0 0 9 128942 -
>>>> uc008wkk.1 3675 0 3675 chr5 152537259 8490335
>>>> 8622952 10 2254,122,158,169,81,90,86,134,116,465,
>>>> 0,2254,2376,2534,2703,2784,2874,2960,3094,3210,
>>>> 8490335,8494783,8520858,8528605,8548235,8559411,8569494,8579024,8603869,8622487,
>>>>
>>>>
>>>>
>>>> I can generate the mRNA sequence using knownGene with a size of 3675
>>>> bases. On the other hand the sequence in knownGeneMrna has 3700
>>>> bases (the poly-A tail).
>>>>
>>>> So maybe you know where I can find the additional information to
>>>> generate the exact sequences as in knownGeneMrna or are they not
>>>> stored somewhere in the UCSC database?
>>>>
>>>>
>>>> Thanks a lot.
>>>>
>>>> Marten
>>>>
>>>>
>>>>
>>>>>
>>>>> --
>>>>> Brooke Rhead
>>>>> UCSC Genome Bioinformatics Group
>>>>>
>>>>>
>>>>>
>>>>> On 02/08/11 03:47, Marten Jäger wrote:
>>>>>> Hi.
>>>>>>
>>>>>> Thanks Brooke for your answer and illustrations. With the given
>>>>>> links I now understand the problem I ran into.
>>>>>>
>>>>>> My intention was to reduce data redundancy and run the motif
>>>>>> search genome wide only on the exons and assemble the data
>>>>>> afterwards for each known gene, transcript, ...
>>>>>> As far as I now understand this is not possible. On the other hand
>>>>>> it's not possible to reproduce the exons from knownGeneMrna.txt
>>>>>> since the exon start / end indices (--> length) from knownGene.txt
>>>>>> do not match in 1/4-1/5 of the data, or SNPs could not be considered.
>>>>>> Any suggestions? Maybe I should abandon the idea of data reduction.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> Marten
>>>>>>
>>>>>>> Hi Marten,
>>>>>>>
>>>>>>> The differences you are seeing are definitely expected.
>>>>>>>
>>>>>>> The sequence found at
>>>>>>> ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/... is
>>>>>>> the mouse reference genome sequence, and it came from sequencing
>>>>>>> mouse DNA. The sequence in knownGeneMrna.txt is based on mRNA and
>>>>>>> protein sequence from several sources (click on the blue "UCSC
>>>>>>> Genes" link on http://genome.ucsc.edu/cgi-bin/hgTracks to read
>>>>>>> more about how this file was created). The knownGeneMrna
>>>>>>> sequence is aligned to the genomic sequence using BLAT. The
>>>>>>> single base differences are SNPs, and the different exon
>>>>>>> start/end positions are a result of mRNA sequence not aligning to
>>>>>>> the genome, for instance, when there is a polyA tail on the mRNA.
>>>>>>>
>>>>>>> If you need mRNA sequence, I suggest using the knownGeneMrna.txt
>>>>>>> sequence rather than the genomic sequence.
>>>>>>>
>>>>>>> I hope this is helpful. If you have further questions, please
>>>>>>> feel free to contact us again at [email protected].
>>>>>>>
>>>>>>> --
>>>>>>> Brooke Rhead
>>>>>>> UCSC Genome Bioinformatics Group
>>>>>>>
>>>>>>>
>>>>>>> On 02/07/11 05:00, Marten Jäger wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I downloaded the chromosomal sequences
>>>>>>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/...)
>>>>>>>> and the Database files
>>>>>>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/) for
>>>>>>>> knownGene.txt and knownGeneMrna.txt from UCSC. Using the
>>>>>>>> chromosomal locations for the exons from knownGene.txt I
>>>>>>>> extracted the mRNA sequences for the knownGenes and compared
>>>>>>>> them to the sequences in knownGeneMrna.txt. Unfortunately about
>>>>>>>> 1/4 of the sequences differ in single nucleotide mutations
>>>>>>>>
>>>>>>>> substitution: uc008wki.1
>>>>>>>>
>>>>>>>> ...cctcctAtactggagct...
>>>>>>>> ...cctcctGtactggagct...
>>>>>>>>
>>>>>>>> or different exon start/end positions:
>>>>>>>>
>>>>>>>> start: uc008wjb.1
>>>>>>>>
>>>>>>>> cggcgtgggactgggagtccgtcc...
>>>>>>>> gcgtgggactgggagtccgtccgg...
>>>>>>>>
>>>>>>>> end: uc008wkk.1
>>>>>>>>
>>>>>>>> ...gatttttttaaccataaaaaaaaaaaaaaaaaaaaaaaaaa
>>>>>>>> ...gatttttttaaccata
>>>>>>>>
>>>>>>>>
>>>>>>>> Can anyone please explain these differences and/or give me a
>>>>>>>> hint which data to use (I'm looking for motifs in the processed
>>>>>>>> mRNA).
>>>>>>>>
>>>>>>>> Many Thanks.
>>>>>>>>
>>>>>>>> Marten
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>>
>
_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
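
For readers following the thread, here is a minimal Python sketch of the field mapping Marten lists: how a "fake PSL" row can be derived from a knownGene-style gene model alone, with no alignment involved. It is not the UCSC pipeline code. The function name fake_psl_from_gene_model and the dictionary layout are made up for illustration, and the exon ends in the example are reconstructed here from the tStarts and blockSizes of the uc008wkk.1 kgTargetAli row quoted in the thread, not read from knownGene.

```python
from typing import Dict, List


def fake_psl_from_gene_model(name: str, chrom: str, strand: str,
                             exon_starts: List[int], exon_ends: List[int],
                             chrom_size: int) -> Dict[str, object]:
    """Derive PSL-style fields from exon coordinates only (hypothetical helper).

    Assumes exons are sorted by genomic start and do not overlap.
    """
    if not exon_starts or len(exon_starts) != len(exon_ends):
        raise ValueError("exon_starts and exon_ends must be non-empty and equal length")

    block_sizes = [end - start for start, end in zip(exon_starts, exon_ends)]

    # qStarts: running total of the exon lengths that precede each block.
    q_starts = []
    total = 0
    for size in block_sizes:
        q_starts.append(total)
        total += size

    t_start, t_end = exon_starts[0], exon_ends[-1]
    return {
        "matches": total,            # every exon base counts as a match
        "misMatches": 0, "repMatches": 0, "nCount": 0,
        "qNumInsert": 0, "qBaseInsert": 0,
        "tNumInsert": len(block_sizes) - 1,        # one insert per intron
        "tBaseInsert": (t_end - t_start) - total,  # total intron length
        "strand": strand, "qName": name,
        "qSize": total, "qStart": 0, "qEnd": total,
        "tName": chrom, "tSize": chrom_size,
        "tStart": t_start, "tEnd": t_end,
        "blockCount": len(block_sizes),
        "blockSizes": block_sizes, "qStarts": q_starts,
        "tStarts": list(exon_starts),
    }


# uc008wkk.1 values taken from the kgTargetAli row quoted in the thread.
T_STARTS = [8490335, 8494783, 8520858, 8528605, 8548235,
            8559411, 8569494, 8579024, 8603869, 8622487]
BLOCK_SIZES = [2254, 122, 158, 169, 81, 90, 86, 134, 116, 465]
# Assumption: exon end = exon start + block size.
EXON_ENDS = [s + b for s, b in zip(T_STARTS, BLOCK_SIZES)]

row = fake_psl_from_gene_model("uc008wkk.1", "chr5", "-",
                               T_STARTS, EXON_ENDS, 152537259)
print(row["matches"], row["tNumInsert"], row["tBaseInsert"], row["qStarts"])
```

Because every field is computed from the exon coordinates, misMatches and the query-insert counts are zero by construction, which is why the table cannot show the SNPs and small indels that separate knownGeneMrna from the assembled exons.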
