Hi. You're right. It seems to get more and more confusing.

> Hi Marten,
>
> I think I've somehow made this more confusing than it should be! Let
> me start by answering your most recent questions:
>
>> Okay, the RefSeq and GenBank RNAs were aligned to the chromosomes and
>> I assume that the peptide sequences stored in the knownGeneMrna are
>> taken from RefSeq/GenBank.
>
> Right. The whole process is described on the UCSC Genes track details
> page. One way to see that is to go to the Table Browser
> (http://genome.ucsc.edu/cgi-bin/hgTables), select the UCSC Genes
> track, and hit the "describe table schema" button. You will also be
> able to see a list of the tables related to the knownGene table.

Okay, so the Table Browser's "describe table schema" button is the
starting point for researching the data. I read the descriptions and
had a look at the referenced tables.

>
>> Is there a table where I an find the information from the BLAT
>> alignment (missmatches,indels,...)?
>
> Yes, the kgTargetAli table, which is in PSL format (and PSL is the
> alignment format that is output by BLAT).

Maybe I am wrong or do not understand the format of this table, but to
me the kgTargetAli table seems incomplete or wrong. For hg19/mm9, every
mismatch count in the alignments of the RefSeq and GenBank RNAs to the
chromosomes is '0' (is it correct that these are the aligned sequences?
I assume so from step 1 of the knownGene pipeline). The match count, on
the other hand, is exactly the length of the joined sequence of all
exons.

>
> Maybe you can clarify again what it is you are trying to do. Do you
> want chromosomal/genomic sequence for each UCSC Gene? Or are you
> trying to get mRNA sequence?
>
> If it is the former, you can do it quite easily with the Table Browser
> by selecting the UCSC Genes track, then "output format: sequence," and
> then choose "genomic" on the next page. There are options to retrieve
> sequence for only the exons. (There is no such option for the mRNA or
> protein sequence.)
>
> Let us know what you are trying to accomplish and what your
> outstanding questions are, and I or someone else on the team can try
> to help.

My intention was to predict sequence motifs on mRNA sequences. To
reduce redundancy, I assumed it would be good to do this at the exon
level, since selections of the exons of one gene are reassembled into
various transcripts by alternative splicing, and I am especially
interested in motifs spanning exon-exon junctions. Therefore I built a
database which stores the exon sequences (and their links to the
transcripts). To validate my scripts, I assembled the transcript
sequences by converting the chromosomal exon sequences into mRNA and
compared them to those in knownGeneMrna. Here I ran into the problem
that 1/4-1/5 of the assembled mRNA sequences do not match the sequences
in knownGeneMrna. So I started to check manually where the differences
are and found various kinds (disregarding poly-A tails). I asked, and
you mentioned that the alignments can be found in the kgTargetAli
table. Unfortunately, I could not find any information in the table
that explains these differences.

To come back to my examples:

deletion: uc008whh.1
knownGeneMrna:   ...tttctgtttttttttttttttttttttt-aacctagaatct...
assembled exons: ...tttctgttttttttttttttttttttttTaacctagaatct...

I found this line
612 2520 0 0 0 0 0 4 5429 - uc008whh.1 2520 0 2520 chr5 152537259 3639968 3647917 5 1612,184,60,126,538, 0,1612,1796,1856,1982, 3639968,3643557,3644880,3646783,3647379,
but would expect something like:
612 2520 0 0 0 1 1 4 5429 - uc008whh.1 2520 0 2520 chr5 152537259 3639968 3647917 5 1612,184,60,126,538, 0,1612,1796,1856,1982, 3639968,3643557,3644880,3646783,3647379,

substitution: uc008wki.1
knownGeneMrna:   ...cctcctAtactggagct...
assembled exons: ...cctcctGtactggagct...
kgTargetAli: 649 3707 0 0 0 0 0 12 33434 + uc008wki.1 3707 0 3707 chr5 ...
expect:      649 3706 1 0 0 0 0 12 33434 + uc008wki.1 3707 0 3707 chr5 ...

various: uc008wii.1
kgTargetAli: 9 4509 0 0 0 0 0 14 571956 - uc008wii.1 4509 0 4509 chr5 ...
expect:      9 4509 4 0 0 4 8 16 571958 - uc008wii.1 4509 0 4509 chr5 ...
qStarts should also start with 13, ...

alignment:
>_ 4529 nt vs.
>_ 4509 nt
scoring matrix: , gap penalties: -12/-2
99.4% identity; Global alignment score: 17865

               10        20        30        40        50        60
649550 AATTCGGCACGAGCGCCGTTGTCTGCGCTGCGCTGCGCTGCGCTGGACCAGTTTCGCGAA
                    :::::::::::::::::::::::::::::::::::::::::::::::
_      -------------CGCCGTTGTCTGCGCTGCGCTGCGCTGCGCTGGACCAGTTTCGCGAA
                            10        20        30        40
...
              730       740       750       760       770       780
649550 CGTGCACACTGATTTATGTCAGTACATGGAACAGCACCCTGGAGGACTCCATCCAGATAA
       ::::::::::::::::::::::::::::::  ::::::::::::::::::::::::::::
_      CGTGCACACTGATTTATGTCAGTACATGGACAAGCACCCTGGAGGACTCCATCCAGATAA
       710       720       730       740       750       760
...
             1750      1760      1770      1780      1790      1800
649550 AAGAACTACGTTACGATTAAGCTTTGCTTACTGCTACATGGCATGTATTCTTTTCCGTCT
       ::::::::::: ::::::::::::::::::::::::::::::::::::::::::::::::
_      AAGAACTACGTGACGATTAAGCTTTGCTTACTGCTACATGGCATGTATTCTTTTCCGTCT
      1730      1740      1750      1760      1770      1780
...
             1990      2000      2010      2020      2030      2040
649550 TGTTTTCCCTGAGAGCAGAGTGCATTCTGCAACCTCCAGGGAAGAACATTCTTTTTGCTA
       :::::::::::::::::: ::::::::::::::::::::: :::::::::::::::::::
_      TGTTTTCCCTGAGAGCAGGGTGCATTCTGCAACCTCCAGG-AAGAACATTCTTTTTGCTA
      1970      1980      1990      2000       2010      2020
...
             2470      2480      2490      2500      2510      2520
649550 GAAAAAAAAAAATCTGTCTGTCAGGGTAGGTCCTGAATGCAGCCTTGGCTGATTAAAGCT
       ::::::::::: ::::::::::::::::::::::::::::::::::::::::::::::::
_      GAAAAAAAAAA-TCTGTCTGTCAGGGTAGGTCCTGAATGCAGCCTTGGCTGATTAAAGCT
       2450       2460      2470      2480      2490      2500

             2530      2540      2550      2560      2570      2580
649550 TAGAAATCACATTTTATAATTATCCAGACTTTAAAATGTGCTTATTTACGACAAAGGACC
       :::::::::::::::: :::::::::::::::::::::::::::::::::::::::::::
_      TAGAAATCACATTTTAAAATTATCCAGACTTTAAAATGTGCTTATTTACGACAAAGGACC
        2510      2520      2530      2540      2550      2560

             2590      2600      2610      2620      2630      2640
649550 TTTGAATTTAATTCGATGTTCAGAAACATTCCAGGCCGTTCGGAAGGCATCACTGGGTAC
       :::::     ::::::::::::::::::::::::::::::::::::::::::::::::::
_      TTTGA-----ATTCGATGTTCAGAAACATTCCAGGCCGTTCGGAAGGCATCACTGGGTAC
        2570           2580      2590      2600      2610      2620
...
             3490       3500      3510      3520      3530
649550 GAAGATTATGTTTGT-TTTCACTAAGTAGAAGTCAGGAGCTCACAGGAATGCTGGGAGGG
       ::::::::::::::: :::::::::::::::::::::::::::::::::::::::::::
_      GAAGATTATGTTTGTATTTCACTAAGTAGAAGTCAGGAGCTCACAGGAATGCTGGGAGG-
             3470      3480      3490      3500      3510
...
    3720      3730      3740      3750      3760      3770
649550 TCTGCCCCCACCCCTCCACCCAACACAGTCCCCCTTCTCTGGCTTTTGCTCTCCTGGCCT
       :::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::
_      TCTGCCCCCACCCCTCCACCCAACACAGTCCCCCTTCTCTGGCTTT-GCTCTCCTGGCCT
    3700      3710      3720      3730      3740       3750
...
    4020      4030      4040       4050      4060      4070
649550 ATTAAATACAACATCCATGGGACAGGAAA-TGTGTTTGCTATAAAATTAGAGATATAAGG
       ::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::
_      ATTAAATACAACATCCATGGGACAGGAAAATGTGTTTGCTATAAAATTAGAGATATAAGG
     4000      4010      4020      4030      4040      4050
...

Is it correct that small indels and mismatches in the query are not
reported by the PSL format? Maybe a workaround would be to use the
knownGeneMrna sequences. However, that way I would need the start/end
positions of the exons in the query sequences/mRNAs, which do not match
those in knownGene or kgTargetAli (see uc008wii.1). Any suggestions?

Thanks.

Marten

>
> --
> Brooke Rhead
> UCSC Genome Bioinformatics Group
>
>
>
> On 02/09/11 02:39, Marten Jäger wrote:
>> Hi.
>>
>> I am told that the given example was a bad choice (since the poly-A
>> tail is not encoded in the chromosomal sequence). Nonetheless there
>> are better examples:
>>
>> uc008wii.1 - kgTargetAli & knownGene assembled exon length: 4509
>> knownGeneMrna sequence length: 4529
>>
>>
>> uc008wjb.1 - kgTargetAli & knownGene assembled exon length: 1208
>> knownGeneMrna sequence length: 1210
>>
>> For both examples there seem to be index errors for the exon starts
>> and or stops coordinates...?
>>
>> uc008whh.1 - there is a single 't' missing in the knownGeneMrna
>> sequence (1. exon) in comparison to the chromosomal sequence.
>>
>> There are a lot of examples where the sequences only differ in SNPs
>> or micro indels.
>>
>> Okay, the RefSeq and GenBank RNAs were aligned to the chromosomes and
>> I assume that the peptide sequences stored in the knownGeneMrna are
>> taken from RefSeq/GenBank.
>> Is there a table where I an find the information from the BLAT
>> alignment (missmatches,indels,...)?
>>
>>
>> Marten
>>
>>
>>> Hi Brooke,
>>>
>>>
>>>> Hi Marten,
>>>>
>>>> So, for each known gene, you want to generate a sequence that
>>>> consists of only the exons, correct?
>>>
>>> That's correct, I need the mRNA sequence.
>>>
>>>> There is not enough information to do it with knownGene.txt, as you
>>>> pointed out, because the coordinates listed are only for the
>>>> genome, and tell you nothing about the coordinates of the mRNA.
>>>
>>> Why not? I can use the strand information and exonStarts/exonEnds
>>> chromosomal coordinates to get the exon sequences from chr?.fa for
>>> each known gene.
>>>
>>>>
>>>> Instead you could use kgTargetAli. It gives information about the
>>>> alignment of the mRNA to the genome, and it is in psl format:
>>>> http://genome.ucsc.edu/FAQ/FAQformat.html#format2
>>>
>>> I think I can completely reconstruct the data by using the
>>> knownGene.txt.
>>>
>>> bin - of no interest
>>> matches - this is the sum of knownGene: exonEnds-exonStarts
>>> misMatches - this is always '0' at least for mm9,hg19
>>> repMatches - ''
>>> nCount - ''
>>> qNumInsert - ''
>>> qBaseInsert - ''
>>> tNumInsert - number of introns in between the exons (number of
>>> knownGene: exonEnds/exonStarts-1)
>>> tBaseInsert - length of the introns (tNumInsert) - difference
>>> between knownGene: exonEnds(n) & exonStarts(n+1)
>>> strand - knownGene: strand
>>> qName - knownGene: name
>>> qSize - same as matches
>>> qStart -this is always '0' at least for mm9,hg19
>>> qEnd - same as matches
>>> tName - knownGene: chrom
>>> tSize - of no interest
>>> tStart - knownGene: txStart
>>> tEnd - knownGene: txEnd
>>> blockCount - knownGene: exonCount
>>> blockSizes -knownGene: exonEnds-exonStarts
>>> qStarts - 0, sum(exonEnds(i)-exonStarts(i)) from i= 1 : n-1
>>> tStarts - knownGene: exonStarts
>>>
>>>
>>> So you see there is no more information (w/o tSize) stored in the
>>> kgTargetAli file than in knownGene.
>>>
>>>>
>>>> You could use the qStart and qEnd fields to get the start and end
>>>> positions of the parts of each mRNA that aligned.
>>>
>>> As mentions above this is the same information I can reconstruct
>>> from knownGene. I still have the problem that I can't reconstruct
>>> the exact sequence as stored in the knownGeneMrna file.
>>>
>>> Coming back to my example 'c008wkk.1'
>>>
>>> The entry in kgTargetAli is:
>>> 81 3675 0 0 0 0 0 9 128942 -
>>> uc008wkk.1 3675 0 3675 chr5 152537259 8490335
>>> 8622952 10 2254,122,158,169,81,90,86,134,116,465,
>>> 0,2254,2376,2534,2703,2784,2874,2960,3094,3210,
>>> 8490335,8494783,8520858,8528605,8548235,8559411,8569494,8579024,8603869,8622487,
>>>
>>>
>>>
>>> I can generate the mRNA sequence using knownGene with a size of 3675
>>> bases. On the other hand the sequences in knownGeneMrna has 3700
>>> bases (the poly-A tail).
>>>
>>> So maybe you know where I can find the additional information to
>>> generate the exact sequences as in knownGeneMrna or are they not
>>> stored somewhere in the UCSC database?
>>>
>>>
>>> Thanks a lot.
>>>
>>> Marten
>>>
>>>
>>>
>>>>
>>>> --
>>>> Brooke Rhead
>>>> UCSC Genome Bioinformatics Group
>>>>
>>>>
>>>>
>>>> On 02/08/11 03:47, Marten Jäger wrote:
>>>>> Hi.
>>>>>
>>>>> Thanks Brooke for your answer and illustrations. With the given
>>>>> links I known understand the problem I run in.
>>>>>
>>>>> My intention was to reduce data redundancy and run the motif
>>>>> search genome wide only on the exons and assemble the data
>>>>> afterwards for each known gene, transcript, ...
>>>>> As far as I now understand this not possible. On the other hand
>>>>> it's not possible the reproduce the exons from knownGeneMrna.txt
>>>>> since the exon start / end indices (--> length) from knownGene.txt
>>>>> in 1/4-1/5 of the data not match or SNP could not be considered.
>>>>> Any suggestions? Maybe I should abandon the idea of data reduction.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Marten
>>>>>
>>>>>> Hi Marten,
>>>>>>
>>>>>> The differences you are seeing are definitely expected.
>>>>>>
>>>>>> The sequence found at
>>>>>> ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/... is
>>>>>> the mouse reference genome sequence, and it came from sequencing
>>>>>> mouse DNA. The sequence in knownGeneMrna.txt is based mRNA and
>>>>>> protein sequence from several sources (click on the blue "UCSC
>>>>>> Genes" link on http://genome.ucsc.edu/cgi-bin/hgTracks to read
>>>>>> more about how this file was created). The knownGeneMrna
>>>>>> sequence is aligned to the genomic sequence using BLAT. The
>>>>>> single base differences are SNPs, and the different exon
>>>>>> start/end positions are a result of mRNA sequence not aligning to
>>>>>> the genome, for instance, when there is a polyA tail on the mRNA.
>>>>>>
>>>>>> If you need mRNA sequence, I suggest using the knownGeneMrna.txt
>>>>>> sequence rather than the genomic sequence.
>>>>>>
>>>>>> I hope this is helpful. If you have further questions, please
>>>>>> feel free to contact us again at [email protected].
>>>>>>
>>>>>> --
>>>>>> Brooke Rhead
>>>>>> UCSC Genome Bioinformatics Group
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 02/07/11 05:00, Marten Jäger wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I downloaded the chromosomal sequences
>>>>>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/...)
>>>>>>> and the Database files
>>>>>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/) for
>>>>>>> knownGene.txt and knownGeneMrna.txt from UCSC. Using the
>>>>>>> chromosomal locations for the exons using knownGene.txt I
>>>>>>> extracted the mRNA Sequences for the knownGenes and compared
>>>>>>> them to the sequences in knownGeneMrna.txt. Unfortunately about
>>>>>>> 1/4 of the sequences differ in single nucleotide mutations
>>>>>>>
>>>>>>> substitution: uc008wki.1
>>>>>>>
>>>>>>> ...cctcctAtactggagct...
>>>>>>> ...cctcctGtactggagct...
>>>>>>>
>>>>>>> or different exon start/end positions:
>>>>>>>
>>>>>>> start: uc008wjb.1
>>>>>>>
>>>>>>> cggcgtgggactgggagtccgtcc...
>>>>>>> gcgtgggactgggagtccgtccgg...
>>>>>>>
>>>>>>> end: uc008wkk.1
>>>>>>>
>>>>>>> ...gatttttttaaccataaaaaaaaaaaaaaaaaaaaaaaaaa
>>>>>>> ...gatttttttaaccata
>>>>>>>
>>>>>>>
>>>>>>> Can anyone please explain these differences and/or give me a
>>>>>>> hint which data to use (I'm looking for motifs in the processed
>>>>>>> mRNA).
>>>>>>>
>>>>>>> Many Thanks.
>>>>>>>
>>>>>>> Marten
>>>>>>>
>>>>>>>
>>>>>
>>>
>>

--
Marten Jäger, Msc Bioinformatik
Charité - Universitätsmedizin Berlin
Campus Virchow Klinikum
Institut für Medizinische Genetik und Humangenetik
Augustenburger Platz 1
13353 Berlin
Germany

phone: +49/30/450 569135
email: [email protected]
http://genetik.charite.de/institut/
http://compbio.charite.de

_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
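
The field-by-field comparison of kgTargetAli and knownGene given in the thread can be checked with a few lines of Python. This is only a minimal sketch: the function name `psl_fields_from_exons` is made up for illustration, and it assumes a gapless, mismatch-free alignment, which is what the kgTargetAli rows quoted in the thread look like. The input values are taken from the uc008whh.1 row.

```python
def psl_fields_from_exons(exon_starts, exon_ends):
    """Derive the PSL columns that follow from exon coordinates alone.

    Hypothetical helper: assumes every exon aligns as one ungapped block.
    """
    block_sizes = [end - start for start, end in zip(exon_starts, exon_ends)]

    # qStarts: running total of the sizes of the preceding blocks.
    q_starts = []
    offset = 0
    for size in block_sizes:
        q_starts.append(offset)
        offset += size

    # Gaps on the target (genome) side are the introns between exons.
    introns = [exon_starts[i + 1] - exon_ends[i]
               for i in range(len(exon_starts) - 1)]

    return {
        "matches": sum(block_sizes),
        "tNumInsert": len(introns),
        "tBaseInsert": sum(introns),
        "tStart": exon_starts[0],
        "tEnd": exon_ends[-1],
        "blockCount": len(block_sizes),
        "blockSizes": block_sizes,
        "qStarts": q_starts,
        "tStarts": list(exon_starts),
    }


# uc008whh.1 (mm9): tStarts and blockSizes from the quoted kgTargetAli row.
t_starts = [3639968, 3643557, 3644880, 3646783, 3647379]
sizes = [1612, 184, 60, 126, 538]
exon_ends = [start + size for start, size in zip(t_starts, sizes)]

fields = psl_fields_from_exons(t_starts, exon_ends)
print(fields["matches"], fields["tNumInsert"], fields["tBaseInsert"])
# prints: 2520 4 5429
```

The derived values agree with the quoted row (matches 2520, tNumInsert 4, tBaseInsert 5429), which is consistent with the observation that this table carries no mismatch or small-indel information beyond what knownGene already contains.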

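
For motifs that span exon-exon junctions, the block boundaries are needed in mRNA coordinates rather than genome coordinates. The sketch below converts the blockSizes/qStarts of a PSL row into positions along the mRNA. The function names are hypothetical, and it relies on the PSL convention described in the format FAQ linked in the thread: for '-' strand rows, qStarts are given on the reverse-complemented query. It also assumes the blocks are contiguous on the mRNA, as in the rows quoted in the thread.

```python
def exon_bounds_in_mrna(block_sizes, q_starts, q_size, strand):
    """Return (start, end) of each aligned block along the mRNA.

    Coordinates are 0-based, half-open, and sorted 5'->3' on the mRNA.
    Hypothetical helper; for '-' strand rows the PSL qStarts refer to the
    reverse-complemented query, so they are flipped back using q_size.
    """
    bounds = []
    for size, q_start in zip(block_sizes, q_starts):
        if strand == "+":
            bounds.append((q_start, q_start + size))
        else:
            bounds.append((q_size - (q_start + size), q_size - q_start))
    return sorted(bounds)


def junctions(bounds):
    """mRNA positions where one block ends and the next one begins."""
    return [end for _, end in bounds[:-1]]


# uc008whh.1 ('-' strand), values from the quoted kgTargetAli row.
bounds = exon_bounds_in_mrna(
    block_sizes=[1612, 184, 60, 126, 538],
    q_starts=[0, 1612, 1796, 1856, 1982],
    q_size=2520,
    strand="-",
)
print(bounds)
# prints: [(0, 538), (538, 664), (664, 724), (724, 908), (908, 2520)]
print(junctions(bounds))
# prints: [538, 664, 724, 908]
```

These positions are only as good as the alignment row they come from: where the mRNA in knownGeneMrna contains small indels relative to the genome (as in uc008wii.1), boundaries derived from a gapless row will be shifted by the accumulated indel length.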