Hi Marten, I think I've somehow made this more confusing than it should be! Let me start by answering your most recent questions:
> Okay, the RefSeq and GenBank RNAs were aligned to the chromosomes and
> I assume that the peptide sequences stored in the knownGeneMrna are
> taken from RefSeq/GenBank.

Right. The whole process is described on the UCSC Genes track details
page. One way to see that is to go to the Table Browser
(http://genome.ucsc.edu/cgi-bin/hgTables), select the UCSC Genes track,
and hit the "describe table schema" button. You will also be able to see
a list of the tables related to the knownGene table.

> Is there a table where I can find the information from the BLAT
> alignment (mismatches, indels, ...)?

Yes, the kgTargetAli table, which is in PSL format (PSL is the alignment
format that BLAT outputs).

Maybe you can clarify again what it is you are trying to do. Do you want
chromosomal/genomic sequence for each UCSC Gene? Or are you trying to
get mRNA sequence? If it is the former, you can do it quite easily with
the Table Browser by selecting the UCSC Genes track, then "output
format: sequence", and then choosing "genomic" on the next page. There
are options to retrieve sequence for only the exons. (There is no such
option for the mRNA or protein sequence.)

Let us know what you are trying to accomplish and what your outstanding
questions are, and I or someone else on the team can try to help.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

On 02/09/11 02:39, Marten Jäger wrote:
> Hi.
>
> I am told that the given example was a bad choice (since the poly-A tail
> is not encoded in the chromosomal sequence). Nonetheless there are
> better examples:
>
> uc008wii.1 - kgTargetAli & knownGene assembled exon length: 4509
>              knownGeneMrna sequence length: 4529
>
> uc008wjb.1 - kgTargetAli & knownGene assembled exon length: 1208
>              knownGeneMrna sequence length: 1210
>
> For both examples there seem to be index errors for the exon start
> and/or stop coordinates...?
>
> uc008whh.1 - there is a single 't' missing in the knownGeneMrna sequence
> (1.
> exon) in comparison to the chromosomal sequence.
>
> There are a lot of examples where the sequences only differ in SNPs or
> micro indels.
>
> Okay, the RefSeq and GenBank RNAs were aligned to the chromosomes and I
> assume that the peptide sequences stored in the knownGeneMrna are taken
> from RefSeq/GenBank.
> Is there a table where I can find the information from the BLAT alignment
> (mismatches, indels, ...)?
>
> Marten
>
>> Hi Brooke,
>>
>>> Hi Marten,
>>>
>>> So, for each known gene, you want to generate a sequence that
>>> consists of only the exons, correct?
>>
>> That's correct, I need the mRNA sequence.
>>
>>> There is not enough information to do it with knownGene.txt, as you
>>> pointed out, because the coordinates listed are only for the genome,
>>> and tell you nothing about the coordinates of the mRNA.
>>
>> Why not? I can use the strand information and exonStarts/exonEnds
>> chromosomal coordinates to get the exon sequences from chr?.fa for
>> each known gene.
>>
>>> Instead you could use kgTargetAli. It gives information about the
>>> alignment of the mRNA to the genome, and it is in psl format:
>>> http://genome.ucsc.edu/FAQ/FAQformat.html#format2
>>
>> I think I can completely reconstruct the data by using the knownGene.txt.
>>
>> bin         - of no interest
>> matches     - this is the sum of knownGene: exonEnds-exonStarts
>> misMatches  - this is always '0', at least for mm9, hg19
>> repMatches  - ''
>> nCount      - ''
>> qNumInsert  - ''
>> qBaseInsert - ''
>> tNumInsert  - number of introns in between the exons (number of
>>               knownGene: exonEnds/exonStarts-1)
>> tBaseInsert - length of the introns (tNumInsert) - difference
>>               between knownGene: exonEnds(n) & exonStarts(n+1)
>> strand      - knownGene: strand
>> qName       - knownGene: name
>> qSize       - same as matches
>> qStart      - this is always '0', at least for mm9, hg19
>> qEnd        - same as matches
>> tName       - knownGene: chrom
>> tSize       - of no interest
>> tStart      - knownGene: txStart
>> tEnd        - knownGene: txEnd
>> blockCount  - knownGene: exonCount
>> blockSizes  - knownGene: exonEnds-exonStarts
>> qStarts     - 0, sum(exonEnds(i)-exonStarts(i)) from i = 1 : n-1
>> tStarts     - knownGene: exonStarts
>>
>> So you see there is no more information (w/o tSize) stored in the
>> kgTargetAli file than in knownGene.
>>
>>> You could use the qStart and qEnd fields to get the start and end
>>> positions of the parts of each mRNA that aligned.
>>
>> As mentioned above, this is the same information I can reconstruct from
>> knownGene. I still have the problem that I can't reconstruct the exact
>> sequence as stored in the knownGeneMrna file.
>>
>> Coming back to my example 'uc008wkk.1'
>>
>> The entry in kgTargetAli is:
>> 81 3675 0 0 0 0 0 9 128942 -
>> uc008wkk.1 3675 0 3675 chr5 152537259 8490335
>> 8622952 10 2254,122,158,169,81,90,86,134,116,465,
>> 0,2254,2376,2534,2703,2784,2874,2960,3094,3210,
>> 8490335,8494783,8520858,8528605,8548235,8559411,8569494,8579024,8603869,8622487,
>>
>> I can generate the mRNA sequence using knownGene with a size of 3675
>> bases. On the other hand, the sequence in knownGeneMrna has 3700 bases
>> (the poly-A tail).
>>
>> So maybe you know where I can find the additional information to
>> generate the exact sequences as in knownGeneMrna, or are they not
>> stored anywhere in the UCSC database?
>>
>> Thanks a lot.
>>
>> Marten
>>
>>> --
>>> Brooke Rhead
>>> UCSC Genome Bioinformatics Group
>>>
>>> On 02/08/11 03:47, Marten Jäger wrote:
>>>> Hi.
>>>>
>>>> Thanks Brooke for your answer and illustrations. With the given
>>>> links I now understand the problem I ran into.
>>>>
>>>> My intention was to reduce data redundancy and run the motif search
>>>> genome wide only on the exons and assemble the data afterwards for
>>>> each known gene, transcript, ...
>>>> As far as I now understand, this is not possible. On the other hand,
>>>> it's not possible to reproduce the exons from knownGeneMrna.txt, since
>>>> in 1/4-1/5 of the data the exon start/end indices (--> length) from
>>>> knownGene.txt do not match, or SNPs could not be considered. Any
>>>> suggestions? Maybe I should abandon the idea of data reduction.
>>>>
>>>> Thanks.
>>>>
>>>> Marten
>>>>
>>>>> Hi Marten,
>>>>>
>>>>> The differences you are seeing are definitely expected.
>>>>>
>>>>> The sequence found at
>>>>> ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/... is the
>>>>> mouse reference genome sequence, and it came from sequencing mouse
>>>>> DNA. The sequence in knownGeneMrna.txt is based on mRNA and protein
>>>>> sequence from several sources (click on the blue "UCSC Genes" link
>>>>> on http://genome.ucsc.edu/cgi-bin/hgTracks to read more about how
>>>>> this file was created). The knownGeneMrna sequence is aligned to
>>>>> the genomic sequence using BLAT. The single base differences are
>>>>> SNPs, and the different exon start/end positions are a result of
>>>>> mRNA sequence not aligning to the genome, for instance, when there
>>>>> is a polyA tail on the mRNA.
>>>>>
>>>>> If you need mRNA sequence, I suggest using the knownGeneMrna.txt
>>>>> sequence rather than the genomic sequence.
>>>>>
>>>>> I hope this is helpful. If you have further questions, please feel
>>>>> free to contact us again at [email protected].
>>>>>
>>>>> --
>>>>> Brooke Rhead
>>>>> UCSC Genome Bioinformatics Group
>>>>>
>>>>> On 02/07/11 05:00, Marten Jäger wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I downloaded the chromosomal sequences
>>>>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/...) and
>>>>>> the database files
>>>>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/) for
>>>>>> knownGene.txt and knownGeneMrna.txt from UCSC. Using the
>>>>>> chromosomal locations of the exons from knownGene.txt, I
>>>>>> extracted the mRNA sequences for the knownGenes and compared them
>>>>>> to the sequences in knownGeneMrna.txt. Unfortunately, about 1/4 of
>>>>>> the sequences differ in single nucleotide mutations
>>>>>>
>>>>>> substitution: uc008wki.1
>>>>>>
>>>>>> ...cctcctAtactggagct...
>>>>>> ...cctcctGtactggagct...
>>>>>>
>>>>>> or different exon start/end positions:
>>>>>>
>>>>>> start: uc008wjb.1
>>>>>>
>>>>>> cggcgtgggactgggagtccgtcc...
>>>>>> gcgtgggactgggagtccgtccgg...
>>>>>>
>>>>>> end: uc008wkk.1
>>>>>>
>>>>>> ...gatttttttaaccataaaaaaaaaaaaaaaaaaaaaaaaaa
>>>>>> ...gatttttttaaccata
>>>>>>
>>>>>> Can anyone please explain these differences and/or give me a hint
>>>>>> which data to use (I'm looking for motifs in the processed mRNA)?
>>>>>>
>>>>>> Many thanks.
>>>>>>
>>>>>> Marten
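
For anyone who wants to inspect the BLAT alignment fields discussed in
this thread (mismatches, inserts, block layout), here is a minimal
Python sketch that parses one kgTargetAli row and summarizes it. It
assumes the column order given in the field list quoted in the thread (a
leading bin column followed by the 21 standard PSL columns). The
function names parse_kg_target_ali and summarize_alignment are made up
for this illustration and are not part of any UCSC tool. The example row
is the uc008wkk.1 entry quoted in the thread.

```python
# Column order assumed from the field list quoted in the thread:
# a leading "bin" column followed by the 21 standard PSL columns.
FIELDS = [
    "bin", "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
    "strand", "qName", "qSize", "qStart", "qEnd",
    "tName", "tSize", "tStart", "tEnd",
    "blockCount", "blockSizes", "qStarts", "tStarts",
]
LIST_FIELDS = {"blockSizes", "qStarts", "tStarts"}
TEXT_FIELDS = {"strand", "qName", "tName"}


def parse_kg_target_ali(line):
    """Parse one whitespace- or tab-separated kgTargetAli row into a dict."""
    cols = line.split()
    if len(cols) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
    row = {}
    for name, value in zip(FIELDS, cols):
        if name in LIST_FIELDS:
            # Comma-separated lists carry a trailing comma; skip empty items.
            row[name] = [int(x) for x in value.split(",") if x]
        elif name in TEXT_FIELDS:
            row[name] = value
        else:
            row[name] = int(value)
    return row


def summarize_alignment(row):
    """Collect the numbers that show how the mRNA aligned to the genome."""
    aligned = sum(row["blockSizes"])
    return {
        "name": row["qName"],
        "aligned_bases": aligned,
        # mRNA bases (according to qSize) that are not in any aligned block
        "unaligned_query_bases": row["qSize"] - aligned,
        "mismatches": row["misMatches"],
        "query_insert_bases": row["qBaseInsert"],
        "target_insert_bases": row["tBaseInsert"],
        "blocks": row["blockCount"],
    }


# The uc008wkk.1 row quoted in the thread, joined onto one line.
EXAMPLE_ROW = (
    "81 3675 0 0 0 0 0 9 128942 - uc008wkk.1 3675 0 3675 chr5 152537259 "
    "8490335 8622952 10 2254,122,158,169,81,90,86,134,116,465, "
    "0,2254,2376,2534,2703,2784,2874,2960,3094,3210, "
    "8490335,8494783,8520858,8528605,8548235,8559411,8569494,8579024,"
    "8603869,8622487,"
)

if __name__ == "__main__":
    summary = summarize_alignment(parse_kg_target_ali(EXAMPLE_ROW))
    for key, value in summary.items():
        print(f"{key}: {value}")
```

Running the same summary over every row of a downloaded kgTargetAli.txt
and comparing qSize with the length of the matching knownGeneMrna
sequence would be one way to list the transcripts whose stored mRNA is
longer than the aligned part.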
