Re: [Genome] Differences in UCSC DB and chr?.fa files

Brooke Rhead Mon, 14 Feb 2011 13:53:01 -0800

Hi Marten,

It turns out I didn't have the whole story! The problem is that 
knownGeneMrna and kgTargetAli are not what they appear to be.


The knownGeneMrna table contains, for each UCSC Gene, a representative 
mRNA sequence that most closely matches the gene prediction generated by 
the UCSC Genes pipeline. It is chosen after the fact, and we do not 
actually align these to the genome.

The kgTargetAli table is in PSL format, but it is not the result of an 
actual alignment with blat. It is a "fake PSL" that is made from the 
predicted genomic exons. The "query sequences" (which would usually be 
in a fasta file that goes with the PSL) do not actually exist anywhere.
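
To illustrate the idea, here is a minimal sketch of how such a "fake PSL" row could be derived from a gene model alone. This is not the actual UCSC pipeline code. The function name and arguments are hypothetical, and the field mapping simply follows the pattern visible in the kgTargetAli rows quoted further down in this thread (all mismatch and query-insert counts set to 0, blocks taken directly from the exons):

```python
def fake_psl_from_gene_model(name, chrom, strand, exon_starts, exon_ends, chrom_size):
    """Build a PSL-like record from exon coordinates only (no real alignment)."""
    # Each exon becomes one alignment block.
    block_sizes = [end - start for start, end in zip(exon_starts, exon_ends)]
    total = sum(block_sizes)

    # Query starts are just the running sum of the block sizes.
    q_starts, offset = [], 0
    for size in block_sizes:
        q_starts.append(offset)
        offset += size

    # Introns show up as inserts on the target (genome) side.
    gaps = [s - e for s, e in zip(exon_starts[1:], exon_ends[:-1])]
    t_start, t_end = exon_starts[0], exon_ends[-1]

    return {
        "matches": total, "misMatches": 0, "repMatches": 0, "nCount": 0,
        "qNumInsert": 0, "qBaseInsert": 0,
        "tNumInsert": sum(1 for g in gaps if g > 0),
        "tBaseInsert": sum(gaps),
        "strand": strand, "qName": name,
        "qSize": total, "qStart": 0, "qEnd": total,
        "tName": chrom, "tSize": chrom_size,
        "tStart": t_start, "tEnd": t_end,
        "blockCount": len(block_sizes),
        "blockSizes": block_sizes, "qStarts": q_starts,
        "tStarts": list(exon_starts),
    }
```

Because every value comes from the exon coordinates, a row built this way can never report a mismatch or a query insert, whatever the representative mRNA looks like.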

We could generate such a fasta file for you, though. It would consist of 
mRNA predictions based on the genomic sequence and the gene model. If 
you would like that, please let us know.
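
For reference, a prediction of that kind can be sketched in a few lines. This is only an illustration under stated assumptions, not the code we would use: the function name is hypothetical, the chromosome is assumed to be already loaded as a plain string (for example from a chr?.fa file), and coordinates are assumed to be 0-based, half-open, as in knownGene:

```python
# Translation table for complementing DNA, preserving case and N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def predicted_mrna(chrom_seq, strand, exon_starts, exon_ends):
    """Join the exon slices of a gene model into one predicted transcript."""
    # Exons are listed in genomic order, so slice and concatenate them first.
    seq = "".join(chrom_seq[start:end]
                  for start, end in zip(exon_starts, exon_ends))
    # Genes on the minus strand are read from the opposite strand,
    # so the joined sequence has to be reverse-complemented.
    if strand == "-":
        seq = seq.translate(_COMPLEMENT)[::-1]
    return seq
```

A sequence produced this way matches the genome exactly by construction, so it will not contain poly-A tails or any of the single-base differences found in the knownGeneMrna sequences.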

I apologize for not getting you the correct information about these 
tables sooner! They have apparently been a source of much confusion in 
the past, too, and this is something that we would like to deal with 
better in the next UCSC Genes build.

--
Brooke Rhead
UCSC Genome Bioinformatics Group


On 02/10/11 05:26, Marten Jäger wrote:
> Hi.
> 
> You're right. It seems to get more and more confusing.
> 
>> Hi Marten,
>>
>> I think I've somehow made this more confusing than it should be!  Let 
>> me start by answering your most recent questions:
>>
>>> Okay, the RefSeq and GenBank RNAs were aligned to the chromosomes and
>>> I assume that the peptide sequences stored in the knownGeneMrna are
>>> taken from RefSeq/GenBank.
>>
>> Right.  The whole process is described on the UCSC Genes track details 
>> page.  One way to see that is to go to the Table Browser 
>> (http://genome.ucsc.edu/cgi-bin/hgTables), select the UCSC Genes 
>> track, and hit the "describe table schema" button.  You will also be 
>> able to see a list of the tables related to the knownGene table.
> 
> Okay, so Table Browser -> "describe table schema" is the starting point 
> for data research. I read the descriptions and had a look at the 
> referenced tables.
> 
>>
>>> Is there a table where I can find the information from the BLAT
>>> alignment (mismatches, indels, ...)?
>>
>> Yes, the kgTargetAli table, which is in PSL format (and PSL is the 
>> alignment format that is output by BLAT).
> 
> Maybe I am wrong or do not understand the format of this table, but 
> to me it seems that the kgTargetAli table is incomplete or wrong?! For 
> hg19/mm9, the mismatch count for every alignment of the
> RefSeq and GenBank RNAs (is this correct? I assume so from 
> knownGenePipeline step 1.) to the chromosomes is '0', while 
> the match count is exactly the length of the complete joined 
> sequence of all exons.
> 
> 
> 
> 
>>
>> Maybe you can clarify again what it is you are trying to do.  Do you 
>> want chromosomal/genomic sequence for each UCSC Gene?  Or are you 
>> trying to get mRNA sequence?
>>
>> If it is the former, you can do it quite easily with the Table Browser 
>> by selecting the UCSC Genes track, then "output format: sequence," and 
>> then choose "genomic" on the next page.  There are options to retrieve 
>> sequence for only the exons.  (There is no such option for the mRNA or 
>> protein sequence.)
>>
>> Let us know what you are trying to accomplish and what your 
>> outstanding questions are, and I or someone else on the team can try 
>> to help.
> 
> So my intention was to predict sequence motifs on mRNA sequences. To 
> reduce redundancy I assumed it would be good to do this at the exon level, 
> since selections of exons of one gene are reassembled into various 
> transcripts by alternative splicing, and I am especially interested in 
> motifs spanning exon-exon junctions. Therefore I built a database 
> that stores the exon sequences (and their links to the transcripts). To 
> validate my scripts I assembled the transcript sequences, by translating 
> the chromosomal sequence into mRNA, and compared them to those in 
> knownGeneMrna. Here I ran into the problem that 1/4-1/5 of the assembled 
> mRNA sequences do not match the sequences in knownGeneMrna. So I 
> started to check manually where the differences are and ran into various 
> cases (disregarding poly-A tails).
> I asked and you mentioned that the alignments can be found in the 
> kgTargetAli file. Unfortunately I could not find information in the 
> table to clarify these questions.
> 
> To come back to my examples:
> 
> deletion: uc008whh.1
> 
> knownGeneMrna:   ...tttctgtttttttttttttttttttttt-aacctagaatct...
> assembled exons: ...tttctgttttttttttttttttttttttTaacctagaatct...
> 
> I found this line
> 612    2520    0    0    0    0    0    4    5429    -    uc008whh.1    
> 2520    0    2520    chr5    152537259    3639968    3647917    5    
> 1612,184,60,126,538,    0,1612,1796,1856,1982,    
> 3639968,3643557,3644880,3646783,3647379,
> 
> but would expect something like:
> 612    2520    0    0    0    1    1    4    5429    -    uc008whh.1    
> 2520    0    2520    chr5    152537259    3639968    3647917    5    
> 1612,184,60,126,538,    0,1612,1796,1856,1982,    
> 3639968,3643557,3644880,3646783,3647379,
> 
> 
> substitution: uc008wki.1
> 
> knownGeneMrna: ...cctcctAtactggagct...
> assembled exons: ...cctcctGtactggagct...
> 
> 
> kgTargetAli:
> 649    3707    0    0    0    0    0    12    33434    +    
> uc008wki.1    3707    0    3707    chr5    ...
> 
> expect:
> 649    3706    1    0    0    0    0    12    33434    +    
> uc008wki.1    3707    0    3707    chr5    ...
> 
> 
> 
> various:uc008wii.1
> 
> 
> kgTargetAli:
> 9    4509    0    0    0    0    0    14    571956    -    uc008wii.1    
> 4509    0    4509    chr5    ...
> 
> 
> expect:
> 9    4509    4    0    0    4    8    16    571958    -    uc008wii.1    
> 4509    0    4509    chr5    ... qStarts should also start with 13, ...
> 
> 
> alignment:
>  >_                                                 4529 nt vs.
>  >_                                                 4509 nt
> scoring matrix: , gap penalties: -12/-2
> 99.4% identity;        Global alignment score: 17865
> 
>                10        20        30        40        50        60
> 649550 AATTCGGCACGAGCGCCGTTGTCTGCGCTGCGCTGCGCTGCGCTGGACCAGTTTCGCGAA
>                     :::::::::::::::::::::::::::::::::::::::::::::::
> _      -------------CGCCGTTGTCTGCGCTGCGCTGCGCTGCGCTGGACCAGTTTCGCGAA
>                             10        20        30        40
> 
>  ...
> 
>               730       740       750       760       770       780
> 649550 CGTGCACACTGATTTATGTCAGTACATGGAACAGCACCCTGGAGGACTCCATCCAGATAA
>        ::::::::::::::::::::::::::::::  ::::::::::::::::::::::::::::
> _      CGTGCACACTGATTTATGTCAGTACATGGACAAGCACCCTGGAGGACTCCATCCAGATAA
>        710       720       730       740       750       760
> 
> ...
> 
>              1750      1760      1770      1780      1790      1800
> 649550 AAGAACTACGTTACGATTAAGCTTTGCTTACTGCTACATGGCATGTATTCTTTTCCGTCT
>        ::::::::::: ::::::::::::::::::::::::::::::::::::::::::::::::
> _      AAGAACTACGTGACGATTAAGCTTTGCTTACTGCTACATGGCATGTATTCTTTTCCGTCT
>       1730      1740      1750      1760      1770      1780
> 
> ...
> 
>              1990      2000      2010      2020      2030      2040
> 649550 TGTTTTCCCTGAGAGCAGAGTGCATTCTGCAACCTCCAGGGAAGAACATTCTTTTTGCTA
>        :::::::::::::::::: ::::::::::::::::::::: :::::::::::::::::::
> _      TGTTTTCCCTGAGAGCAGGGTGCATTCTGCAACCTCCAGG-AAGAACATTCTTTTTGCTA
>       1970      1980      1990      2000       2010      2020
> 
> ...
> 
>              2470      2480      2490      2500      2510      2520
> 649550 GAAAAAAAAAAATCTGTCTGTCAGGGTAGGTCCTGAATGCAGCCTTGGCTGATTAAAGCT
>        ::::::::::: ::::::::::::::::::::::::::::::::::::::::::::::::
> _      GAAAAAAAAAA-TCTGTCTGTCAGGGTAGGTCCTGAATGCAGCCTTGGCTGATTAAAGCT
>        2450       2460      2470      2480      2490      2500
> 
>              2530      2540      2550      2560      2570      2580
> 649550 TAGAAATCACATTTTATAATTATCCAGACTTTAAAATGTGCTTATTTACGACAAAGGACC
>        :::::::::::::::: :::::::::::::::::::::::::::::::::::::::::::
> _      TAGAAATCACATTTTAAAATTATCCAGACTTTAAAATGTGCTTATTTACGACAAAGGACC
>         2510      2520      2530      2540      2550      2560
> 
>              2590      2600      2610      2620      2630      2640
> 649550 TTTGAATTTAATTCGATGTTCAGAAACATTCCAGGCCGTTCGGAAGGCATCACTGGGTAC
>        :::::     ::::::::::::::::::::::::::::::::::::::::::::::::::
> _      TTTGA-----ATTCGATGTTCAGAAACATTCCAGGCCGTTCGGAAGGCATCACTGGGTAC
>         2570           2580      2590      2600      2610      2620
> 
> ...
> 
>              3490       3500      3510      3520      3530
> 649550 GAAGATTATGTTTGT-TTTCACTAAGTAGAAGTCAGGAGCTCACAGGAATGCTGGGAGGG
>        ::::::::::::::: :::::::::::::::::::::::::::::::::::::::::::
> _      GAAGATTATGTTTGTATTTCACTAAGTAGAAGTCAGGAGCTCACAGGAATGCTGGGAGG-
>              3470      3480      3490      3500      3510
> 
> ...
> 
>     3720      3730      3740      3750      3760      3770
> 649550 TCTGCCCCCACCCCTCCACCCAACACAGTCCCCCTTCTCTGGCTTTTGCTCTCCTGGCCT
>        :::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::
> _      TCTGCCCCCACCCCTCCACCCAACACAGTCCCCCTTCTCTGGCTTT-GCTCTCCTGGCCT
>     3700      3710      3720      3730      3740       3750
> 
> ...
> 
>     4020      4030      4040       4050      4060      4070
> 649550 ATTAAATACAACATCCATGGGACAGGAAA-TGTGTTTGCTATAAAATTAGAGATATAAGG
>        ::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::
> _      ATTAAATACAACATCCATGGGACAGGAAAATGTGTTTGCTATAAAATTAGAGATATAAGG
>      4000      4010      4020      4030      4040      4050
> 
> ...
> 
> 
> 
> Is it correct that small indels and mismatches in the query are not 
> reported by the PSL format?
> 
> 
> Maybe a workaround would be to use the knownGeneMrna sequences. However, 
> this way I would need the start/end positions of the exons in the query 
> sequences/mRNAs, which do not match those in knownGene or kgTargetAli 
> (see uc008wii.1).
> 
> Any suggestions?
> 
> 
> Thanks.
> 
> Marten
> 
> 
> 
>>
>> -- 
>> Brooke Rhead
>> UCSC Genome Bioinformatics Group
>>
>>
>>
>> On 02/09/11 02:39, Marten Jäger wrote:
>>> Hi.
>>>
>>> I am told that the given example was a bad choice (since the poly-A 
>>> tail is not encoded in the chromosomal sequence).  Nonetheless there 
>>> are better examples:
>>>
>>> uc008wii.1 - kgTargetAli & knownGene assembled exon length: 4509    
>>> knownGeneMrna sequence length: 4529
>>>
>>>
>>> uc008wjb.1  - kgTargetAli & knownGene assembled exon length: 1208    
>>> knownGeneMrna sequence length: 1210
>>>
>>> For both examples there seem to be index errors for the exon start 
>>> and/or stop coordinates...?
>>>
>>> uc008whh.1 - there is a single 't' missing in the knownGeneMrna 
>>> sequence (1. exon) in comparison to the chromosomal sequence.
>>>
>>> There are a lot of examples where the sequences only differ in SNPs 
>>> or micro indels.
>>>
>>> Okay, the RefSeq and GenBank RNAs were aligned to the chromosomes and 
>>> I assume that the peptide sequences stored in the knownGeneMrna are 
>>> taken from RefSeq/GenBank.
>>> Is there a table where I can find the information from the BLAT 
>>> alignment (mismatches, indels, ...)?
>>>
>>>
>>> Marten
>>>
>>>
>>>> Hi Brooke,
>>>>
>>>>
>>>>> Hi Marten,
>>>>>
>>>>> So, for each known gene, you want to generate a sequence that 
>>>>> consists of only the exons, correct? 
>>>>
>>>> That's correct, I need the mRNA sequence.
>>>>
>>>>> There is not enough information to do it with knownGene.txt, as you 
>>>>> pointed out, because the coordinates listed are only for the 
>>>>> genome, and tell you nothing about the coordinates of the mRNA.
>>>>
>>>> Why not? I can use the strand information and exonStarts/exonEnds 
>>>> chromosomal coordinates to get the exon sequences from chr?.fa for 
>>>> each known gene.
>>>>
>>>>>
>>>>> Instead you could use kgTargetAli.  It gives information about the 
>>>>> alignment of the mRNA to the genome, and it is in psl format:
>>>>> http://genome.ucsc.edu/FAQ/FAQformat.html#format2
>>>>
>>>> I think I can completely reconstruct the data by using the 
>>>> knownGene.txt.
>>>>
>>>> bin           - of no interest
>>>> matches       - this is the sum of knownGene: exonEnds-exonStarts
>>>> misMatches    - this is always '0' at least for mm9,hg19
>>>> repMatches    - ''
>>>> nCount        - ''
>>>> qNumInsert    - ''
>>>> qBaseInsert   - ''
>>>> tNumInsert    - number of introns in between the exons (number of 
>>>> knownGene: exonEnds/exonStarts-1)
>>>> tBaseInsert   - length of the introns (tNumInsert) - difference 
>>>> between knownGene: exonEnds(n) & exonStarts(n+1)
>>>> strand        - knownGene: strand
>>>> qName         - knownGene: name
>>>> qSize         - same as matches
>>>> qStart        - this is always '0' at least for mm9,hg19
>>>> qEnd          - same as matches
>>>> tName         - knownGene: chrom
>>>> tSize         - of no interest
>>>> tStart        - knownGene: txStart
>>>> tEnd          - knownGene: txEnd
>>>> blockCount    - knownGene: exonCount
>>>> blockSizes    - knownGene: exonEnds-exonStarts
>>>> qStarts       - 0, sum(exonEnds(i)-exonStarts(i)) from i= 1 : n-1
>>>> tStarts       - knownGene: exonStarts
>>>>
>>>>
>>>> So you see there is no more information (w/o tSize) stored in the 
>>>> kgTargetAli file than in knownGene.
>>>>
>>>>>
>>>>> You could use the qStart and qEnd fields to get the start and end 
>>>>> positions of the parts of each mRNA that aligned.
>>>>
>>>> As mentioned above, this is the same information I can reconstruct 
>>>> from knownGene. I still have the problem that I can't reconstruct 
>>>> the exact sequence as stored in the knownGeneMrna file.
>>>>
>>>> Coming back to my example 'uc008wkk.1'
>>>>
>>>> The entry in kgTargetAli is:
>>>> 81    3675    0    0    0    0    0    9    128942    -    
>>>> uc008wkk.1    3675    0    3675    chr5    152537259    8490335    
>>>> 8622952    10    2254,122,158,169,81,90,86,134,116,465,    
>>>> 0,2254,2376,2534,2703,2784,2874,2960,3094,3210,    
>>>> 8490335,8494783,8520858,8528605,8548235,8559411,8569494,8579024,8603869,8622487,
>>>>  
>>>>
>>>>
>>>> I can generate the mRNA sequence using knownGene with a size of 3675 
>>>> bases. On the other hand, the sequence in knownGeneMrna has 3700 
>>>> bases (including the poly-A tail).
>>>>
>>>> So maybe you know where I can find the additional information to 
>>>> generate the exact sequences as in knownGeneMrna or are they not 
>>>> stored somewhere in the UCSC database?
>>>>
>>>>
>>>> Thanks a lot.
>>>>
>>>> Marten
>>>>
>>>>
>>>>
>>>>>
>>>>> -- 
>>>>> Brooke Rhead
>>>>> UCSC Genome Bioinformatics Group
>>>>>
>>>>>
>>>>>
>>>>> On 02/08/11 03:47, Marten Jäger wrote:
>>>>>> Hi.
>>>>>>
>>>>>> Thanks Brooke for your answer and illustrations. With the given 
>>>>>> links I now understand the problem I ran into.
>>>>>>
>>>>>> My intention was to reduce data redundancy and run the motif 
>>>>>> search genome wide only on the exons and assemble the data 
>>>>>> afterwards for each known gene, transcript, ...
>>>>>> As far as I now understand, this is not possible. On the other hand, 
>>>>>> it's not possible to reproduce the exons from knownGeneMrna.txt, 
>>>>>> since for 1/4-1/5 of the data the exon start/end indices (--> length) 
>>>>>> from knownGene.txt do not match, or SNPs cannot be accounted for. 
>>>>>> Any suggestions? Maybe I should abandon the idea of data reduction.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> Marten
>>>>>>
>>>>>>> Hi Marten,
>>>>>>>
>>>>>>> The differences you are seeing are definitely expected.
>>>>>>>
>>>>>>> The sequence found at 
>>>>>>> ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/... is 
>>>>>>> the mouse reference genome sequence, and it came from sequencing 
>>>>>>> mouse DNA.  The sequence in knownGeneMrna.txt is based on mRNA and 
>>>>>>> protein sequence from several sources (click on the blue "UCSC 
>>>>>>> Genes" link on http://genome.ucsc.edu/cgi-bin/hgTracks to read 
>>>>>>> more about how this file was created).  The knownGeneMrna 
>>>>>>> sequence is aligned to the genomic sequence using BLAT.  The 
>>>>>>> single base differences are SNPs, and the different exon 
>>>>>>> start/end positions are a result of mRNA sequence not aligning to 
>>>>>>> the genome, for instance, when there is a polyA tail on the mRNA.
>>>>>>>
>>>>>>> If you need mRNA sequence, I suggest using the knownGeneMrna.txt 
>>>>>>> sequence rather than the genomic sequence.
>>>>>>>
>>>>>>> I hope this is helpful.  If you have further questions, please 
>>>>>>> feel free to contact us again at [email protected].
>>>>>>>
>>>>>>> -- 
>>>>>>> Brooke Rhead
>>>>>>> UCSC Genome Bioinformatics Group
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 02/07/11 05:00, Marten Jäger wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I downloaded the chromosomal sequences 
>>>>>>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/...) 
>>>>>>>> and the Database files 
>>>>>>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/) for 
>>>>>>>> knownGene.txt and knownGeneMrna.txt from UCSC. Using the 
>>>>>>>> chromosomal locations of the exons from knownGene.txt, I 
>>>>>>>> extracted the mRNA sequences for the knownGenes and compared 
>>>>>>>> them to the sequences in knownGeneMrna.txt. Unfortunately about 
>>>>>>>> 1/4 of the sequences differ in single nucleotide mutations
>>>>>>>>
>>>>>>>> substitution: uc008wki.1
>>>>>>>>
>>>>>>>> ...cctcctAtactggagct...
>>>>>>>> ...cctcctGtactggagct...
>>>>>>>>
>>>>>>>> or different exon start/end positions:
>>>>>>>>
>>>>>>>> start: uc008wjb.1
>>>>>>>>
>>>>>>>> cggcgtgggactgggagtccgtcc...
>>>>>>>>    gcgtgggactgggagtccgtccgg...
>>>>>>>>
>>>>>>>> end: uc008wkk.1
>>>>>>>>
>>>>>>>> ...gatttttttaaccataaaaaaaaaaaaaaaaaaaaaaaaaa
>>>>>>>> ...gatttttttaaccata
>>>>>>>>
>>>>>>>>
>>>>>>>> Can anyone please explain these differences and/or give me a 
>>>>>>>> hint which data to use (I'm looking for motifs in the processed 
>>>>>>>> mRNA).
>>>>>>>>
>>>>>>>> Many Thanks.
>>>>>>>>
>>>>>>>> Marten
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>>
> 
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
