Re: [Genome] Differences in UCSC DB and chr?.fa files

Marten Jäger Thu, 10 Feb 2011 05:28:21 -0800

Hi.

You're right. It seems to get more and more confusing.


> Hi Marten,
>
> I think I've somehow made this more confusing than it should be!  Let 
> me start by answering your most recent questions:
>
>> Okay, the RefSeq and GenBank RNAs were aligned to the chromosomes and
>> I assume that the peptide sequences stored in the knownGeneMrna are
>> taken from RefSeq/GenBank.
>
> Right.  The whole process is described on the UCSC Genes track details 
> page.  One way to see that is to go to the Table Browser 
> (http://genome.ucsc.edu/cgi-bin/hgTables), select the UCSC Genes 
> track, and hit the "describe table schema" button.  You will also be 
> able to see a list of the tables related to the knownGene table.

Okay, so Table Browser -> "describe table schema" is the starting point 
for researching the data. I read the descriptions and had a look at the 
referenced tables.

>
>> Is there a table where I can find the information from the BLAT
>> alignment (mismatches, indels, ...)?
>
> Yes, the kgTargetAli table, which is in PSL format (and PSL is the 
> alignment format that is output by BLAT).

Maybe I am wrong or do not understand the format of this table, but to 
me the kgTargetAli table seems incomplete or wrong. For hg19/mm9, the 
mismatch count is '0' in every alignment of the RefSeq and GenBank RNAs 
to the chromosomes (is this correct? I assume so from knownGenePipeline 
step 1). On the other hand, the match count is always exactly the 
length of the joined sequence of all exons.




>
> Maybe you can clarify again what it is you are trying to do.  Do you 
> want chromosomal/genomic sequence for each UCSC Gene?  Or are you 
> trying to get mRNA sequence?
>
> If it is the former, you can do it quite easily with the Table Browser 
> by selecting the UCSC Genes track, then "output format: sequence," and 
> then choose "genomic" on the next page.  There are options to retrieve 
> sequence for only the exons.  (There is no such option for the mRNA or 
> protein sequence.)
>
> Let us know what you are trying to accomplish and what your 
> outstanding questions are, and I or someone else on the team can try 
> to help.

My intention was to predict sequence motifs on mRNA sequences. To 
reduce redundancy I assumed it would be good to do this at the exon 
level, since selections of a gene's exons are reassembled into various 
transcripts by alternative splicing, and I am especially interested in 
motifs spanning exon-exon junctions. Therefore I built a database 
that stores the exon sequences (and their links to the transcripts). To 
validate my scripts I assembled the transcript sequences by converting 
the chromosomal exon sequences into mRNA, and compared them to those in 
knownGeneMrna. Here I ran into the problem that 1/5 to 1/4 of the 
assembled mRNA sequences do not match the sequences in knownGeneMrna. 
So I started to check manually where the differences are and found 
various kinds of cases (disregarding poly-A tails).
I asked, and you mentioned that the alignments can be found in the 
kgTargetAli table. Unfortunately I could not find any information in 
that table that explains these differences.
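
The assembly step described above can be sketched roughly as follows 
(Python; a minimal sketch, not the actual scripts). It assumes the 
knownGene convention of 0-based exon starts and exclusive exon ends; 
the function names and the toy chromosome are made up for illustration.

```python
# Sketch only: assemble a transcript from genomic exon coordinates.
# Assumes 0-based exonStarts and exclusive exonEnds, as in knownGene.

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def assemble_transcript(chrom_seq, exon_starts, exon_ends, strand):
    """Join the exons of one transcript in transcript orientation."""
    exons = [chrom_seq[s:e] for s, e in zip(exon_starts, exon_ends)]
    joined = "".join(exons)
    # Exons are listed in genomic order; minus-strand transcripts are
    # obtained by reverse-complementing the joined sequence.
    return reverse_complement(joined) if strand == "-" else joined


# Toy chromosome (hypothetical data, not taken from mm9).
toy_chrom = "AAACCCGGGTTTACGT"
plus = assemble_transcript(toy_chrom, [0, 6], [3, 9], "+")
minus = assemble_transcript(toy_chrom, [0, 6], [3, 9], "-")
print(plus, minus)
```

A sequence assembled this way contains only genomic bases, so it cannot 
reproduce a poly-A tail or any base that differs in the stored mRNA.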

To come back to my examples:

deletion: uc008whh.1

knownGeneMrna:   ...tttctgtttttttttttttttttttttt-aacctagaatct...
assembled exons: ...tttctgttttttttttttttttttttttTaacctagaatct...

I found this line
612    2520    0    0    0    0    0    4    5429    -    uc008whh.1    
2520    0    2520    chr5    152537259    3639968    3647917    5    
1612,184,60,126,538,    0,1612,1796,1856,1982,    
3639968,3643557,3644880,3646783,3647379,

but would expect something like:
612    2520    0    0    0    1    1    4    5429    -    uc008whh.1    
2520    0    2520    chr5    152537259    3639968    3647917    5    
1612,184,60,126,538,    0,1612,1796,1856,1982,    
3639968,3643557,3644880,3646783,3647379,
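
To show why the row I found carries no alignment detail, here is a 
minimal sketch (Python) that parses it and recomputes the summary 
fields from the block coordinates alone. The column names follow the 
PSL description plus the leading 'bin' column of the database table; 
the helper names are made up for illustration.

```python
# Sketch only: does a kgTargetAli row say more than "all exon bases match"?

PSL_COLUMNS = [
    "bin", "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert", "strand",
    "qName", "qSize", "qStart", "qEnd", "tName", "tSize", "tStart", "tEnd",
    "blockCount", "blockSizes", "qStarts", "tStarts",
]
TEXT_COLUMNS = {"strand", "qName", "tName"}
LIST_COLUMNS = {"blockSizes", "qStarts", "tStarts"}


def parse_psl_row(line):
    """Split one whitespace-separated PSL row into typed fields."""
    row = {}
    for key, value in zip(PSL_COLUMNS, line.split()):
        if key in LIST_COLUMNS:
            row[key] = [int(x) for x in value.rstrip(",").split(",")]
        elif key in TEXT_COLUMNS:
            row[key] = value
        else:
            row[key] = int(value)
    return row


def summarize_blocks(row):
    """Recompute summary fields from block sizes and target starts only."""
    sizes, t_starts = row["blockSizes"], row["tStarts"]
    gaps = [t_starts[i + 1] - (t_starts[i] + sizes[i])
            for i in range(len(sizes) - 1)]
    return {
        "aligned_bases": sum(sizes),
        "tNumInsert": sum(1 for g in gaps if g > 0),
        "tBaseInsert": sum(gaps),
        # Query starts if the query had no insertions between blocks.
        "qStarts_if_ungapped": [sum(sizes[:i]) for i in range(len(sizes))],
    }


uc008whh_row = (
    "612 2520 0 0 0 0 0 4 5429 - uc008whh.1 2520 0 2520 chr5 152537259 "
    "3639968 3647917 5 1612,184,60,126,538, 0,1612,1796,1856,1982, "
    "3639968,3643557,3644880,3646783,3647379,"
)
row = parse_psl_row(uc008whh_row)
summary = summarize_blocks(row)
print(summary)
```

For this row every recomputed value agrees with the stored one, which 
is what I mean by saying the row adds nothing beyond the exon 
coordinates.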


substitution: uc008wki.1

knownGeneMrna: ...cctcctAtactggagct...
assembled exons: ...cctcctGtactggagct...


kgTargetAli:
649    3707    0    0    0    0    0    12    33434    +    
uc008wki.1    3707    0    3707    chr5    ...

expect:
649    3706    1    0    0    0    0    12    33434    +    
uc008wki.1    3707    0    3707    chr5    ...



various: uc008wii.1


kgTargetAli:
9    4509    0    0    0    0    0    14    571956    -    uc008wii.1    
4509    0    4509    chr5    ...


expect:
9    4509    4    0    0    4    8    16    571958    -    uc008wii.1    
4509    0    4509    chr5    ... qStarts should also start with 13, ...


alignment:
 >_                                                 4529 nt vs.
 >_                                                 4509 nt
scoring matrix: , gap penalties: -12/-2
99.4% identity;        Global alignment score: 17865

                10        20        30        40        50        60
649550 AATTCGGCACGAGCGCCGTTGTCTGCGCTGCGCTGCGCTGCGCTGGACCAGTTTCGCGAA
                     :::::::::::::::::::::::::::::::::::::::::::::::
_      -------------CGCCGTTGTCTGCGCTGCGCTGCGCTGCGCTGGACCAGTTTCGCGAA
                             10        20        30        40

  ...

               730       740       750       760       770       780
649550 CGTGCACACTGATTTATGTCAGTACATGGAACAGCACCCTGGAGGACTCCATCCAGATAA
        ::::::::::::::::::::::::::::::  ::::::::::::::::::::::::::::
_      CGTGCACACTGATTTATGTCAGTACATGGACAAGCACCCTGGAGGACTCCATCCAGATAA
        710       720       730       740       750       760

...

              1750      1760      1770      1780      1790      1800
649550 AAGAACTACGTTACGATTAAGCTTTGCTTACTGCTACATGGCATGTATTCTTTTCCGTCT
        ::::::::::: ::::::::::::::::::::::::::::::::::::::::::::::::
_      AAGAACTACGTGACGATTAAGCTTTGCTTACTGCTACATGGCATGTATTCTTTTCCGTCT
       1730      1740      1750      1760      1770      1780

...

              1990      2000      2010      2020      2030      2040
649550 TGTTTTCCCTGAGAGCAGAGTGCATTCTGCAACCTCCAGGGAAGAACATTCTTTTTGCTA
        :::::::::::::::::: ::::::::::::::::::::: :::::::::::::::::::
_      TGTTTTCCCTGAGAGCAGGGTGCATTCTGCAACCTCCAGG-AAGAACATTCTTTTTGCTA
       1970      1980      1990      2000       2010      2020

...

              2470      2480      2490      2500      2510      2520
649550 GAAAAAAAAAAATCTGTCTGTCAGGGTAGGTCCTGAATGCAGCCTTGGCTGATTAAAGCT
        ::::::::::: ::::::::::::::::::::::::::::::::::::::::::::::::
_      GAAAAAAAAAA-TCTGTCTGTCAGGGTAGGTCCTGAATGCAGCCTTGGCTGATTAAAGCT
        2450       2460      2470      2480      2490      2500

              2530      2540      2550      2560      2570      2580
649550 TAGAAATCACATTTTATAATTATCCAGACTTTAAAATGTGCTTATTTACGACAAAGGACC
        :::::::::::::::: :::::::::::::::::::::::::::::::::::::::::::
_      TAGAAATCACATTTTAAAATTATCCAGACTTTAAAATGTGCTTATTTACGACAAAGGACC
         2510      2520      2530      2540      2550      2560

              2590      2600      2610      2620      2630      2640
649550 TTTGAATTTAATTCGATGTTCAGAAACATTCCAGGCCGTTCGGAAGGCATCACTGGGTAC
        :::::     ::::::::::::::::::::::::::::::::::::::::::::::::::
_      TTTGA-----ATTCGATGTTCAGAAACATTCCAGGCCGTTCGGAAGGCATCACTGGGTAC
         2570           2580      2590      2600      2610      2620

...

              3490       3500      3510      3520      3530
649550 GAAGATTATGTTTGT-TTTCACTAAGTAGAAGTCAGGAGCTCACAGGAATGCTGGGAGGG
        ::::::::::::::: :::::::::::::::::::::::::::::::::::::::::::
_      GAAGATTATGTTTGTATTTCACTAAGTAGAAGTCAGGAGCTCACAGGAATGCTGGGAGG-
              3470      3480      3490      3500      3510

...

     3720      3730      3740      3750      3760      3770
649550 TCTGCCCCCACCCCTCCACCCAACACAGTCCCCCTTCTCTGGCTTTTGCTCTCCTGGCCT
        :::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::
_      TCTGCCCCCACCCCTCCACCCAACACAGTCCCCCTTCTCTGGCTTT-GCTCTCCTGGCCT
     3700      3710      3720      3730      3740       3750

...

     4020      4030      4040       4050      4060      4070
649550 ATTAAATACAACATCCATGGGACAGGAAA-TGTGTTTGCTATAAAATTAGAGATATAAGG
        ::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::
_      ATTAAATACAACATCCATGGGACAGGAAAATGTGTTTGCTATAAAATTAGAGATATAAGG
      4000      4010      4020      4030      4040      4050

...



Is it correct that small indels and mismatches in the query are not 
reported by the PSL format?


Maybe a workaround would be to use the knownGeneMrna sequences. However, 
this way I would need the start/end positions of the exons in the query 
sequences/mRNAs, which do not match those in knownGene or kgTargetAli 
(see uc008wii.1).
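
One way such a workaround might look, as a rough sketch (Python): 
compare the assembled exon sequence with the stored mRNA and carry the 
exon start offsets over to mRNA coordinates. This uses difflib from the 
standard library as a stand-in; it is a generic text matcher, not a 
biological aligner, so a proper pairwise alignment would be preferable 
for real data. The function name and toy sequences are made up.

```python
# Sketch only: project exon boundaries from the assembled sequence
# onto the mRNA, using difflib as a stand-in for a real aligner.
from difflib import SequenceMatcher


def project_positions(assembled, mrna, positions):
    """Map 0-based positions in `assembled` to positions in `mrna`."""
    matcher = SequenceMatcher(None, assembled, mrna, autojunk=False)
    opcodes = matcher.get_opcodes()
    projected = []
    for pos in positions:
        mapped = len(mrna)  # used when pos == len(assembled)
        for tag, a1, a2, b1, b2 in opcodes:
            if a1 <= pos < a2:
                # In an 'equal' run the offset carries over directly; in a
                # 'replace' or 'delete' run it is clamped to the run's end.
                mapped = min(b1 + (pos - a1), b2)
                break
        projected.append(mapped)
    return projected


# Toy data (hypothetical): two exons of 8 bases each; the mRNA has two
# extra leading bases and lacks one 'T' of the first exon.
toy_assembled = "ACGTTGCA" + "GGATCCTA"
toy_mrna = "TT" + "ACGTGCA" + "GGATCCTA"
exon_starts_in_mrna = project_positions(toy_assembled, toy_mrna, [0, 8])
print(exon_starts_in_mrna)
```

In the toy case the first exon starts after the two extra leading 
bases, and the second exon starts one base earlier than its assembled 
offset would suggest because of the missing 'T'.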

Any suggestions?


Thanks.

Marten



>
> -- 
> Brooke Rhead
> UCSC Genome Bioinformatics Group
>
>
>
> On 02/09/11 02:39, Marten Jäger wrote:
>> Hi.
>>
>> I am told that the given example was a bad choice (since the poly-A 
>> tail is not encoded in the chromosomal sequence).  Nonetheless there 
>> are better examples:
>>
>> uc008wii.1 - kgTargetAli & knownGene assembled exon length: 4509    
>> knownGeneMrna sequence length: 4529
>>
>>
>> uc008wjb.1  - kgTargetAli & knownGene assembled exon length: 1208    
>> knownGeneMrna sequence length: 1210
>>
>> For both examples there seem to be index errors in the exon start 
>> and/or stop coordinates...?
>>
>> uc008whh.1 - there is a single 't' missing in the knownGeneMrna 
>> sequence (1. exon) in comparison to the chromosomal sequence.
>>
>> There are a lot of examples where the sequences only differ in SNPs 
>> or micro indels.
>>
>> Okay, the RefSeq and GenBank RNAs were aligned to the chromosomes and 
>> I assume that the peptide sequences stored in the knownGeneMrna are 
>> taken from RefSeq/GenBank.
>> Is there a table where I can find the information from the BLAT 
>> alignment (mismatches, indels, ...)?
>>
>>
>> Marten
>>
>>
>>> Hi Brooke,
>>>
>>>
>>>> Hi Marten,
>>>>
>>>> So, for each known gene, you want to generate a sequence that 
>>>> consists of only the exons, correct? 
>>>
>>> That's correct, I need the mRNA sequence.
>>>
>>>> There is not enough information to do it with knownGene.txt, as you 
>>>> pointed out, because the coordinates listed are only for the 
>>>> genome, and tell you nothing about the coordinates of the mRNA.
>>>
>>> Why not? I can use the strand information and exonStarts/exonEnds 
>>> chromosomal coordinates to get the exon sequences from chr?.fa for 
>>> each known gene.
>>>
>>>>
>>>> Instead you could use kgTargetAli.  It gives information about the 
>>>> alignment of the mRNA to the genome, and it is in psl format:
>>>> http://genome.ucsc.edu/FAQ/FAQformat.html#format2
>>>
>>> I think I can completely reconstruct the data by using the 
>>> knownGene.txt.
>>>
>>> bin           - of no interest
>>> matches       - this is the sum of knownGene: exonEnds-exonStarts
>>> misMatches    - this is always '0' at least for mm9,hg19
>>> repMatches    - ''
>>> nCount        - ''
>>> qNumInsert    - ''
>>> qBaseInsert   - ''
>>> tNumInsert    - number of introns in between the exons (number of 
>>> knownGene: exonEnds/exonStarts-1)
>>> tBaseInsert   - length of the introns (tNumInsert) - difference 
>>> between knownGene: exonEnds(n) & exonStarts(n+1)
>>> strand        - knownGene: strand
>>> qName         - knownGene: name
>>> qSize         - same as matches
>>> qStart        -this is always '0' at least for mm9,hg19
>>> qEnd          - same as matches
>>> tName         - knownGene: chrom
>>> tSize         - of no interest
>>> tStart        - knownGene: txStart
>>> tEnd          - knownGene: txEnd
>>> blockCount    - knownGene: exonCount
>>> blockSizes    -knownGene: exonEnds-exonStarts
>>> qStarts       - 0, sum(exonEnds(i)-exonStarts(i)) from i= 1 : n-1
>>> tStarts       - knownGene: exonStarts
>>>
>>>
>>> So you see there is no more information (w/o tSize) stored in the 
>>> kgTargetAli file than in knownGene.
>>>
>>>>
>>>> You could use the qStart and qEnd fields to get the start and end 
>>>> positions of the parts of each mRNA that aligned.
>>>
>>> As mentioned above, this is the same information I can reconstruct 
>>> from knownGene. I still have the problem that I can't reconstruct 
>>> the exact sequence as stored in the knownGeneMrna file.
>>>
>>> Coming back to my example 'uc008wkk.1'
>>>
>>> The entry in kgTargetAli is:
>>> 81    3675    0    0    0    0    0    9    128942    -    
>>> uc008wkk.1    3675    0    3675    chr5    152537259    8490335    
>>> 8622952    10    2254,122,158,169,81,90,86,134,116,465,    
>>> 0,2254,2376,2534,2703,2784,2874,2960,3094,3210,    
>>> 8490335,8494783,8520858,8528605,8548235,8559411,8569494,8579024,8603869,8622487,
>>>  
>>>
>>>
>>> I can generate the mRNA sequence using knownGene with a size of 3675 
>>> bases. On the other hand the sequences in knownGeneMrna has 3700 
>>> bases (the poly-A tail).
>>>
>>> So maybe you know where I can find the additional information to 
>>> generate the exact sequences as in knownGeneMrna or are they not 
>>> stored somewhere in the UCSC database?
>>>
>>>
>>> Thanks a lot.
>>>
>>> Marten
>>>
>>>
>>>
>>>>
>>>> -- 
>>>> Brooke Rhead
>>>> UCSC Genome Bioinformatics Group
>>>>
>>>>
>>>>
>>>> On 02/08/11 03:47, Marten Jäger wrote:
>>>>> Hi.
>>>>>
>>>>> Thanks Brooke for your answer and illustrations. With the given 
>>>>> links I now understand the problem I ran into.
>>>>>
>>>>> My intention was to reduce data redundancy and run the motif 
>>>>> search genome wide only on the exons and assemble the data 
>>>>> afterwards for each known gene, transcript, ...
>>>>> As far as I now understand, this is not possible. On the other 
>>>>> hand, it's not possible to reproduce the exons from 
>>>>> knownGeneMrna.txt, since in 1/5 to 1/4 of the data the exon 
>>>>> start / end indices (--> lengths) from knownGene.txt do not match, 
>>>>> or SNPs cannot be taken into account. 
>>>>> Any suggestions? Maybe I should abandon the idea of data reduction.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Marten
>>>>>
>>>>>> Hi Marten,
>>>>>>
>>>>>> The differences you are seeing are definitely expected.
>>>>>>
>>>>>> The sequence found at 
>>>>>> ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/... is 
>>>>>> the mouse reference genome sequence, and it came from sequencing 
>>>>>> mouse DNA.  The sequence in knownGeneMrna.txt is based on mRNA and 
>>>>>> protein sequence from several sources (click on the blue "UCSC 
>>>>>> Genes" link on http://genome.ucsc.edu/cgi-bin/hgTracks to read 
>>>>>> more about how this file was created).  The knownGeneMrna 
>>>>>> sequence is aligned to the genomic sequence using BLAT.  The 
>>>>>> single base differences are SNPs, and the different exon 
>>>>>> start/end positions are a result of mRNA sequence not aligning to 
>>>>>> the genome, for instance, when there is a polyA tail on the mRNA.
>>>>>>
>>>>>> If you need mRNA sequence, I suggest using the knownGeneMrna.txt 
>>>>>> sequence rather than the genomic sequence.
>>>>>>
>>>>>> I hope this is helpful.  If you have further questions, please 
>>>>>> feel free to contact us again at [email protected].
>>>>>>
>>>>>> -- 
>>>>>> Brooke Rhead
>>>>>> UCSC Genome Bioinformatics Group
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 02/07/11 05:00, Marten Jäger wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I downloaded the chromosomal sequences 
>>>>>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/...) 
>>>>>>> and the Database files 
>>>>>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/) for 
>>>>>>> knownGene.txt and knownGeneMrna.txt from UCSC. Using the 
>>>>>>> chromosomal locations of the exons from knownGene.txt, I 
>>>>>>> extracted the mRNA sequences for the knownGenes and compared 
>>>>>>> them to the sequences in knownGeneMrna.txt. Unfortunately about 
>>>>>>> 1/4 of the sequences differ, either in single nucleotides:
>>>>>>>
>>>>>>> substitution: uc008wki.1
>>>>>>>
>>>>>>> ...cctcctAtactggagct...
>>>>>>> ...cctcctGtactggagct...
>>>>>>>
>>>>>>> or different exon start/end positions:
>>>>>>>
>>>>>>> start: uc008wjb.1
>>>>>>>
>>>>>>> cggcgtgggactgggagtccgtcc...
>>>>>>>    gcgtgggactgggagtccgtccgg...
>>>>>>>
>>>>>>> end: uc008wkk.1
>>>>>>>
>>>>>>> ...gatttttttaaccataaaaaaaaaaaaaaaaaaaaaaaaaa
>>>>>>> ...gatttttttaaccata
>>>>>>>
>>>>>>>
>>>>>>> Can anyone please explain these differences and/or give me a 
>>>>>>> hint which data to use (I'm looking for motifs in the processed 
>>>>>>> mRNA).
>>>>>>>
>>>>>>> Many Thanks.
>>>>>>>
>>>>>>> Marten
>>>>>>>
>>>>>>>
>>>>>
>>>
>>

-- 
Marten Jäger, Msc Bioinformatik
Charité - Universitätsmedizin Berlin
Campus Virchow Klinikum
Institut für Medizinische Genetik und Humangenetik
Augustenburger Platz 1
13353 Berlin
Germany
phone:  +49/30/450 569135
email:  [email protected]
http://genetik.charite.de/institut/
http://compbio.charite.de

_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
