Re: [Genome] Differences in UCSC DB and chr?.fa files

Brooke Rhead Wed, 09 Feb 2011 17:11:23 -0800

Hi Marten,

I think I've somehow made this more confusing than it should be!  Let me 
start by answering your most recent questions:


> Okay, the RefSeq and GenBank RNAs were aligned to the chromosomes and
> I assume that the peptide sequences stored in the knownGeneMrna are
> taken from RefSeq/GenBank.

Right.  The whole process is described on the UCSC Genes track details 
page.  One way to see that is to go to the Table Browser 
(http://genome.ucsc.edu/cgi-bin/hgTables), select the UCSC Genes track, 
and hit the "describe table schema" button.  You will also be able to 
see a list of the tables related to the knownGene table.

> Is there a table where I can find the information from the BLAT
> alignment (mismatches, indels, ...)?

Yes, the kgTargetAli table, which is in PSL format (and PSL is the 
alignment format that is output by BLAT).
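
As a rough illustration of reading such a row, here is a minimal Python 
sketch. It assumes the downloaded kgTargetAli.txt rows carry a leading 
"bin" column followed by the 21 standard PSL fields, as in the example row 
quoted further down in this thread. The names parse_kg_target_ali and 
EXAMPLE_ROW are made up for this sketch and are not part of any UCSC tool.

```python
# Standard PSL columns, in order (the database table adds a leading "bin").
PSL_FIELDS = [
    "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
    "strand", "qName", "qSize", "qStart", "qEnd",
    "tName", "tSize", "tStart", "tEnd",
    "blockCount", "blockSizes", "qStarts", "tStarts",
]
TEXT_FIELDS = {"strand", "qName", "tName"}
LIST_FIELDS = {"blockSizes", "qStarts", "tStarts"}


def parse_kg_target_ali(line, has_bin=True):
    """Parse one whitespace/tab-separated row into a dict keyed by PSL field."""
    cols = line.split()
    if has_bin:  # assumption: the first column is the "bin" index
        cols = cols[1:]
    if len(cols) != len(PSL_FIELDS):
        raise ValueError("expected %d PSL columns, got %d"
                         % (len(PSL_FIELDS), len(cols)))
    rec = {}
    for name, value in zip(PSL_FIELDS, cols):
        if name in LIST_FIELDS:
            # Comma-separated lists end with a trailing comma.
            rec[name] = [int(x) for x in value.split(",") if x]
        elif name in TEXT_FIELDS:
            rec[name] = value
        else:
            rec[name] = int(value)
    return rec


# The uc008wkk.1 row quoted later in this thread, joined onto one line.
EXAMPLE_ROW = (
    "81 3675 0 0 0 0 0 9 128942 - uc008wkk.1 3675 0 3675 chr5 152537259 "
    "8490335 8622952 10 2254,122,158,169,81,90,86,134,116,465, "
    "0,2254,2376,2534,2703,2784,2874,2960,3094,3210, "
    "8490335,8494783,8520858,8528605,8548235,8559411,8569494,8579024,"
    "8603869,8622487,"
)

rec = parse_kg_target_ali(EXAMPLE_ROW)
aligned_bases = sum(rec["blockSizes"])
```

Comparing aligned_bases with the length of the matching knownGeneMrna 
sequence is one way to spot transcripts where part of the mRNA did not 
align.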

Maybe you can clarify again what it is you are trying to do.  Do you 
want chromosomal/genomic sequence for each UCSC Gene?  Or are you trying 
to get mRNA sequence?

If it is the former, you can do it quite easily with the Table Browser 
by selecting the UCSC Genes track, then "output format: sequence," and 
then choose "genomic" on the next page.  There are options to retrieve 
sequence for only the exons.  (There is no such option for the mRNA or 
protein sequence.)
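
If you would rather do the same thing offline from the chr?.fa files, a 
minimal Python sketch might look like the one below. It relies on knownGene 
coordinates being 0-based and half-open and on exonStarts/exonEnds being 
comma-separated lists with a trailing comma. The function names and the toy 
sequence are invented for this example, and chrom_seq is assumed to be the 
whole chromosome already loaded as one string.

```python
# Complement table that preserves case (chr?.fa soft-masks repeats in lowercase).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_coords(field):
    """Turn a knownGene list such as '100,250,' into [100, 250]."""
    return [int(x) for x in field.split(",") if x]


def spliced_exon_sequence(chrom_seq, exon_starts, exon_ends, strand):
    """Concatenate the exons of one transcript from a chromosome string.

    Coordinates are 0-based, half-open, so Python slices apply directly.
    Minus-strand transcripts are reverse-complemented after joining.
    """
    starts = parse_coords(exon_starts)
    ends = parse_coords(exon_ends)
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds differ in length")
    seq = "".join(chrom_seq[s:e] for s, e in zip(starts, ends))
    return reverse_complement(seq) if strand == "-" else seq


# Toy data only: a 12-base "chromosome" with two 3-base exons.
toy_chrom = "AAACCCGGGTTT"
plus = spliced_exon_sequence(toy_chrom, "0,6,", "3,9,", "+")
minus = spliced_exon_sequence(toy_chrom, "0,6,", "3,9,", "-")
```

The result is purely genomic sequence, so it will not contain poly-A tails 
or any other mRNA bases that did not align to the genome.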

Let us know what you are trying to accomplish and what your outstanding 
questions are, and I or someone else on the team can try to help.

--
Brooke Rhead
UCSC Genome Bioinformatics Group



On 02/09/11 02:39, Marten Jäger wrote:
> Hi.
> 
> I am told that the given example was a bad choice (since the poly-A tail 
> is not encoded in the chromosomal sequence).  Nonetheless there are 
> better examples:
> 
> uc008wii.1 - kgTargetAli & knownGene assembled exon length: 4509    
> knownGeneMrna sequence length: 4529
> 
> 
> uc008wjb.1  - kgTargetAli & knownGene assembled exon length: 1208    
> knownGeneMrna sequence length: 1210
> 
> For both examples there seem to be index errors in the exon start 
> and/or stop coordinates...?
> 
> uc008whh.1 - there is a single 't' missing in the knownGeneMrna sequence 
> (1. exon) in comparison to the chromosomal sequence.
> 
> There are a lot of examples where the sequences only differ in SNPs or 
> micro indels.
> 
> Okay, the RefSeq and GenBank RNAs were aligned to the chromosomes and I 
> assume that the peptide sequences stored in the knownGeneMrna are taken 
> from RefSeq/GenBank.
> Is there a table where I can find the information from the BLAT alignment 
> (mismatches, indels, ...)?
> 
> 
> Marten
> 
> 
>> Hi Brooke,
>>
>>
>>> Hi Marten,
>>>
>>> So, for each known gene, you want to generate a sequence that 
>>> consists of only the exons, correct? 
>>
>> That's correct, I need the mRNA sequence.
>>
>>> There is not enough information to do it with knownGene.txt, as you 
>>> pointed out, because the coordinates listed are only for the genome, 
>>> and tell you nothing about the coordinates of the mRNA.
>>
>> Why not? I can use the strand information and exonStarts/exonEnds 
>> chromosomal coordinates to get the exon sequences from chr?.fa for 
>> each known gene.
>>
>>>
>>> Instead you could use kgTargetAli.  It gives information about the 
>>> alignment of the mRNA to the genome, and it is in psl format:
>>> http://genome.ucsc.edu/FAQ/FAQformat.html#format2
>>
>> I think I can completely reconstruct the data by using the knownGene.txt.
>>
>> bin           - of no interest
>> matches       - this is the sum of knownGene: exonEnds-exonStarts
>> misMatches    - this is always '0' at least for mm9,hg19
>> repMatches    - ''
>> nCount        - ''
>> qNumInsert    - ''
>> qBaseInsert   - ''
>> tNumInsert    - number of introns in between the exons (number of 
>> knownGene: exonEnds/exonStarts-1)
>> tBaseInsert   - length of the introns (tNumInsert) - difference 
>> between knownGene: exonEnds(n) & exonStarts(n+1)
>> strand        - knownGene: strand
>> qName         - knownGene: name
>> qSize         - same as matches
>> qStart        - this is always '0' at least for mm9,hg19
>> qEnd          - same as matches
>> tName         - knownGene: chrom
>> tSize         - of no interest
>> tStart        - knownGene: txStart
>> tEnd          - knownGene: txEnd
>> blockCount    - knownGene: exonCount
>> blockSizes    - knownGene: exonEnds-exonStarts
>> qStarts       - 0, sum(exonEnds(i)-exonStarts(i)) from i= 1 : n-1
>> tStarts       - knownGene: exonStarts
>>
>>
>> So you see there is no more information (w/o tSize) stored in the 
>> kgTargetAli file than in knownGene.
>>
>>>
>>> You could use the qStart and qEnd fields to get the start and end 
>>> positions of the parts of each mRNA that aligned.
>>
>> As mentioned above, this is the same information I can reconstruct from 
>> knownGene. I still have the problem that I can't reconstruct the exact 
>> sequence as stored in the knownGeneMrna file.
>>
>> Coming back to my example 'uc008wkk.1'
>>
>> The entry in kgTargetAli is:
>> 81    3675    0    0    0    0    0    9    128942    -    
>> uc008wkk.1    3675    0    3675    chr5    152537259    8490335    
>> 8622952    10    2254,122,158,169,81,90,86,134,116,465,    
>> 0,2254,2376,2534,2703,2784,2874,2960,3094,3210,    
>> 8490335,8494783,8520858,8528605,8548235,8559411,8569494,8579024,8603869,8622487,
>>  
>>
>>
>> I can generate the mRNA sequence using knownGene with a size of 3675 
>> bases. On the other hand, the sequence in knownGeneMrna has 3700 bases 
>> (it includes the poly-A tail).
>>
>> So maybe you know where I can find the additional information to 
>> generate the exact sequences as in knownGeneMrna or are they not 
>> stored somewhere in the UCSC database?
>>
>>
>> Thanks a lot.
>>
>> Marten
>>
>>
>>
>>>
>>> -- 
>>> Brooke Rhead
>>> UCSC Genome Bioinformatics Group
>>>
>>>
>>>
>>> On 02/08/11 03:47, Marten Jäger wrote:
>>>> Hi.
>>>>
>>>> Thanks Brooke for your answer and illustrations. With the given 
>>>> links I now understand the problem I ran into.
>>>>
>>>> My intention was to reduce data redundancy and run the motif search 
>>>> genome wide only on the exons and assemble the data afterwards for 
>>>> each known gene, transcript, ...
>>>> As far as I now understand, this is not possible. On the other hand, 
>>>> it is not possible to reproduce the exons from knownGeneMrna.txt, since 
>>>> in 1/4-1/5 of the data the exon start/end indices (--> length) from 
>>>> knownGene.txt do not match, or SNPs cannot be accounted for. Any 
>>>> suggestions? Maybe I should abandon the idea of data reduction.
>>>>
>>>> Thanks.
>>>>
>>>> Marten
>>>>
>>>>> Hi Marten,
>>>>>
>>>>> The differences you are seeing are definitely expected.
>>>>>
>>>>> The sequence found at 
>>>>> ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/... is the 
>>>>> mouse reference genome sequence, and it came from sequencing mouse 
>>>>> DNA.  The sequence in knownGeneMrna.txt is based on mRNA and protein 
>>>>> sequence from several sources (click on the blue "UCSC Genes" link 
>>>>> on http://genome.ucsc.edu/cgi-bin/hgTracks to read more about how 
>>>>> this file was created).  The knownGeneMrna sequence is aligned to 
>>>>> the genomic sequence using BLAT.  The single base differences are 
>>>>> SNPs, and the different exon start/end positions are a result of 
>>>>> mRNA sequence not aligning to the genome, for instance, when there 
>>>>> is a polyA tail on the mRNA.
>>>>>
>>>>> If you need mRNA sequence, I suggest using the knownGeneMrna.txt 
>>>>> sequence rather than the genomic sequence.
>>>>>
>>>>> I hope this is helpful.  If you have further questions, please feel 
>>>>> free to contact us again at [email protected].
>>>>>
>>>>> -- 
>>>>> Brooke Rhead
>>>>> UCSC Genome Bioinformatics Group
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 02/07/11 05:00, Marten Jäger wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I downloaded the chromosomal sequences 
>>>>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/...) and 
>>>>>> the Database files 
>>>>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/) for 
>>>>>> knownGene.txt and knownGeneMrna.txt from UCSC. Using the 
>>>>>> chromosomal locations for the exons using knownGene.txt I 
>>>>>> extracted the mRNA Sequences for the knownGenes and compared them 
>>>>>> to the sequences in knownGeneMrna.txt. Unfortunately about 1/4 of 
>>>>>> the sequences differ in single nucleotide mutations
>>>>>>
>>>>>> substitution: uc008wki.1
>>>>>>
>>>>>> ...cctcctAtactggagct...
>>>>>> ...cctcctGtactggagct...
>>>>>>
>>>>>> or different exon start/end positions:
>>>>>>
>>>>>> start: uc008wjb.1
>>>>>>
>>>>>> cggcgtgggactgggagtccgtcc...
>>>>>>    gcgtgggactgggagtccgtccgg...
>>>>>>
>>>>>> end: uc008wkk.1
>>>>>>
>>>>>> ...gatttttttaaccataaaaaaaaaaaaaaaaaaaaaaaaaa
>>>>>> ...gatttttttaaccata
>>>>>>
>>>>>>
>>>>>> Can anyone please explain these differences and/or give me a hint 
>>>>>> which data to use (I'm looking for motifs in the processed mRNA).
>>>>>>
>>>>>> Many Thanks.
>>>>>>
>>>>>> Marten
>>>>>>
>>>>>>
>>>>
>>
> 
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
