Re: [Genome] Differences in UCSC DB and chr?.fa files

Marten Jäger Tue, 08 Feb 2011 23:11:39 -0800

Hi Brooke,


> Hi Marten,
>
> So, for each known gene, you want to generate a sequence that consists 
> of only the exons, correct? 

That's correct, I need the mRNA sequence.

> There is not enough information to do it with knownGene.txt, as you 
> pointed out, because the coordinates listed are only for the genome, 
> and tell you nothing about the coordinates of the mRNA.

Why not? I can use the strand information and exonStarts/exonEnds 
chromosomal coordinates to get the exon sequences from chr?.fa for each 
known gene.
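
A minimal sketch of this approach, assuming the chromosome sequence from chr?.fa is already loaded into a plain string and that knownGene coordinates are 0-based with half-open ends. The function names are hypothetical and only serve as illustration:

```python
# Hypothetical helpers: splice exon sequences out of a chromosome string
# using knownGene-style exonStarts/exonEnds and the strand field.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement, preserving upper/lower case."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_coord_list(field):
    """Parse a comma-terminated knownGene list such as '10,50,' into ints."""
    return [int(x) for x in field.split(",") if x]


def spliced_sequence(chrom_seq, strand, exon_starts, exon_ends):
    """Concatenate the exons in genomic order; reverse-complement for '-'."""
    starts = parse_coord_list(exon_starts)
    ends = parse_coord_list(exon_ends)
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds differ in length")
    # Assumption: 0-based, half-open coordinates map directly to slices.
    seq = "".join(chrom_seq[s:e] for s, e in zip(starts, ends))
    return reverse_complement(seq) if strand == "-" else seq
```

The result is the genomic sequence of the exons, so it will not contain any bases that are only present in the mRNA.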

>
> Instead you could use kgTargetAli.  It gives information about the 
> alignment of the mRNA to the genome, and it is in psl format:
> http://genome.ucsc.edu/FAQ/FAQformat.html#format2

I think I can completely reconstruct the data by using the knownGene.txt.

bin           - of no interest
matches       - sum over all exons of knownGene: exonEnds-exonStarts
misMatches    - always '0', at least for mm9 and hg19
repMatches    - ''
nCount        - ''
qNumInsert    - ''
qBaseInsert   - ''
tNumInsert    - number of introns between the exons (number of
                knownGene: exonStarts entries - 1)
tBaseInsert   - total length of the introns: sum of the differences
                between knownGene: exonEnds(n) & exonStarts(n+1)
strand        - knownGene: strand
qName         - knownGene: name
qSize         - same as matches
qStart        - always '0', at least for mm9 and hg19
qEnd          - same as matches
tName         - knownGene: chrom
tSize         - of no interest
tStart        - knownGene: txStart
tEnd          - knownGene: txEnd
blockCount    - knownGene: exonCount
blockSizes    - knownGene: exonEnds-exonStarts (per exon)
qStarts       - 0, then the running sum of exonEnds(i)-exonStarts(i)
                for i = 1 .. n-1
tStarts       - knownGene: exonStarts


So you see there is no more information (w/o tSize) stored in the 
kgTargetAli file than in knownGene.
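
The mapping above can be sketched as follows. This is only an illustration under the assumptions stated in the list (no mismatches and no inserts in the query, as observed for mm9 and hg19). The function name and the returned dictionary are hypothetical; bin, tSize and the fields copied straight from knownGene (strand, name, chrom) are left out:

```python
# Hypothetical sketch: derive the PSL-style fields of kgTargetAli
# from the txStart, txEnd, exonStarts and exonEnds of a knownGene row.
def psl_fields_from_known_gene(tx_start, tx_end, exon_starts, exon_ends):
    starts = [int(x) for x in exon_starts.split(",") if x]
    ends = [int(x) for x in exon_ends.split(",") if x]
    block_sizes = [e - s for s, e in zip(starts, ends)]

    # qStarts is the running sum of the preceding block sizes.
    q_starts = []
    offset = 0
    for size in block_sizes:
        q_starts.append(offset)
        offset += size

    matches = sum(block_sizes)
    # Intron lengths: gap between the end of exon n and the start of exon n+1.
    introns = [starts[i + 1] - ends[i] for i in range(len(starts) - 1)]
    return {
        "matches": matches,
        "tNumInsert": len(introns),
        "tBaseInsert": sum(introns),
        "qSize": matches,
        "qStart": 0,
        "qEnd": matches,
        "tStart": tx_start,
        "tEnd": tx_end,
        "blockCount": len(starts),
        "blockSizes": block_sizes,
        "qStarts": q_starts,
        "tStarts": starts,
    }
```

For uc008wkk.1, taking the exon ends as tStarts plus blockSizes, this gives matches 3675, tNumInsert 9 and tBaseInsert 128942, which agrees with the kgTargetAli entry quoted further down.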

>
> You could use the qStart and qEnd fields to get the start and end 
> positions of the parts of each mRNA that aligned.

As mentioned above, this is the same information I can reconstruct from 
knownGene. I still have the problem that I can't reconstruct the exact 
sequence as stored in the knownGeneMrna file.

Coming back to my example 'uc008wkk.1':

The entry in kgTargetAli is:
81    3675    0    0    0    0    0    9    128942    -    uc008wkk.1    
3675    0    3675    chr5    152537259    8490335    8622952    10    
2254,122,158,169,81,90,86,134,116,465,    
0,2254,2376,2534,2703,2784,2874,2960,3094,3210,    
8490335,8494783,8520858,8528605,8548235,8559411,8569494,8579024,8603869,8622487,

Using knownGene I can generate an mRNA sequence of 3675 bases. The 
sequence in knownGeneMrna, on the other hand, has 3700 bases (it 
includes the poly-A tail).
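
One way to check whether such a length difference is only a trailing poly-A run is sketched below. It assumes both sequences are available as strings on the same strand, and it deliberately ignores single-base differences (a SNP inside the transcript makes it return None). The function name is hypothetical:

```python
# Hypothetical check: is the knownGeneMrna sequence just the genomic
# exon sequence followed by a run of 'a' bases?
def extra_poly_a_length(mrna_seq, genomic_exon_seq):
    """Return the length of the extra poly-A tail, or None if the
    difference is not explained by a trailing run of 'a' alone."""
    mrna = mrna_seq.lower()
    genomic = genomic_exon_seq.lower()
    if not mrna.startswith(genomic):
        return None
    tail = mrna[len(genomic):]
    # An empty tail counts as a tail of length 0.
    return len(tail) if set(tail) <= {"a"} else None
```

Applied to sequences of 3700 and 3675 bases that differ only by the tail, this would return 25.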

So maybe you know where I can find the additional information needed to 
generate the exact sequences as in knownGeneMrna, or is it not stored 
anywhere in the UCSC database?


Thanks a lot.

Marten



>
> -- 
> Brooke Rhead
> UCSC Genome Bioinformatics Group
>
>
>
> On 02/08/11 03:47, Marten Jäger wrote:
>> Hi.
>>
>> Thanks Brooke for your answer and illustrations. With the given links 
>> I now understand the problem I ran into.
>>
>> My intention was to reduce data redundancy and run the motif search 
>> genome wide only on the exons and assemble the data afterwards for 
>> each known gene, transcript, ...
>> As far as I now understand, this is not possible. On the other hand, 
>> it's not possible to reproduce the exons from knownGeneMrna.txt, 
>> since in 1/4-1/5 of the data the exon start / end indices (--> 
>> length) from knownGene.txt do not match, or SNPs could not be 
>> considered. Any suggestions? Maybe I should abandon the idea of data 
>> reduction.
>>
>> Thanks.
>>
>> Marten
>>
>>> Hi Marten,
>>>
>>> The differences you are seeing are definitely expected.
>>>
>>> The sequence found at 
>>> ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/... is the 
>>> mouse reference genome sequence, and it came from sequencing mouse 
>>> DNA.  The sequence in knownGeneMrna.txt is based on mRNA and protein 
>>> sequence from several sources (click on the blue "UCSC Genes" link 
>>> on http://genome.ucsc.edu/cgi-bin/hgTracks to read more about how 
>>> this file was created).  The knownGeneMrna sequence is aligned to 
>>> the genomic sequence using BLAT.  The single base differences are 
>>> SNPs, and the different exon start/end positions are a result of 
>>> mRNA sequence not aligning to the genome, for instance, when there 
>>> is a polyA tail on the mRNA.
>>>
>>> If you need mRNA sequence, I suggest using the knownGeneMrna.txt 
>>> sequence rather than the genomic sequence.
>>>
>>> I hope this is helpful.  If you have further questions, please feel 
>>> free to contact us again at [email protected].
>>>
>>> -- 
>>> Brooke Rhead
>>> UCSC Genome Bioinformatics Group
>>>
>>>
>>>
>>>
>>> On 02/07/11 05:00, Marten Jäger wrote:
>>>> Hi,
>>>>
>>>> I downloaded the chromosomal sequences 
>>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/...) and 
>>>> the Database files 
>>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/) for 
>>>> knownGene.txt and knownGeneMrna.txt from UCSC. Using the 
>>>> chromosomal locations of the exons from knownGene.txt, I extracted 
>>>> the mRNA sequences for the knownGenes and compared them to the 
>>>> sequences in knownGeneMrna.txt. Unfortunately, about 1/4 of the 
>>>> sequences differ, either by single nucleotide substitutions
>>>>
>>>> substitution: uc008wki.1
>>>>
>>>> ...cctcctAtactggagct...
>>>> ...cctcctGtactggagct...
>>>>
>>>> or different exon start/end positions:
>>>>
>>>> start: uc008wjb.1
>>>>
>>>> cggcgtgggactgggagtccgtcc...
>>>>    gcgtgggactgggagtccgtccgg...
>>>>
>>>> end: uc008wkk.1
>>>>
>>>> ...gatttttttaaccataaaaaaaaaaaaaaaaaaaaaaaaaa
>>>> ...gatttttttaaccata
>>>>
>>>>
>>>> Can anyone please explain these differences and/or give me a hint 
>>>> which data to use (I'm looking for motifs in the processed mRNA).
>>>>
>>>> Many Thanks.
>>>>
>>>> Marten
>>>>
>>>>
>>

-- 
Marten Jäger, Msc Bioinformatik
Charité - Universitätsmedizin Berlin
Campus Virchow Klinikum
Institut für Medizinische Genetik und Humangenetik
Augustenburger Platz 1
13353 Berlin
Germany
phone:  +49/30/450 569135
email:  [email protected]
http://genetik.charite.de/institut/
http://compbio.charite.de

_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
