Re: [Genome] Differences in UCSC DB and chr?.fa files

Marten Jäger Wed, 09 Feb 2011 02:40:59 -0800

Hi.

I am told that the given example was a bad choice (since the poly-A tail 
is not encoded in the chromosomal sequence).  Nonetheless there are 
better examples:


uc008wii.1 - kgTargetAli & knownGene assembled exon length: 4509    
knownGeneMrna sequence length: 4529


uc008wjb.1  - kgTargetAli & knownGene assembled exon length: 1208    
knownGeneMrna sequence length: 1210

For both examples there seem to be index errors in the exon start and/or 
stop coordinates...?
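
For reference, here is a minimal sketch of the exon assembly described in this thread, so the indexing can be checked. It assumes the exonStarts/exonEnds values from knownGene.txt are 0-based with an exclusive end, which is the UCSC database convention. If the starts were read as 1-based, every exon would shift by one base, so that is worth ruling out. The function names and the toy sequence in the note below are hypothetical, not part of any UCSC tool.

```python
# Sketch only: assemble a spliced transcript from knownGene-style coordinates.
# Assumption: intervals are 0-based and half-open (start inclusive, end exclusive).

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def parse_coords(field):
    """Turn a comma-terminated list such as '10,20,' into [10, 20]."""
    return [int(x) for x in field.split(",") if x]


def assemble_exons(chrom_seq, exon_starts, exon_ends, strand):
    """Concatenate the exon slices; reverse-complement for '-' strand genes."""
    if len(exon_starts) != len(exon_ends):
        raise ValueError("exonStarts and exonEnds differ in length")
    spliced = "".join(chrom_seq[s:e] for s, e in zip(exon_starts, exon_ends))
    if strand == "-":
        spliced = spliced.translate(_COMPLEMENT)[::-1]
    return spliced
```

With a toy chromosome "AAACCCGGGTTT" and exons [0,3) and [6,9), this yields "AAAGGG" on the plus strand and "CCCTTT" on the minus strand.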

uc008whh.1 - there is a single 't' missing in the knownGeneMrna sequence 
(first exon) in comparison to the chromosomal sequence.

There are a lot of examples where the sequences differ only by SNPs or 
micro-indels.
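
A rough way to sort such cases into substitutions and small indels is sketched below. It uses Python's standard difflib, which is a generic sequence matcher and not a biological aligner, so it is only suitable for short fragments or near-identical sequences. The function name is hypothetical.

```python
from difflib import SequenceMatcher


def summarize_differences(genomic, mrna):
    """Return (tag, genomic_part, mrna_part) for every non-matching stretch.

    tag is one of 'replace', 'delete' (present only in genomic) or
    'insert' (present only in mrna), as reported by difflib.
    """
    matcher = SequenceMatcher(None, genomic, mrna, autojunk=False)
    return [
        (tag, genomic[i1:i2], mrna[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Applied to the uc008wki.1 fragments quoted further down (cctcctAtactggagct against cctcctGtactggagct), this reports a single 'replace' of A by G.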

Okay, the RefSeq and GenBank RNAs were aligned to the chromosomes and I 
assume that the mRNA sequences stored in knownGeneMrna are taken 
from RefSeq/GenBank.
Is there a table where I can find the information from the BLAT alignment 
(mismatches, indels, ...)?
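
To make clear which numbers I am after, here is a small sketch that reads one PSL row and pulls out the mismatch and insert counters. It assumes a leading bin column followed by the 21 standard PSL fields, as in the kgTargetAli row for uc008wkk.1 quoted further down. The helper name parse_psl_row is hypothetical.

```python
PSL_FIELDS = [
    "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
    "strand", "qName", "qSize", "qStart", "qEnd",
    "tName", "tSize", "tStart", "tEnd",
    "blockCount", "blockSizes", "qStarts", "tStarts",
]
LIST_FIELDS = {"blockSizes", "qStarts", "tStarts"}
TEXT_FIELDS = {"strand", "qName", "tName"}

# The kgTargetAli row for uc008wkk.1 as quoted in this thread.
ROW = (
    "81 3675 0 0 0 0 0 9 128942 - uc008wkk.1 3675 0 3675 chr5 152537259 "
    "8490335 8622952 10 2254,122,158,169,81,90,86,134,116,465, "
    "0,2254,2376,2534,2703,2784,2874,2960,3094,3210, "
    "8490335,8494783,8520858,8528605,8548235,8559411,8569494,8579024,"
    "8603869,8622487,"
)


def parse_psl_row(line, has_bin=True):
    """Split one whitespace-separated PSL row into a dict of typed fields."""
    cols = line.split()
    if has_bin:
        cols = cols[1:]  # drop the leading bin column (an assumption)
    if len(cols) != len(PSL_FIELDS):
        raise ValueError(f"expected {len(PSL_FIELDS)} PSL columns, got {len(cols)}")
    rec = {}
    for name, value in zip(PSL_FIELDS, cols):
        if name in LIST_FIELDS:
            rec[name] = [int(x) for x in value.split(",") if x]
        elif name in TEXT_FIELDS:
            rec[name] = value
        else:
            rec[name] = int(value)
    return rec


rec = parse_psl_row(ROW)
# Query bases not covered by the alignment (e.g. an unaligned tail).
unaligned_query = rec["qSize"] - (rec["qEnd"] - rec["qStart"])
```

For this row, misMatches, qNumInsert and qBaseInsert are all 0 and unaligned_query is 0 as well, so the row by itself does not account for the extra bases in knownGeneMrna.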


Marten


> Hi Brooke,
>
>
>> Hi Marten,
>>
>> So, for each known gene, you want to generate a sequence that 
>> consists of only the exons, correct? 
>
> That's correct, I need the mRNA sequence.
>
>> There is not enough information to do it with knownGene.txt, as you 
>> pointed out, because the coordinates listed are only for the genome, 
>> and tell you nothing about the coordinates of the mRNA.
>
> Why not? I can use the strand information and exonStarts/exonEnds 
> chromosomal coordinates to get the exon sequences from chr?.fa for 
> each known gene.
>
>>
>> Instead you could use kgTargetAli.  It gives information about the 
>> alignment of the mRNA to the genome, and it is in psl format:
>> http://genome.ucsc.edu/FAQ/FAQformat.html#format2
>
> I think I can completely reconstruct the data by using the knownGene.txt.
>
> bin           - of no interest
> matches       - this is the sum of knownGene: exonEnds-exonStarts
> misMatches    - this is always '0' at least for mm9,hg19
> repMatches    - ''
> nCount        - ''
> qNumInsert    - ''
> qBaseInsert   - ''
> tNumInsert    - number of introns in between the exons (number of 
> knownGene: exonEnds/exonStarts-1)
> tBaseInsert   - length of the introns (tNumInsert) - difference 
> between knownGene: exonEnds(n) & exonStarts(n+1)
> strand        - knownGene: strand
> qName         - knownGene: name
> qSize         - same as matches
> qStart        - this is always '0' at least for mm9,hg19
> qEnd          - same as matches
> tName         - knownGene: chrom
> tSize         - of no interest
> tStart        - knownGene: txStart
> tEnd          - knownGene: txEnd
> blockCount    - knownGene: exonCount
> blockSizes    - knownGene: exonEnds-exonStarts
> qStarts       - 0, sum(exonEnds(i)-exonStarts(i)) from i= 1 : n-1
> tStarts       - knownGene: exonStarts
>
>
> So you see there is no more information (w/o tSize) stored in the 
> kgTargetAli file than in knownGene.
>
>>
>> You could use the qStart and qEnd fields to get the start and end 
>> positions of the parts of each mRNA that aligned.
>
> As mentioned above, this is the same information I can reconstruct from 
> knownGene. I still have the problem that I can't reconstruct the exact 
> sequence as stored in the knownGeneMrna file.
>
> Coming back to my example 'uc008wkk.1'
>
> The entry in kgTargetAli is:
> 81    3675    0    0    0    0    0    9    128942    -    
> uc008wkk.1    3675    0    3675    chr5    152537259    8490335    
> 8622952    10    2254,122,158,169,81,90,86,134,116,465,    
> 0,2254,2376,2534,2703,2784,2874,2960,3094,3210,    
> 8490335,8494783,8520858,8528605,8548235,8559411,8569494,8579024,8603869,8622487,
>
> I can generate the mRNA sequence using knownGene with a size of 3675 
> bases. On the other hand, the sequence in knownGeneMrna has 3700 bases 
> (the poly-A tail).
>
> So maybe you know where I can find the additional information to 
> generate the exact sequences as in knownGeneMrna or are they not 
> stored somewhere in the UCSC database?
>
>
> Thanks a lot.
>
> Marten
>
>
>
>>
>> -- 
>> Brooke Rhead
>> UCSC Genome Bioinformatics Group
>>
>>
>>
>> On 02/08/11 03:47, Marten Jäger wrote:
>>> Hi.
>>>
>>> Thanks Brooke for your answer and illustrations. With the given 
>>> links I now understand the problem I ran into.
>>>
>>> My intention was to reduce data redundancy and run the motif search 
>>> genome wide only on the exons and assemble the data afterwards for 
>>> each known gene, transcript, ...
>>> As far as I now understand, this is not possible. On the other hand, it's 
>>> not possible to reproduce the exons from knownGeneMrna.txt, since 
>>> the exon start/end indices (--> length) from knownGene.txt do not 
>>> match in 1/4-1/5 of the data, or SNPs cannot be accounted for. Any 
>>> suggestions? Maybe I should abandon the idea of data reduction.
>>>
>>> Thanks.
>>>
>>> Marten
>>>
>>>> Hi Marten,
>>>>
>>>> The differences you are seeing are definitely expected.
>>>>
>>>> The sequence found at 
>>>> ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/... is the 
>>>> mouse reference genome sequence, and it came from sequencing mouse 
>>>> DNA.  The sequence in knownGeneMrna.txt is based on mRNA and protein 
>>>> sequence from several sources (click on the blue "UCSC Genes" link 
>>>> on http://genome.ucsc.edu/cgi-bin/hgTracks to read more about how 
>>>> this file was created).  The knownGeneMrna sequence is aligned to 
>>>> the genomic sequence using BLAT.  The single base differences are 
>>>> SNPs, and the different exon start/end positions are a result of 
>>>> mRNA sequence not aligning to the genome, for instance, when there 
>>>> is a polyA tail on the mRNA.
>>>>
>>>> If you need mRNA sequence, I suggest using the knownGeneMrna.txt 
>>>> sequence rather than the genomic sequence.
>>>>
>>>> I hope this is helpful.  If you have further questions, please feel 
>>>> free to contact us again at [email protected].
>>>>
>>>> -- 
>>>> Brooke Rhead
>>>> UCSC Genome Bioinformatics Group
>>>>
>>>>
>>>>
>>>>
>>>> On 02/07/11 05:00, Marten Jäger wrote:
>>>>> Hi,
>>>>>
>>>>> I downloaded the chromosomal sequences 
>>>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/...) and 
>>>>> the Database files 
>>>>> (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/) for 
>>>>> knownGene.txt and knownGeneMrna.txt from UCSC. Using the 
>>>>> chromosomal locations of the exons from knownGene.txt, I 
>>>>> extracted the mRNA Sequences for the knownGenes and compared them 
>>>>> to the sequences in knownGeneMrna.txt. Unfortunately about 1/4 of 
>>>>> the sequences differ in single nucleotide mutations
>>>>>
>>>>> substitution: uc008wki.1
>>>>>
>>>>> ...cctcctAtactggagct...
>>>>> ...cctcctGtactggagct...
>>>>>
>>>>> or different exon start/end positions:
>>>>>
>>>>> start: uc008wjb.1
>>>>>
>>>>> cggcgtgggactgggagtccgtcc...
>>>>>    gcgtgggactgggagtccgtccgg...
>>>>>
>>>>> end: uc008wkk.1
>>>>>
>>>>> ...gatttttttaaccataaaaaaaaaaaaaaaaaaaaaaaaaa
>>>>> ...gatttttttaaccata
>>>>>
>>>>>
>>>>> Can anyone please explain these differences and/or give me a hint 
>>>>> which data to use (I'm looking for motifs in the processed mRNA).
>>>>>
>>>>> Many Thanks.
>>>>>
>>>>> Marten
>>>>>
>>>>>
>>>
>

-- 
Marten Jäger, Msc Bioinformatik
Charité - Universitätsmedizin Berlin
Campus Virchow Klinikum
Institut für Medizinische Genetik und Humangenetik
Augustenburger Platz 1
13353 Berlin
Germany
phone:  +49/30/450 569135
email:  [email protected]
http://genetik.charite.de/institut/
http://compbio.charite.de

_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
