Re: [Genome] Discrepancies in number between table and sequences for rn4

Katrina Learned Tue, 10 May 2011 10:54:35 -0700

Hi Pavel,

I'm sorry for the delayed response.


Last week, when I first tried to retrieve the rn4 Known Genes sequences 
via the table browser, the results I got were much smaller than expected 
(as you had observed). When I looked at the results, which I had saved 
to a file, I realized that the file looked truncated; the results were 
alphabetized and seemed truncated after the first few genes starting 
with the letter 'n.' About two hours later, I tried again and was able 
to get results with a number of rows much closer to what I expected. I 
had expected about 8202 sequences (I think your count of 8203 includes 
the field header line, because there are only 8202 records in the rn4 
knownGene table), and my results contained 8124 sequences, including a 
sequence for the example you provided in your email, NM_022920. So I 
think you may have been using the Table Browser at a very busy time. 
Trying again at a different time will probably solve the issue.
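
If it helps to check whether a saved result is complete, the number of 
sequences can be counted directly from the file. Here is a minimal 
sketch in Python, assuming the sequences were saved in FASTA format 
(one header line starting with '>' per sequence); the file name in the 
usage comment is hypothetical.

```python
def count_fasta_records(lines):
    """Return the number of FASTA header lines (lines starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


# Usage with a hypothetical file name:
# with open("rn4_knownGene_mrna.fa") as handle:
#     print(count_fasta_records(handle))
```

A count far below 8124 would suggest that the download was cut short.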

That said, the mRNA sequences for the rn4 Known Genes track are in a 
table called knownGeneMrna (which has 8124 rows). Therefore, an easier 
way to get all the sequences for the Known Genes track is to download 
the knownGeneMrna.txt.gz file from our download server. You'll find the 
file here: http://hgdownload.cse.ucsc.edu/goldenPath/rn4/database/
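
Once downloaded, the table dump can be turned into FASTA with a few 
lines of Python. This is only a sketch: it assumes each row of 
knownGeneMrna.txt.gz holds two tab-separated fields, the name followed 
by the sequence, and the function name and output line width are my own 
choices for illustration.

```python
import gzip


def table_rows_to_fasta(lines, width=60):
    """Yield FASTA lines from rows of the form 'name<TAB>sequence'.

    Assumes two tab-separated columns per row (an assumption about the
    dump layout). Sequences are wrapped at `width` characters.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name, seq = line.split("\t", 1)
        yield f">{name}"
        for start in range(0, len(seq), width):
            yield seq[start:start + width]


# Usage, assuming the file was saved locally under its original name:
# with gzip.open("knownGeneMrna.txt.gz", "rt") as handle:
#     for fasta_line in table_rows_to_fasta(handle):
#         print(fasta_line)
```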

To help understand why there are 8124 sequences rather than 8202 (the 
number of rows in the knownGene table), we first need to understand more 
about how the rn4 Known Genes track was created. From the Methods 
section of the track description 
(http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=rn4&g=knownGene):

"UniProt protein sequences (including alternative splicing isoforms) and 
mRNA sequences from RefSeq and GenBank were aligned against the base 
genome using BLAT. RefSeq alignments having a base identity level within 
0.1% of the best and at least 96% base identity with the genomic 
sequence were kept. GenBank mRNA alignments having a base identity level 
within 0.2% of the best and at least 97% base identity with the genomic 
sequence were kept. Protein alignments having a base identity level 
within 0.2% of the best and at least 80% base identity with the genomic 
sequence were kept."

Because some of the sequences aligned to the genome with a high base 
identity in multiple locations, multiple positions for these genes were 
retained in the track. For these genes, there is a single mRNA sequence 
and two (or more) locations/records in the knownGene table. So, 8124 
sequences were mapped to the genome, resulting in 8202 locations (78 
more records than sequences) in the Known Genes track.
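
To see which genes have more than one location, the rows in the 
knownGene table dump can be grouped by name. A minimal sketch, assuming 
the first tab-separated column of knownGene.txt.gz is the gene name and 
that a name appearing in several rows corresponds to a sequence kept at 
several positions; the function names are hypothetical.

```python
import gzip
from collections import Counter


def count_rows_per_name(lines):
    """Count rows per value of the first tab-separated column."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and any header/comment line
        counts[line.split("\t", 1)[0]] += 1
    return counts


def summarize(counts):
    """Return total rows, distinct names, and names seen more than once."""
    return {
        "rows": sum(counts.values()),
        "distinct_names": len(counts),
        "multi_location_names": sum(1 for n in counts.values() if n > 1),
    }


# Usage, assuming the file was saved locally under its original name:
# with gzip.open("knownGene.txt.gz", "rt") as handle:
#     print(summarize(count_rows_per_name(handle)))
```

If the explanation above holds, "rows" should come out as 8202 and 
"distinct_names" as 8124.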

Please contact the mail list ([email protected]) again if you have any 
further questions.

Katrina Learned
UCSC Genome Bioinformatics Group


Pavel Morozov wrote, On 05/06/11 08:57:
> Hi,
> I am trying to get the knownGenes for the rn4 using table browser. When I 
> download "all fields from selected table" I am getting table
> with 8203 rows and when I am asking for mRNA sequences I am getting just 
> 3791sequences.
> I am trying to look what is missed and it looks like missed sequences are 
> bona fide genes and indeed should have mRNA.
> For example NM_022920 has 12 exons and should be presented, moreover I can 
> get this mRNA when I do it from the genomic browser.
>
> Cay you help with this?
>
> Thank you,
> Pavel.
> --------------------------------------------------------
> Pavel Morozov, Ph.D.
> HHMI and Rockefeller University
> Tuschl Lab.
> _______________________________________________
> Genome maillist  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>    
