Hi Pavel, I'm sorry for the delayed response.
Last week, when I first tried to retrieve the rn4 Known Genes sequences via the table browser, the results I got were much smaller than expected (as you had observed). When I looked at the results, which I had saved to a file, I realized that the file looked truncated: the results were alphabetized and stopped after the first few genes starting with the letter 'n'. About two hours later, I tried again and got results with a number of rows much closer to what I expected. I had expected about 8202 sequences (I think your number of 8203 includes the field header line, because there are only 8202 records in the rn4 knownGene table), and my results contained 8124 sequences, including a sequence for the example you provided in your email, NM_022920.

So, I think that you may have been using the table browser at a very busy time. I believe that trying again at a different time would probably solve the issue.

That said, the mRNA sequences for the rn4 Known Genes track are in a table called knownGeneMrna (which has 8124 rows). Therefore, an easier way to get all the sequences from the Known Genes track is to download the knownGeneMrna.txt.gz file from our download server. You'll find the file here:

http://hgdownload.cse.ucsc.edu/goldenPath/rn4/database/

To understand why there are 8124 sequences rather than 8202 (the number of rows in the knownGene table), we first need to look at how the rn4 Known Genes track was created. From the Methods section of the track description (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=rn4&g=knownGene):

"UniProt protein sequences (including alternative splicing isoforms) and mRNA sequences from RefSeq and GenBank were aligned against the base genome using BLAT. RefSeq alignments having a base identity level within 0.1% of the best and at least 96% base identity with the genomic sequence were kept.
GenBank mRNA alignments having a base identity level within 0.2% of the best and at least 97% base identity with the genomic sequence were kept. Protein alignments having a base identity level within 0.2% of the best and at least 80% base identity with the genomic sequence were kept."

Because some of the sequences aligned to the genome with a high base identity in multiple locations, multiple positions for these genes were retained in the track. For these genes, there is a single mRNA sequence and two (or more) locations/records in the knownGene table. So, 8124 sequences were mapped to the genome, resulting in 8202 locations in the Known Genes track.

Please contact the mail list ([email protected]) again if you have any further questions.

Katrina Learned
UCSC Genome Bioinformatics Group

Pavel Morozov wrote, On 05/06/11 08:57:
> Hi,
> I am trying to get the knownGenes for the rn4 using table browser. When I
> download "all fields from selected table" I am getting table
> with 8203 rows and when I am asking for mRNA sequences I am getting just
> 3791sequences.
> I am trying to look what is missed and it looks like missed sequences are
> bona fide genes and indeed should have mRNA.
> For example NM_022920 has 12 exons and should be presented, moreover I can
> get this mRNA when I do it from the genomic browser.
>
> Cay you help with this?
>
> Thank you,
> Pavel.
> --------------------------------------------------------
> Pavel Morozov, Ph.D.
> HHMI and Rockefeller University
> Tuschl Lab.
> _______________________________________________
> Genome maillist [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
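
If you would like to check the relationship between the 8202 knownGene rows and the 8124 knownGeneMrna sequences yourself, something along the lines of the sketch below may help. It is only an illustration, not an official UCSC tool: it assumes you have downloaded both knownGene.txt.gz and knownGeneMrna.txt.gz from the download directory given in the reply, that both dumps are tab-separated with no header line, and that the first column of each holds the gene/mRNA identifier. The function names (first_column, summarize, read_gz) are made up for this example.

```python
import gzip
from collections import Counter


def first_column(lines):
    """Yield the first tab-separated field of each non-empty line.

    Assumption: the identifier (e.g. NM_022920) is the first column.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line:
            yield line.split("\t", 1)[0]


def summarize(known_gene_lines, mrna_lines):
    """Compare knownGene rows (one per genomic location) with mRNA rows."""
    locations = Counter(first_column(known_gene_lines))
    sequences = set(first_column(mrna_lines))
    return {
        # Total rows in knownGene, i.e. genomic locations.
        "knownGene_rows": sum(locations.values()),
        # Distinct identifiers among those rows.
        "distinct_names": len(locations),
        # Distinct identifiers that have an mRNA sequence.
        "mrna_sequences": len(sequences),
        # Identifiers placed at more than one genomic location.
        "multi_location": sorted(n for n, c in locations.items() if c > 1),
        # Identifiers in knownGene with no row in knownGeneMrna.
        "missing_sequence": sorted(set(locations) - sequences),
    }


def read_gz(path):
    """Read a locally downloaded .txt.gz table dump as a list of text lines."""
    with gzip.open(path, "rt") as handle:
        return handle.readlines()
```

With both files in the current directory, summarize(read_gz("knownGene.txt.gz"), read_gz("knownGeneMrna.txt.gz")) returns the counts in a dictionary. If the explanation in the reply holds, knownGene_rows should be larger than mrna_sequences, and the identifiers listed under multi_location should account for the difference.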
