Hello Dr. Neubauer,

Unfortunately, we do not have statistics of this nature already generated, although some other statistics are available for certain assemblies here:
http://genome.ucsc.edu/goldenPath/stats.html
Another source of pre-computed statistics is the original source (NCBI). A link to the release notes for a genome assembly is in the Credits section of the Gateway page (http://genome.ucsc.edu -> Genome Browser -> choose assembly -> scroll to bottom of page). For many data types, counts from UCSC annotation will not match those available from NCBI. This is expected and is due to the many ways these statistics can be generated. There is no one "right" answer.

To add these up yourself, we suggest that you use the Table browser. There are likely as many ways to add up these statistics as there are researchers, since the criteria change depending on what the stats are being used for, but we can offer a place to start.

I am going to assume that you are generating statistics for human to keep this simple. If you need statistics for other assemblies, some modifications may be necessary (use different tracks), and some stats may not be available at all due to the level of annotation present in the assembly's track set. I am also going to assume that you need to use web tools (not flat text files or MySQL). If you are able to use those tools, you can adapt the basic logic of the queries below. And finally, you may need to stay with hg18 (not hg19) in order to capture all of the statistics, as hg19 does not have all of the same annotation tracks, or you can create a mixed set of statistics.

Table browser: http://genome.ucsc.edu/cgi-bin/hgTables
User's guide: http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html

We hope this helps you to get started. We encourage you to explore the complete track set for each assembly, in addition to those suggested, and to tune the queries as you see fit.

Thank you for your patience while we developed a reply,
Jennifer

1) protein coding genes

Choose a Gene and Gene prediction track (UCSC Genes would be a good choice).
Since this track includes coding and non-coding transcripts, you will need to do two things:
a) filter for the presence of a coding region
b) collapse transcripts by cluster (gene)

Gene and Gene Prediction tracks -> UCSC Genes -> UCSC Genes Based on RefSeq, UniProt, GenBank, CCDS and Comparative Genomics

Basic query path (hg18 or hg19)
i) In the Table browser, set the target assembly and the UCSC Genes track, and make sure region = genome.
ii) Use the filter function. In the primary table for this track (knownGene), non-coding genes have cdsStart equal to cdsEnd. To exclude them, enter the reverse condition into the free-form query box, without the quotes: "cdsStart != cdsEnd"
iii) Set output = "selected fields from primary and related tables", name the file, and click on "get output".
iv) On the next form, check the table "knownIsoforms" and click on "Allow selection from Checked Tables". The fields of this table will then be listed so that you can select specific ones. You can add any you want, but the clusterID (which can be interpreted as a gene) should be included.
v) Click on "get output" under the top table.
vi) The downloaded file will include all transcripts with a defined coding region. Cut out the clusterID column and count its unique values to determine the number of "protein coding genes". Specifically, this is "the number of genes that contain one or more transcripts with a defined coding region".

Note: You may want to send the output to Galaxy to cut out the cluster column and count its unique values, or use your own tools on the downloaded file.

2) size of regulatory regions

This question can be answered many ways, depending on the prediction method(s) you trust. Look under the track group "Regulation" and review the tracks. The track ORegAnno may be a good source to start with, as it is literature based. But this also means that it may fail to capture all "potential regions" that some of the more experimental tracks may suggest.
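If you prefer your own tools for the counting step in 1) vi) above, a minimal Python sketch could look like the following. It assumes the Table browser output is tab-separated text whose first line is a header (possibly starting with "#") and that one column name ends in "clusterId"; the function name and the demo rows are made up for illustration and are not real UCSC data.

```python
import io


def count_unique_clusters(handle, column_hint="clusterid"):
    """Count distinct cluster IDs in tab-separated Table browser output.

    Assumption: the first line is a header and exactly the column we want
    has a name ending in `column_hint` (compared case-insensitively).
    """
    header = handle.readline().rstrip("\n").lstrip("#").split("\t")
    names = [name.strip().lower() for name in header]
    matches = [i for i, name in enumerate(names) if name.endswith(column_hint)]
    if not matches:
        raise ValueError("no column ending in %r found in header" % column_hint)
    col = matches[0]

    clusters = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Skip blank or short lines rather than failing on them.
        if len(fields) > col and fields[col]:
            clusters.add(fields[col])
    return len(clusters)


# Made-up demo rows: three coding transcripts that fall into two clusters.
demo = (
    "#hg18.knownGene.name\thg18.knownIsoforms.clusterId\n"
    "transcriptA\t1\n"
    "transcriptB\t1\n"
    "transcriptC\t2\n"
)
demo_count = count_unique_clusters(io.StringIO(demo))
print(demo_count)
```

For a real download, open the saved file and pass the file handle in place of the `io.StringIO` demo. The result is only as good as the filter used in step ii).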
Regulation -> ORegAnno - Regulatory elements from ORegAnno

Basic query path (hg18)
i) In the Table browser, set up your assembly and track of interest as in #1 above.
ii) Download the data as-is to use in Excel (for example), or send it to Galaxy (check the "Galaxy" box in the Table browser when defining the output destination). Once the data is in Galaxy, calculate the length of each track entry (using the text manipulation tools). Summarize the length data using statistical tools/methods as desired, after downloading. Galaxy also offers some options for this type of summary.
iii) The file format is important to understand. Compare the track methods with the Table browser file type (use the "describe table schema" button) and with our FAQ: http://genome.ucsc.edu/FAQ/FAQformat.html

3) number of miRNA genes

There are a few tracks that contain miRNA data, but the track Gene Prediction -> sno/miRNA is a good initial choice.

Basic query path (hg18)
i) In the Table browser, set up your assembly and track of interest as in #1 above.
ii) Use a filter as in #1 on the primary table, wgRna: in the list of table fields, set "type" = "miRNA" (do not include the quotes "" when you type/paste in the boxes).
iii) Output the file and add up the lines (occurrences of miRNA annotated regions).

Special note: these are individual annotations, not correlated to gene bounds (clusters) from UCSC Genes. You may want to save the data from this query as a custom track and perform an intersection with the UCSC Genes track, either to attempt to collapse by gene bound or to see if there is any overlap. I will leave exactly how to accomplish this up to you. Keep in mind that sending data over to Galaxy, or downloading and using your own tools to merge/manipulate the data, is likely to be required. The Table browser cannot do all calculations.

4) number of alternative transcripts

This can be easy to add up or more complicated. The easy way is to simply note the number of transcripts (rows) in the knownGene table.
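Going back to the length calculation in 2) step ii) above: if you download the ORegAnno data in BED format instead of using Galaxy, a rough Python sketch of the per-entry length and summary step might look like this. It assumes BED-style columns (chrom, chromStart, chromEnd, with a 0-based start and an end that is not included, so length = chromEnd - chromStart). The function names and demo rows are hypothetical, and overlapping entries are counted separately here rather than merged.

```python
import io
import statistics


def bed_lengths(handle):
    """Return the length of every entry in BED-formatted text."""
    lengths = []
    for line in handle:
        line = line.strip()
        # Skip blank lines, comments, and custom-track header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        start, end = int(fields[1]), int(fields[2])
        lengths.append(end - start)
    return lengths


def summarize(lengths):
    """Basic summary of entry lengths; overlapping entries are not merged."""
    return {
        "count": len(lengths),
        "total_bases": sum(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }


# Made-up demo entries, not real ORegAnno records.
demo = (
    "track name=demo\n"
    "chr1\t100\t150\tdemo1\n"
    "chr1\t400\t500\tdemo2\n"
    "chr2\t10\t40\tdemo3\n"
)
demo_summary = summarize(bed_lengths(io.StringIO(demo)))
print(demo_summary)
```

Whether "total_bases" is the right notion of "size" depends on your criteria. If entries overlap, you may prefer to merge them first so that shared bases are counted only once.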
Gene and Gene Prediction tracks -> UCSC Genes -> UCSC Genes Based on RefSeq, UniProt, GenBank, CCDS and Comparative Genomics

Basic query path (hg18 or hg19)
i) For the simple count, do #1, step i), and with the default table for UCSC Genes selected, click on "describe table schema".
ii) The number of items in the table is the number of variant transcripts contained in the track.

The "Alt Events" track would also be an option if you wanted a breakdown by event type. The basic query path is similar to those above.

Gene and Gene Prediction tracks -> Alt Events -> Alternative Splicing, Alternative Promoter and Similar Events in UCSC Genes

---------------------------------
Jennifer Jackson
UCSC Genome Informatics Group
http://genome.ucsc.edu/

On 5/12/10 11:43 AM, Raymond Neubauer wrote:
> Dear Folks,
>
> I am looking for summary information per species on 1) total number of
> protein coding genes in its genome, 2) size of regulatory regions, 3) number
> of miRNA genes, 4) number of alternative transcripts
>
> How can I get any or all of this summary information from your data set?
>
> Sincerely,
>
> Dr. Raymond Neubauer
>
> **************************************
>
> Dr. Raymond L. Neubauer
> Senior Lecturer
> Biological Sciences
>
> Office: Painter 1.06B
> Spring, 2010 Office Hours:
> By Appointment - send an e-mail
>
> 512/471-4741
>
> Campus Mail:
> Bio. Labs
> A6700
>
> Regular Mail:
> Molecular and Developmental Biology
> Bio. Labs Room 311
> A6700
> University of Texas
> Austin, Texas 78712
> *************************************
>
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
