Re: [Genome] Gene counts

Jennifer Jackson Mon, 17 May 2010 12:01:22 -0700

Hello Dr. Neubauer,

Unfortunately we do not have statistics of this nature already 
generated, although there are some other statistics available for 
certain assemblies here:
http://genome.ucsc.edu/goldenPath/stats.html

Another source of pre-computed statistics is the original source (NCBI).
A link to the release notes for a genome assembly is in the Credits
section of the Gateway page (http://genome.ucsc.edu -> Genome Browser ->
choose assembly -> scroll to bottom of page). For many data types,
annotation counts from UCSC will not match those available from NCBI.
This is expected and is due to the many ways these statistics can be
generated. There is no one "right" answer.

To add these up yourself, we suggest that you use the Table browser.
There are likely as many ways to add up these statistics as there are
researchers, as criteria will change depending on what the stats are
being used for, but we can offer a place to start. I am going to assume
that you are generating statistics for human to keep this simple. If you
need statistics for other assemblies, some modifications may be
necessary (use different tracks) and some stats may not be available at
all due to the level of annotation present in the assembly's track set.
I am also going to assume that you need to use web tools (not flat text
files or mySQL). If you are able to use those tools, you can adapt the
basic logic of the queries below. And finally, you may need to stay with
hg18 (not hg19) in order to capture all of the statistics, as hg19 does
not have all of the same annotation tracks, or you can create a mixed
set of statistics.

Table browser: http://genome.ucsc.edu/cgi-bin/hgTables
User's guide: http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html

We hope this helps you to get started. We encourage you to explore the 
complete track set for each assembly, in addition to those suggested, 
and to tune the queries as you see fit.

Thank you for your patience while we developed a reply,
Jennifer

1) protein coding genes

Choose a Gene and Gene prediction track (UCSC Genes would be a good
choice). Since this track includes coding and non-coding transcripts,
you will need to do two things: a) filter for the presence of a coding
region and b) collapse transcripts by cluster (gene).

Gene and Gene Prediction tracks -> UCSC Genes -> UCSC Genes Based on 
RefSeq, UniProt, GenBank, CCDS and Comparative Genomics

Basic query path (hg18 or hg19)
i) In Table browser, set to target assembly and the UCSC Genes track, 
and make sure region = genome.
ii) Use the filter function. In the primary table for this track
(knownGene), non-coding genes have cdsStart equal to cdsEnd. To exclude
them, enter the reverse condition into the free-form mySQL query box,
without the quotes: "cdsStart != cdsEnd"
iii) set output = select fields from primary and related tables, name 
file, and click on "get output"
iv) on the next form, check the table "knownIsoforms" and click on
"Allow selection from Checked Tables". This table will now come up and
you can capture specific fields. You can add any you want, but the
clusterID (which can be interpreted as a gene) should be included.
v) click on "get output" under the top table
vi) the file downloaded will include all transcripts with a defined
coding region. Cut out the clusterID and count it uniquely to determine
the number of "protein coding genes". Specifically, "the number of genes
that contain one or more transcripts that have a defined coding region".

Note: You may want to send the data output to Galaxy to cut out the 
cluster column and count it up uniquely. Or stick with using your own 
tools with the downloaded file.
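
For anyone using their own tools, here is a minimal Python sketch of step
vi). It assumes the downloaded file is tab-separated, that its first line
is a header beginning with "#", and that one header field ends in
"clusterId"; check your own file, as the exact layout is an assumption
here. The function name and the sample rows are made up for illustration.

```python
def count_unique_clusters(lines, column_suffix="clusterid"):
    """Return the number of distinct cluster IDs in tab-separated lines."""
    cluster_col = None
    clusters = set()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if cluster_col is None:
            # Assumption: the first non-empty line is the header, e.g.
            # "#hg18.knownGene.name<TAB>hg18.knownIsoforms.clusterId".
            names = [f.lstrip("#").lower() for f in fields]
            cluster_col = next(
                i for i, n in enumerate(names) if n.endswith(column_suffix)
            )
            continue
        # Values from a joined table may carry a trailing comma;
        # strip it defensively so "1," and "1" count as the same cluster.
        clusters.add(fields[cluster_col].rstrip(","))
    return len(clusters)


# Illustrative rows only, not real Table browser output.
sample = [
    "#hg18.knownGene.name\thg18.knownIsoforms.clusterId",
    "uc001aaa.2\t1",
    "uc009vip.1\t1",
    "uc001aab.2\t2",
]
print(count_unique_clusters(sample))  # prints 2
```

The function accepts any iterable of lines, so an open file handle for the
downloaded file can be passed in place of the sample list.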

2) size of regulatory regions

This question can be answered many ways, depending on the prediction 
method(s) you trust. Look under the track group "Regulation" and review 
the tracks. The track ORegAnno may be a good source to start with, as it 
is literature based. But this also means that it may fail to capture all 
"potential regions" that some of the more experimental tracks may suggest.

Regulation -> ORegAnno - Regulatory elements from ORegAnno

Basic query path (hg18)
i) In Table browser, set up for your assembly and track of interest as 
for #1 above
ii) Download data as-is to use Excel (for example) or send to Galaxy 
(check "Galaxy" box in Table browser when defining output destination). 
Once data is in Galaxy, calculate the length of each track entry (using 
text manipulation tools). Summarize length data using statistical 
tools/methods as desired, after downloading. Galaxy also offers some
options for this type of summary.
iii) The file format is important to understand. Compare the track's
methods description with the Table browser file type (use the "describe
table schema" button) and with our FAQ:
http://genome.ucsc.edu/FAQ/FAQformat.html
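
If you download the data instead of using Galaxy, the length calculation
in step ii) could look something like the Python sketch below. It assumes
BED-style output (tab-separated, with chrom, start, end as the first three
columns, and length = end - start); the function names and the sample
rows are made up for illustration.

```python
import statistics


def region_lengths(lines):
    """Return the length (end - start) of each BED-style entry."""
    lengths = []
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        start, end = int(fields[1]), int(fields[2])
        lengths.append(end - start)
    return lengths


def summarize(lengths):
    """Basic summary statistics for a non-empty list of lengths."""
    return {
        "count": len(lengths),
        "total_bp": sum(lengths),
        "mean_bp": statistics.mean(lengths),
        "median_bp": statistics.median(lengths),
    }


# Illustrative rows only, not real ORegAnno entries.
sample = [
    "chr1\t100\t250\texample1",
    "chr1\t400\t500\texample2",
    "chr2\t1000\t1050\texample3",
]
print(summarize(region_lengths(sample)))
```

Note that the total here simply adds up every entry, so overlapping
entries are counted more than once. Whether that is what you want depends
on how you define "size of regulatory regions".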

3) number of miRNA genes

There are a few tracks that contain this type of data, but the track
Gene Prediction -> sno/miRNA is a good initial choice.

Basic query path (hg18)
i) In Table browser, set up for your assembly and track of interest as 
for #1 above
ii) use a filter as in #1 on the primary table, wgRna, using the list of
table fields and setting "type" = "miRNA" (do not use quotes "" when you
type/paste in the boxes).
iii) output to a file and count the lines (each line is one occurrence
of a miRNA annotated region).
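
As a cross-check on the filter, you could also download the table
unfiltered and tally the "type" column yourself. The Python sketch below
assumes tab-separated output with a header line beginning with "#" and a
field named "type" (possibly prefixed with database and table names); the
function name and sample rows are made up for illustration.

```python
from collections import Counter


def count_by_type(lines, type_column="type"):
    """Tally rows by the value in the 'type' column."""
    type_idx = None
    counts = Counter()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if type_idx is None:
            # Assumption: first non-empty line is the header. Drop the
            # leading "#" and any "db.table." prefix from field names.
            names = [f.lstrip("#").split(".")[-1] for f in fields]
            type_idx = names.index(type_column)
            continue
        counts[fields[type_idx]] += 1
    return counts


# Illustrative rows only, not real annotations.
sample = [
    "#chrom\tchromStart\tchromEnd\tname\ttype",
    "chr1\t100\t180\texample-mir-1\tmiRNA",
    "chr1\t900\t990\texample-sno-1\tCDBox",
    "chr2\t10\t90\texample-mir-2\tmiRNA",
]
print(count_by_type(sample)["miRNA"])  # prints 2
```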

Special note: these are individual annotations, not correlated to gene
bounds (clusters) from UCSC Genes. You may want to save the data from
this query as a custom track and perform an intersection with the UCSC
Genes track, either to attempt to collapse by gene bound or to see if
there is any overlap. I will leave exactly how to accomplish this up to
you. Keep
in mind that sending data over to Galaxy or downloading and using your 
own tools to merge/manipulate the data is likely to be required. The 
Table browser cannot do all calculations.
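
For those using their own tools, one possible way to check for overlap is
sketched below in Python. This is only a starting point under stated
assumptions: both data sets have already been parsed into tuples of
(chrom, start, end, id), coordinates are 0-based half-open as in BED, and
strand is ignored. The function name and the sample intervals are made up
for illustration.

```python
from collections import defaultdict


def overlaps_by_gene(mirnas, genes):
    """Map each miRNA name to the sorted gene IDs it overlaps.

    Both inputs are iterables of (chrom, start, end, id) tuples.
    """
    genes_by_chrom = defaultdict(list)
    for chrom, start, end, gene_id in genes:
        genes_by_chrom[chrom].append((start, end, gene_id))

    result = {}
    for chrom, start, end, name in mirnas:
        # Two half-open intervals overlap when each starts before
        # the other ends.
        hits = {
            gene_id
            for g_start, g_end, gene_id in genes_by_chrom.get(chrom, [])
            if start < g_end and g_start < end
        }
        result[name] = sorted(hits)
    return result


# Illustrative intervals only.
mirnas = [
    ("chr1", 100, 180, "mir-a"),
    ("chr1", 5000, 5080, "mir-b"),
    ("chr2", 10, 90, "mir-c"),
]
genes = [
    ("chr1", 50, 1000, "1"),
    ("chr1", 150, 300, "2"),
    ("chr2", 90, 500, "3"),
]
print(overlaps_by_gene(mirnas, genes))
```

This compares every miRNA against every gene interval on the same
chromosome, which is simple but slow for large inputs. Galaxy or a
dedicated interval tool would be a better fit at genome scale.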

4) number of alternative transcripts

This can be easy to add up or more complicated. The easy way is to 
simply note the number of transcripts (rows) in the knownGene table.

Gene and Gene Prediction tracks -> UCSC Genes -> UCSC Genes Based on 
RefSeq, UniProt, GenBank, CCDS and Comparative Genomics

Basic query path (hg18 or hg19)
i) The simple count can be found by doing #1, step i), and with the 
default table for UCSC Genes selected, click on "describe table schema".
ii) the number of items in the table is the number of variant 
transcripts contained in the track

The "Alt Events" track would also be an option, if you want a breakdown
by event type. The basic query path is similar to those above.
Gene and Gene Prediction tracks -> Alt Events -> Alternative Splicing, 
Alternative Promoter and Similar Events in UCSC Genes
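
One possible refinement of the simple row count, if you download the
knownIsoforms table, is to count how many clusters (genes) have more than
one transcript. The Python sketch below assumes the rows have already
been parsed into (clusterId, transcript) pairs; the function name and the
sample pairs are made up for illustration.

```python
from collections import Counter


def transcript_summary(pairs):
    """Summarize transcripts per cluster from (clusterId, transcript) pairs."""
    per_cluster = Counter(cluster_id for cluster_id, _ in pairs)
    return {
        "transcripts": sum(per_cluster.values()),
        "clusters": len(per_cluster),
        "clusters_with_multiple_transcripts": sum(
            1 for n in per_cluster.values() if n > 1
        ),
    }


# Illustrative pairs only.
sample = [
    ("1", "tx-a"),
    ("1", "tx-b"),
    ("2", "tx-c"),
    ("3", "tx-d"),
    ("3", "tx-e"),
    ("3", "tx-f"),
]
print(transcript_summary(sample))
```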

---------------------------------
Jennifer Jackson
UCSC Genome Informatics Group
http://genome.ucsc.edu/

On 5/12/10 11:43 AM, Raymond Neubauer wrote:
> Dear Folks,
>
> I am looking for summary information per species on 1) total number of
> protein coding genes in its genome, 2) size of regulatory regions, 3) number
> of miRNA genes, 4) number of alternative transcripts
>
> How can I get any or all of this summary information from your data set?
>
>
> Sincerely,
>
>
> Dr. Raymond Neubauer
>
>
>
> **************************************
>
> Dr. Raymond L. Neubauer
> Senior Lecturer
> Biological Sciences
>
> Office: Painter 1.06B
> Spring, 2010 Office Hours:
> By Appointment - send an e-mail
>
> 512/471-4741
>
> Campus Mail:
> Bio. Labs
> A6700
>
> Regular Mail:
> Molecular and Developmental Biology
> Bio. Labs Room 311
> A6700
> University of Texas
> Austin, Texas 78712
> *************************************
>
>
>
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome