Hey Vijay,

You can go to the MapReduce UI, which normally runs on port 50030 of the
JobTracker node (often co-located with the namenode on small clusters),
and see how many map tasks were created for your submitted query.

You said that the events table has daily partitions, but the example query
you posted does not prune the partitions with a WHERE clause. So I have the
following questions:
1) How big is the table (you can just do a hadoop dfs -dus
<hdfs-dir-for-table>)? How many partitions?
2) Do you really intend to count the number of events across all days?
3) Could you build a query that computes over 1-5 day(s) and persists the
data in a separate table for consumption later on?
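For example, questions 2 and 3 might look something like the sketch below.
This is just an illustration: I'm assuming your daily partition column is
called ds and guessing at date values, so adjust the names and range to
match your actual schema.

```sql
-- Sketch only: assumes the partition column is named ds (a string like
-- '2011-02-01'). A WHERE clause on the partition column lets Hive prune
-- partitions so only those days' files are scanned.

-- (2) Count a bounded date range instead of the whole table:
SELECT COUNT(1)
FROM events
WHERE ds >= '2011-01-29' AND ds <= '2011-02-02';

-- (3) Persist per-day counts into a separate table for later consumption:
CREATE TABLE event_counts (ds STRING, cnt BIGINT);

INSERT OVERWRITE TABLE event_counts
SELECT ds, COUNT(1)
FROM events
WHERE ds >= '2011-01-29' AND ds <= '2011-02-02'
GROUP BY ds;
```

Once event_counts exists, later queries can read the small summary table
instead of re-scanning the gzipped logs.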

Based on your node configuration, my guess is that the amount of data to
process is simply too large, and hence the high CPU.

Thanks,
Viral

On Thu, Feb 3, 2011 at 12:49 PM, Vijay <tec...@gmail.com> wrote:

> Hi,
>
> The simplest of hive queries seem to be consuming 100% cpu. This is
> with a small 4-node cluster. The machines are pretty beefy (16 cores
> per machine, tons of RAM, 16 M+R maximum tasks configured, 1GB RAM for
> mapred.child.java.opts, etc). A simple query like "select count(1)
> from events" pegs the CPUs (the events table has daily partitions of
> log files in gzipped format). While this is probably too generic a
> question and there is a bunch of investigation we need to do, are
> there any specific areas for me to look at? Has anyone seen anything
> like this before? Also, are there any tools or easy options to profile
> Hive query execution?
>
> Thanks in advance,
> Vijay
>