Hi Namit,

Hourly_fact is partitioned on dt and hr.

Marc
On Oct 3, 2010 10:00 PM, "Namit Jain" <nj...@facebook.com> wrote:
> What is your table hourly_fact partitioned on ?
>
> ________________________________________
> From: Marc Limotte [mslimo...@gmail.com]
> Sent: Friday, October 01, 2010 2:10 PM
> To: hive-user@hadoop.apache.org
> Subject: hive query doesn't seem to limit itself to partitions based on
the WHERE clause
>
> Hi,
>
> From looking at the hive log output, it seems that my job is accessing
many more partitions than it needs to? For example, my query is something
like:
>
> INSERT OVERWRITE TABLE daily_fact
> PARTITION (dt='2010-09-29')
> SELECT
> 20100929 as stamp,
> tagtype,
> country,
> sum(num_requests) AS num_requests
> FROM
> hourly_fact HF
> WHERE
> (HF.dt = '2010-09-29' AND HF.hr > '07' )
> OR (HF.dt = '2010-09-30' AND HF.hr <= '07' )
> GROUP BY
> 20100929, tagtype, country
>
> Based on the WHERE clause, I would expect it to look only at partitions in
the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But, the log
contains entries like:
>
> 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file
hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10
>
> And many other hours outside my WHERE constraint. I assume this means that
it's processing those directories. The answer still comes out right, but I'm
concerned about the performance.
>
> Would appreciate some help understanding what this means and how to fix
it.
>
> Thanks,
> Marc
>
>

Reply via email to