On Tue, Oct 5, 2010 at 3:36 PM, Marc Limotte <mslimo...@gmail.com> wrote:
> Hi Namit,
>
> Hourly_fact is partitioned on dt and hr.
>
> Marc
>
> On Oct 3, 2010 10:00 PM, "Namit Jain" <nj...@facebook.com> wrote:
>> What is your table hourly_fact partitioned on ?
>>
>> ________________________________________
>> From: Marc Limotte [mslimo...@gmail.com]
>> Sent: Friday, October 01, 2010 2:10 PM
>> To: hive-user@hadoop.apache.org
>> Subject: hive query doesn't seem to limit itself to partitions based on
>> the WHERE clause
>>
>> Hi,
>>
>> From looking at the hive log output, it seems that my job is accessing
>> many more partitions than it needs to? For example, my query is something
>> like:
>>
>> INSERT OVERWRITE TABLE daily_fact
>> PARTITION (dt='2010-09-29')
>> SELECT
>> 20100929 as stamp,
>> tagtype,
>> country,
>> sum(num_requests) AS num_requests
>> FROM
>> hourly_fact HF
>> WHERE
>> (HF.dt = '2010-09-29' AND HF.hr > '07' )
>> OR (HF.dt = '2010-09-30' AND HF.hr <= '07' )
>> GROUP BY
>> 20100929, tagtype, country
>>
>> Based on the WHERE clause, I would expect it to look only at partitions in
>> the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But, the log
>> contains entries like:
>>
>> 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file
>> hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10
>>
>> And many other hours outside my WHERE constraint. I assume this means that
>> it's processing those directories. The answer still comes out right, but I'm
>> concerned about the performance.
>>
>> Would appreciate some help understanding what this means and how to fix
>> it.
>>
>> Thanks,
>> Marc
>>
>>
>
Possibly you defined HF.hr <= '07'  as an int column and comparing it
as a string is resulting in a full table scan. Can you explain the
query?

Reply via email to