RE: RE: hive query doesn't seem to limit itself to partitions based on the WHERE clause

Steven Wong Wed, 06 Oct 2010 17:06:14 -0700

What Hive version are you running? Try an "explain extended" on your insert 
query and see if unneeded partitions are included.


Pacific Standard Time (PST) is UTC-08:00, while Pacific Daylight Time (PDT) is 
UTC-07:00. To convert UTC to PDT, the condition should be:
(HF.dt = '2010-09-29' AND HF.hr >= '07' ) OR (HF.dt = '2010-09-30' AND HF.hr < 
'07' )
instead of:
(HF.dt = '2010-09-29' AND HF.hr > '07' ) OR (HF.dt = '2010-09-30' AND HF.hr <= 
'07' )

Good luck on the days we spring forward or fall back. :)/:(


From: Marc Limotte [mailto:mslimo...@gmail.com]
Sent: Wednesday, October 06, 2010 11:12 AM
To: hive-user@hadoop.apache.org
Subject: Re: RE: hive query doesn't seem to limit itself to partitions based on 
the WHERE clause

Thanks for the response, Edward.

The source table (hourly_fact) is partitioned on dt (date) and hr (hour), and 
I've confirmed that they are both String fields (CREATE stmt is below).

The hourly_fact table contains 'number of requests' for each hour by a few 
dimensions.  The query is just trying to get a daily aggregation across those 
same dimensions.  The only trick is that the hourly_fact table has dt and hour 
in UTC time.  And the daily aggregation is being done for a PST (pacific std) 
day, hence the 7 hour offset.
CREATE TABLE IF NOT EXISTS hourly_fact (
  tagtype               STRING,
  country               STRING,
  company               INT,
  request_keyword       STRING,
  receiver_code         STRING,
  referrer_domain       STRING,
  num_requests          INT,
  num_new_user_requests INT
)
PARTITIONED BY (dt STRING, hr STRING)
ROW FORMAT DELIMITED
STORED AS SEQUENCEFILE
LOCATION "...";

Marc
On Tue, Oct 5, 2010 at 4:30 PM, Edward Capriolo 
<edlinuxg...@gmail.com<mailto:edlinuxg...@gmail.com>> wrote:
On Tue, Oct 5, 2010 at 3:36 PM, Marc Limotte 
<mslimo...@gmail.com<mailto:mslimo...@gmail.com>> wrote:
> Hi Namit,
>
> Hourly_fact is partitioned on dt and hr.
>
> Marc
>
> On Oct 3, 2010 10:00 PM, "Namit Jain" 
> <nj...@facebook.com<mailto:nj...@facebook.com>> wrote:
>> What is your table hourly_fact partitioned on ?
>>
>> ________________________________________
>> From: Marc Limotte [mslimo...@gmail.com<mailto:mslimo...@gmail.com>]
>> Sent: Friday, October 01, 2010 2:10 PM
>> To: hive-user@hadoop.apache.org<mailto:hive-user@hadoop.apache.org>
>> Subject: hive query doesn't seem to limit itself to partitions based on
>> the WHERE clause
>>
>> Hi,
>>
>> From looking at the hive log output, it seems that my job is accessing
>> many more partitions than it needs to? For example, my query is something
>> like:
>>
>> INSERT OVERWRITE TABLE daily_fact
>> PARTITION (dt='2010-09-29')
>> SELECT
>> 20100929 as stamp,
>> tagtype,
>> country,
>> sum(num_requests) AS num_requests
>> FROM
>> hourly_fact HF
>> WHERE
>> (HF.dt = '2010-09-29' AND HF.hr > '07' )
>> OR (HF.dt = '2010-09-30' AND HF.hr <= '07' )
>> GROUP BY
>> 20100929, tagtype, country
>>
>> Based on the WHERE clause, I would expect it to look only at partitions in
>> the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But, the log
>> contains entries like:
>>
>> 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file
>> hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10
>>
>> And many other hours outside my WHERE constraint. I assume this means that
>> it's processing those directories. The answer still comes out right, but I'm
>> concerned about the performance.
>>
>> Would appreciate some help understanding what this means and how to fix
>> it.
>>
>> Thanks,
>> Marc
>>
>>
>
Possibly you defined HF.hr <= '07'  as an int column and comparing it
as a string is resulting in a full table scan. Can you explain the
query?

RE: RE: hive query doesn't seem to limit itself to partitions based on the WHERE clause

Reply via email to