What Hive version are you running? Try an "explain extended" on your insert query and see if unneeded partitions are included.
Pacific Standard Time (PST) is UTC-08:00, while Pacific Daylight Time (PDT) is UTC-07:00. To convert UTC to PDT, the condition should be: (HF.dt = '2010-09-29' AND HF.hr >= '07' ) OR (HF.dt = '2010-09-30' AND HF.hr < '07' ) instead of: (HF.dt = '2010-09-29' AND HF.hr > '07' ) OR (HF.dt = '2010-09-30' AND HF.hr <= '07' ) Good luck on the days we spring forward or fall back. :)/:( From: Marc Limotte [mailto:mslimo...@gmail.com] Sent: Wednesday, October 06, 2010 11:12 AM To: hive-user@hadoop.apache.org Subject: Re: RE: hive query doesn't seem to limit itself to partitions based on the WHERE clause Thanks for the response, Edward. The source table (hourly_fact) is partitioned on dt (date) and hr (hour), and I've confirmed that they are both String fields (CREATE stmt is below). The hourly_fact table contains 'number of requests' for each hour by a few dimensions. The query is just trying to get a daily aggregation across those same dimensions. The only trick is that the hourly_fact table has dt and hour in UTC time. And the daily aggregation is being done for a PST (pacific std) day, hence the 7 hour offset. CREATE TABLE IF NOT EXISTS hourly_fact ( tagtype STRING, country STRING, company INT, request_keyword STRING, receiver_code STRING, referrer_domain STRING, num_requests INT, num_new_user_requests INT ) PARTITIONED BY (dt STRING, hr STRING) ROW FORMAT DELIMITED STORED AS SEQUENCEFILE LOCATION "..."; Marc On Tue, Oct 5, 2010 at 4:30 PM, Edward Capriolo <edlinuxg...@gmail.com<mailto:edlinuxg...@gmail.com>> wrote: On Tue, Oct 5, 2010 at 3:36 PM, Marc Limotte <mslimo...@gmail.com<mailto:mslimo...@gmail.com>> wrote: > Hi Namit, > > Hourly_fact is partitioned on dt and hr. > > Marc > > On Oct 3, 2010 10:00 PM, "Namit Jain" > <nj...@facebook.com<mailto:nj...@facebook.com>> wrote: >> What is your table hourly_fact partitioned on ? >> >> ________________________________________ >> From: Marc Limotte [mslimo...@gmail.com<mailto:mslimo...@gmail.com>] >> Sent: Friday, October 01, 2010 2:10 PM >> To: hive-user@hadoop.apache.org<mailto:hive-user@hadoop.apache.org> >> Subject: hive query doesn't seem to limit itself to partitions based on >> the WHERE clause >> >> Hi, >> >> From looking at the hive log output, it seems that my job is accessing >> many more partitions than it needs to? For example, my query is something >> like: >> >> INSERT OVERWRITE TABLE daily_fact >> PARTITION (dt='2010-09-29') >> SELECT >> 20100929 as stamp, >> tagtype, >> country, >> sum(num_requests) AS num_requests >> FROM >> hourly_fact HF >> WHERE >> (HF.dt = '2010-09-29' AND HF.hr > '07' ) >> OR (HF.dt = '2010-09-30' AND HF.hr <= '07' ) >> GROUP BY >> 20100929, tagtype, country >> >> Based on the WHERE clause, I would expect it to look only at partitions in >> the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But, the log >> contains entries like: >> >> 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file >> hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10 >> >> And many other hours outside my WHERE constraint. I assume this means that >> it's processing those directories. The answer still comes out right, but I'm >> concerned about the performance. >> >> Would appreciate some help understanding what this means and how to fix >> it. >> >> Thanks, >> Marc >> >> > Possibly you defined HF.hr <= '07' as an int column and comparing it as a string is resulting in a full table scan. Can you explain the query?