Thanks for the response, Edward. The source table (hourly_fact) is partitioned on dt (date) and hr (hour), and I've confirmed that they are both String fields (CREATE stmt is below).
The hourly_fact table contains 'number of requests' for each hour by a few dimensions. The query is just trying to get a daily aggregation across those same dimensions. The only trick is that the hourly_fact table has dt and hour in UTC time. And the daily aggregation is being done for a PST (pacific std) day, hence the 7 hour offset. CREATE TABLE IF NOT EXISTS hourly_fact ( tagtype STRING, country STRING, company INT, request_keyword STRING, receiver_code STRING, referrer_domain STRING, num_requests INT, num_new_user_requests INT ) PARTITIONED BY (dt STRING, hr STRING) ROW FORMAT DELIMITED STORED AS SEQUENCEFILE LOCATION "..."; Marc On Tue, Oct 5, 2010 at 4:30 PM, Edward Capriolo <edlinuxg...@gmail.com>wrote: > On Tue, Oct 5, 2010 at 3:36 PM, Marc Limotte <mslimo...@gmail.com> wrote: > > Hi Namit, > > > > Hourly_fact is partitioned on dt and hr. > > > > Marc > > > > On Oct 3, 2010 10:00 PM, "Namit Jain" <nj...@facebook.com> wrote: > >> What is your table hourly_fact partitioned on ? > >> > >> ________________________________________ > >> From: Marc Limotte [mslimo...@gmail.com] > >> Sent: Friday, October 01, 2010 2:10 PM > >> To: hive-user@hadoop.apache.org > >> Subject: hive query doesn't seem to limit itself to partitions based on > >> the WHERE clause > >> > >> Hi, > >> > >> From looking at the hive log output, it seems that my job is accessing > >> many more partitions than it needs to? For example, my query is > something > >> like: > >> > >> INSERT OVERWRITE TABLE daily_fact > >> PARTITION (dt='2010-09-29') > >> SELECT > >> 20100929 as stamp, > >> tagtype, > >> country, > >> sum(num_requests) AS num_requests > >> FROM > >> hourly_fact HF > >> WHERE > >> (HF.dt = '2010-09-29' AND HF.hr > '07' ) > >> OR (HF.dt = '2010-09-30' AND HF.hr <= '07' ) > >> GROUP BY > >> 20100929, tagtype, country > >> > >> Based on the WHERE clause, I would expect it to look only at partitions > in > >> the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But, the > log > >> contains entries like: > >> > >> 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file > >> hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10 > >> > >> And many other hours outside my WHERE constraint. I assume this means > that > >> it's processing those directories. The answer still comes out right, but > I'm > >> concerned about the performance. > >> > >> Would appreciate some help understanding what this means and how to fix > >> it. > >> > >> Thanks, > >> Marc > >> > >> > > > Possibly you defined HF.hr <= '07' as an int column and comparing it > as a string is resulting in a full table scan. Can you explain the > query? >