I tried this. Unfortunately, both tables are large. Thanks, Rakesh
________________________________ From: Namit Jain [mailto:[email protected]] Sent: Monday, June 29, 2009 5:05 PM To: [email protected] Subject: RE: Set difference in Hive The tables can be large - For a given key, have the table with the most number of values as the rightmost table. The problem only happens when both the tables have keys with large number of values. Thanks, -namit From: Rakesh Setty [mailto:[email protected]] Sent: Monday, June 29, 2009 4:43 PM To: [email protected] Subject: RE: Set difference in Hive Thanks very much. But the reducer hangs with the warning WARN org.apache.hadoop.hive.ql.exec.JoinOperator: table 0 has more than joinEmitInterval rows for join key [] Both the tables are large and as Zheng mentions at http://www.mail-archive.com/[email protected]/msg00640.html, large size for table 0 is a problem. Is there any way to overcome this? Thanks, Rakesh ________________________________ From: Peter Skomoroch [mailto:[email protected]] Sent: Monday, June 29, 2009 4:20 PM To: [email protected] Subject: Re: Set difference in Hive Here is an example of what Amr mentioned from one of my Hive scripts, returns the set of pages not in "daily_pagecounts_table" select dt.page_id, dt.dates, dt.pageviews, dt.total_pageviews FROM daily_timelines dt LEFT OUTER JOIN daily_pagecounts_table dp ON (dt.page_id = dp.page_id) where dp.page_id is NULL On Mon, Jun 29, 2009 at 7:14 PM, Amr Awadallah <[email protected]<mailto:[email protected]>> wrote: do an outer join on user and filter on name.user is null -- amr Rakesh Setty wrote: Hi, I am new to Hive. I would like to know what is the easiest way to get the difference between two sets. For example, how can I convert the following SQL query to Hive? select user from page_views where user not in (select name from users); Thanks, Rakesh -- Peter N. Skomoroch 617.285.8348 http://www.datawrangling.com http://delicious.com/pskomoroch http://twitter.com/peteskomoroch
