Thanks Raghu and Namit. This works.

Thanks,
Rakesh

-----Original Message-----
From: Raghu Murthy [mailto:[email protected]]
Sent: Monday, June 29, 2009 5:22 PM
To: [email protected]
Subject: Re: Set difference in Hive

Do you care about getting duplicate 'user' values in the result? If not, you
could just get rid of duplicates before doing the join.

select a.user
from (select distinct user from page_views) a
LEFT OUTER JOIN users b ON (a.user = b.user)
WHERE b.user is NULL

Note that the join runs out of memory in the reducer only if the number of
duplicates is large, not just when the tables are large.

On 6/29/09 5:11 PM, "Rakesh Setty" <[email protected]> wrote:

> I tried this. Unfortunately, both tables are large.
>
> Thanks,
> Rakesh
>
> From: Namit Jain [mailto:[email protected]]
> Sent: Monday, June 29, 2009 5:05 PM
> To: [email protected]
> Subject: RE: Set difference in Hive
>
> The tables can be large -
>
> For a given key, have the table with the most number of values as the
> rightmost table.
>
> The problem only happens when both the tables have keys with large number of
> values.
>
> Thanks,
> -namit
>
> From: Rakesh Setty [mailto:[email protected]]
> Sent: Monday, June 29, 2009 4:43 PM
> To: [email protected]
> Subject: RE: Set difference in Hive
>
> Thanks very much. But the reducer hangs with the warning
>
> WARN org.apache.hadoop.hive.ql.exec.JoinOperator: table 0 has more than
> joinEmitInterval rows for join key []
>
> Both the tables are large and as Zheng mentions at
> http://www.mail-archive.com/[email protected]/msg00640.html, large
> size for table 0 is a problem. Is there any way to overcome this?
>
> Thanks,
> Rakesh
>
> From: Peter Skomoroch [mailto:[email protected]]
> Sent: Monday, June 29, 2009 4:20 PM
> To: [email protected]
> Subject: Re: Set difference in Hive
>
> Here is an example of what Amr mentioned from one of my Hive scripts, returns
> the set of pages not in "daily_pagecounts_table"
>
> select dt.page_id, dt.dates, dt.pageviews, dt.total_pageviews
> FROM daily_timelines dt LEFT OUTER JOIN daily_pagecounts_table dp ON
> (dt.page_id = dp.page_id)
> where dp.page_id is NULL
>
> On Mon, Jun 29, 2009 at 7:14 PM, Amr Awadallah <[email protected]> wrote:
>
> > do an outer join on user and filter on name.user is null
> >
> > -- amr
> >
> > Rakesh Setty wrote:
> > Hi,
> >
> > I am new to Hive. I would like to know what is the easiest way to
> > get the difference between two sets. For example, how can I convert the
> > following SQL query to Hive?
> >
> > select user from page_views where user not in (select name from users);
> >
> > Thanks,
> > Rakesh
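
For anyone who wants to check the rewrite on a small data set before running it in Hive, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in for Hive here, so it says nothing about reducer memory or joinEmitInterval; it only shows that the LEFT OUTER JOIN ... IS NULL form returns the same rows as the NOT IN query, and that the DISTINCT subquery removes duplicate users before the join. The sample rows are made up, and the users column is assumed to be called name, as in the original question.

```python
import sqlite3

# In-memory database with made-up sample rows (assumption: users.name
# is the column to compare against page_views.user).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_views (user TEXT);
    CREATE TABLE users (name TEXT);
    INSERT INTO page_views VALUES
        ('alice'), ('bob'), ('bob'), ('carol'), ('carol');
    INSERT INTO users VALUES ('alice');
""")


def rows(sql):
    """Run a query and return its first column as a sorted list."""
    return sorted(r[0] for r in conn.execute(sql))


# The query from the original question.
not_in = rows(
    "SELECT user FROM page_views "
    "WHERE user NOT IN (SELECT name FROM users)"
)

# The outer-join rewrite: keep rows with no match on the right side.
anti_join = rows(
    "SELECT a.user FROM page_views a "
    "LEFT OUTER JOIN users b ON (a.user = b.name) "
    "WHERE b.name IS NULL"
)

# Same rewrite, but with duplicates removed before the join.
dedup = rows(
    "SELECT a.user FROM (SELECT DISTINCT user FROM page_views) a "
    "LEFT OUTER JOIN users b ON (a.user = b.name) "
    "WHERE b.name IS NULL"
)

print(not_in)     # ['bob', 'bob', 'carol', 'carol']
print(anti_join)  # ['bob', 'bob', 'carol', 'carol']
print(dedup)      # ['bob', 'carol']
```

The first two results match row for row, duplicates included; the third differs only in that each user appears once, which is the trade-off raised in the thread.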
