Hive QL - NOT IN, NOT EXIST
Hi, I am trying to write a hive query to find the equivalent of NOT IN / NOT EXIST in SQL. However, Hive does not support this. It does have Left Semi Join which serves as IN but NOT does not support. I am wondering if there is any way to do this without resorting to using left outer join and finding nulls. Example Query in SQL: SELECT uuid from feed f WHERE f.uuid NOT IN (SELECT uuid FROM message);
Re: Hive QL - NOT IN, NOT EXIST
--- On Sun, 5/5/13, Peter Chu pete@outlook.com wrote: I am wondering if there is any way to do this without resorting to using left outer join and finding nulls. I have found this to be an acceptable substitute. Is it not working for you?
RE: Hive QL - NOT IN, NOT EXIST
It works but it takes a very long time because the subqueries in NOT IN contains 400 million rows (the message table in the example) and the feed table contains 3 million rows. SELECT uuid from feed f WHERE f.uuid NOT IN (SELECT uuid FROM message); Date: Sun, 5 May 2013 20:25:15 -0700 From: michaelma...@yahoo.com Subject: Re: Hive QL - NOT IN, NOT EXIST To: user@hive.apache.org --- On Sun, 5/5/13, Peter Chu pete@outlook.com wrote: I am wondering if there is any way to do this without resorting to using left outer join and finding nulls. I have found this to be an acceptable substitute. Is it not working for you?
Re: Hive QL - NOT IN, NOT EXIST
@Peter Does the query plan demonstrate that the 3Meg row table is being map-joined and the 400M table streamed through? That is what you want: but you might either need to fiddle with hints to get it to happen Details: Read uuids s of feed into in-memory map on all nodes (mapjoin) Stream the 400M message records through the in-memory maps, copying id's from the all feed uuids map to a matched feed uuid's map for entries that have matches in the messages Note: this way the 400M rows are only read once on the cluster. You can see whether hive can manage this or if you write a custom m/r job to do it. 2013/5/5 Peter Chu pete@outlook.com It works but it takes a very long time because the subqueries in NOT IN contains 400 million rows (the message table in the example) and the feed table contains 3 million rows. SELECT uuid from feed f WHERE f.uuid NOT IN (SELECT uuid FROM message); Date: Sun, 5 May 2013 20:25:15 -0700 From: michaelma...@yahoo.com Subject: Re: Hive QL - NOT IN, NOT EXIST To: user@hive.apache.org --- On Sun, 5/5/13, Peter Chu pete@outlook.com wrote: I am wondering if there is any way to do this without resorting to using left outer join and finding nulls. I have found this to be an acceptable substitute. Is it not working for you?
RE: Hive QL - NOT IN, NOT EXIST
Thanks, Stephen, I do not quite understand what you mean by Stream, specifically Stream the 400M message records through the in-memory maps.Can you please elaborate. Also, can you use MAPJOIN on left outer join? Peter Date: Sun, 5 May 2013 21:44:37 -0700 Subject: Re: Hive QL - NOT IN, NOT EXIST From: java...@gmail.com To: user@hive.apache.org @Peter Does the query plan demonstrate that the 3Meg row table is being map-joined and the 400M table streamed through? That is what you want: but you might either need to fiddle with hints to get it to happen Details:Read uuids s of feed into in-memory map on all nodes (mapjoin) Stream the 400M message records through the in-memory maps, copying id's from the all feed uuids map to a matched feed uuid's map for entries that have matches in the messages Note: this way the 400M rows are only read once on the cluster. You can see whether hive can manage this or if you write a custom m/r job to do it. 2013/5/5 Peter Chu pete@outlook.com It works but it takes a very long time because the subqueries in NOT IN contains 400 million rows (the message table in the example) and the feed table contains 3 million rows. SELECT uuid from feed f WHERE f.uuid NOT IN (SELECT uuid FROM message); Date: Sun, 5 May 2013 20:25:15 -0700 From: michaelma...@yahoo.com Subject: Re: Hive QL - NOT IN, NOT EXIST To: user@hive.apache.org --- On Sun, 5/5/13, Peter Chu pete@outlook.com wrote: I am wondering if there is any way to do this without resorting to using left outer join and finding nulls. I have found this to be an acceptable substitute. Is it not working for you?
Re: HIVE-3979 in Hive 0.11
Hi John, This is a mistake in the release notes. It will be fixed in the next 0.11 release candidate. Thanks. Carl On Sat, May 4, 2013 at 6:18 AM, John Omernik j...@omernik.com wrote: I see in the release notes for HIVE -3979 [HIVE-3979 https://issues.apache.org/jira/browse/HIVE-3979] - Provide syntax for unescaped regex on rlike, and other regexp_* functions Yet when I click on that JIRA there are not notes etc. Could it be that this was included by mistake? I am curious, if this isn't a mistake, how this was implemented. Thanks! John
Re: [VOTE] Apache Hive 0.11.0 Release Candidate 1
I took a quick look. Here's a list of things I noticed: * NOTICE ** Copyright is out of date. * README.txt ** Add one more '=' to the title underline. ** Requirements section only lists Hadoop 0.20.x ** Will it work with Java 1.7? * RELEASE_NOTES.txt ** The 0.11 section lists tickets that aren't marked as resolved/fixed on JIRA, and which are not included in the git commit log (e.g. HIVE-3979). This may also extend to the 0.10 release notes. Related question: How were these release notes generated? * docs/index.html ** Update email list links (e.g. hive-u...@hadoop.apache.org) * src/build.properties ** year=2012 * src/eclipse-files/.classpath ** The guava jar is not in the right location. This breaks the eclipse import. Thanks. Carl On Tue, Apr 30, 2013 at 5:18 PM, Ashutosh Chauhan hashut...@apache.orgwrote: Hey all, Based on feedback from folks, I have respun release candidate, RC1. Please take a look. It basically fixes the size bloat of tarball. Source tag for RC1 is at: https://svn.apache.org/repos/asf/hive/tags/release-0.11.0-rc1 Source tar ball and convenience binary artifacts can be found at:http://people.apache.org/~hashutosh/hive-0.11.0-rc1/ Maven artifacts for hive are available at:https://repository.apache.org/content/repositories/orgapachehive-158/ Maven artifacts for hcatalog are available at: https://repository.apache.org/content/repositories/orgapachehcatalog-159/ This release has many goodies including HiveServer2, integrated hcatalog, windowing and analytical functions, decimal data type, better query planning, performance enhancements and various bug fixes. In total, we resolved more than 350 issues. Full list of fixed issues can be found at: http://s.apache.org/8Fr Voting will conclude in 72 hours. Hive PMC Members: Please test and vote. Thanks, Ashutosh (On behalf of Hive contributors who made 0.11 a possibility)