Hive QL - NOT IN, NOT EXIST

2013-05-05 Thread Peter Chu
Hi, I am trying to write a hive query to find the equivalent of NOT IN / NOT 
EXIST in SQL.
However, Hive does not support this.  It does have Left Semi Join which serves 
as IN but NOT does not support.
I am wondering if there is any way to do this without resorting to using left 
outer join and finding nulls.
Example Query in SQL:
SELECT uuid from feed f WHERE f.uuid NOT IN (SELECT uuid FROM message); 
  

Re: Hive QL - NOT IN, NOT EXIST

2013-05-05 Thread Michael Malak

--- On Sun, 5/5/13, Peter Chu pete@outlook.com wrote:

 I am wondering if there is any way to do this without resorting to
 using left outer join and finding nulls.

I have found this to be an acceptable substitute.  Is it not working for you?



RE: Hive QL - NOT IN, NOT EXIST

2013-05-05 Thread Peter Chu
It works but it takes a very long time because the subqueries in NOT IN 
contains 400 million rows (the message table in the example) and the feed table 
contains 3 million rows.
SELECT uuid from feed f WHERE f.uuid NOT IN (SELECT uuid FROM message);
 Date: Sun, 5 May 2013 20:25:15 -0700
 From: michaelma...@yahoo.com
 Subject: Re: Hive QL - NOT IN, NOT EXIST
 To: user@hive.apache.org
 
 
 --- On Sun, 5/5/13, Peter Chu pete@outlook.com wrote:
 
  I am wondering if there is any way to do this without resorting to
  using left outer join and finding nulls.
 
 I have found this to be an acceptable substitute.  Is it not working for you?
 
  

Re: Hive QL - NOT IN, NOT EXIST

2013-05-05 Thread Stephen Boesch
@Peter  Does the query plan demonstrate that the 3Meg row table is being
map-joined and the 400M table streamed through? That is what you want: but
you might either need to fiddle with hints to get it to happen

Details:
Read uuids s of feed into  in-memory map on all nodes (mapjoin)
Stream the 400M message records through the in-memory maps, copying
id's from the all feed uuids  map to a  matched feed uuid's map for
entries that have matches in the messages

 Note: this way the 400M rows are only read once on the cluster.

You can see whether hive can manage this or if you write a custom m/r job
to do it.



2013/5/5 Peter Chu pete@outlook.com

 It works but it takes a very long time because the subqueries in NOT IN
 contains 400 million rows (the message table in the example) and the feed
 table contains 3 million rows.

 SELECT uuid from feed f WHERE f.uuid NOT IN (SELECT uuid FROM message);

  Date: Sun, 5 May 2013 20:25:15 -0700
  From: michaelma...@yahoo.com
  Subject: Re: Hive QL - NOT IN, NOT EXIST
  To: user@hive.apache.org

 
 
  --- On Sun, 5/5/13, Peter Chu pete@outlook.com wrote:
 
   I am wondering if there is any way to do this without resorting to
   using left outer join and finding nulls.
 
  I have found this to be an acceptable substitute. Is it not working for
 you?
 



RE: Hive QL - NOT IN, NOT EXIST

2013-05-05 Thread Peter Chu
Thanks, Stephen,
I do not quite understand what you mean by Stream, specifically Stream the 
400M message records through the in-memory maps.Can you please elaborate.
Also, can you use MAPJOIN on left outer join?
Peter

Date: Sun, 5 May 2013 21:44:37 -0700
Subject: Re: Hive QL - NOT IN, NOT EXIST
From: java...@gmail.com
To: user@hive.apache.org


@Peter  Does the query plan demonstrate that the 3Meg row table is being 
map-joined and the 400M table streamed through? That is what you want: but you 
might either need to fiddle with hints to get it to happen

Details:Read uuids s of feed into  in-memory map on all nodes (mapjoin) 
Stream the 400M message records through the in-memory maps, copying id's from 
the all feed uuids  map to a  matched feed uuid's map for entries that have 
matches in the messages 

 Note: this way the 400M rows are only read once on the cluster. 
You can see whether hive can manage this or if you write a custom m/r job to do 
it.
 

2013/5/5 Peter Chu pete@outlook.com




It works but it takes a very long time because the subqueries in NOT IN 
contains 400 million rows (the message table in the example) and the feed table 
contains 3 million rows.

SELECT uuid from feed f WHERE f.uuid NOT IN (SELECT uuid FROM message);

 Date: Sun, 5 May 2013 20:25:15 -0700
 From: michaelma...@yahoo.com
 Subject: Re: Hive QL - NOT IN, NOT EXIST
 To: user@hive.apache.org

 
 
 --- On Sun, 5/5/13, Peter Chu pete@outlook.com wrote:
 
  I am wondering if there is any way to do this without resorting to

  using left outer join and finding nulls.
 
 I have found this to be an acceptable substitute.  Is it not working for you?
 
  

  

Re: HIVE-3979 in Hive 0.11

2013-05-05 Thread Carl Steinbach
Hi John,

This is a mistake in the release notes. It will be fixed in the next 0.11
release candidate.

Thanks.

Carl


On Sat, May 4, 2013 at 6:18 AM, John Omernik j...@omernik.com wrote:

 I see in the release notes for HIVE -3979

 [HIVE-3979 https://issues.apache.org/jira/browse/HIVE-3979] - Provide
 syntax for unescaped regex on rlike, and other regexp_* functions

 Yet when I click on that JIRA there are not notes etc.  Could it be that
 this was included by mistake?  I am curious, if this isn't a mistake, how
 this was implemented.


 Thanks!

 John



Re: [VOTE] Apache Hive 0.11.0 Release Candidate 1

2013-05-05 Thread Carl Steinbach
I took a quick look. Here's a list of things I noticed:

* NOTICE
** Copyright is out of date.

* README.txt
** Add one more '=' to the title underline.
** Requirements section only lists Hadoop 0.20.x
** Will it work with Java 1.7?

* RELEASE_NOTES.txt
** The 0.11 section lists tickets that aren't marked as resolved/fixed on
JIRA, and which are not included in the git commit log (e.g. HIVE-3979).
This may also extend to the 0.10 release notes. Related question: How were
these release notes generated?

* docs/index.html
** Update email list links (e.g. hive-u...@hadoop.apache.org)

* src/build.properties
** year=2012

* src/eclipse-files/.classpath
** The guava jar is not in the right location. This breaks the eclipse
import.

Thanks.

Carl



On Tue, Apr 30, 2013 at 5:18 PM, Ashutosh Chauhan hashut...@apache.orgwrote:

 Hey all,

 Based on feedback from folks, I have respun release candidate, RC1.
 Please take a look. It basically fixes the size bloat of tarball.

 Source tag for RC1 is at:

 https://svn.apache.org/repos/asf/hive/tags/release-0.11.0-rc1


 Source tar ball and convenience binary artifacts can be found
 at:http://people.apache.org/~hashutosh/hive-0.11.0-rc1/

 Maven artifacts for hive are available
 at:https://repository.apache.org/content/repositories/orgapachehive-158/

 Maven artifacts for hcatalog are available at:

 https://repository.apache.org/content/repositories/orgapachehcatalog-159/


 This release has many goodies including HiveServer2, integrated
 hcatalog, windowing and analytical functions, decimal data type,
 better query planning, performance enhancements and various bug fixes.
 In total, we resolved more than 350 issues. Full list of fixed issues
 can be found at:  http://s.apache.org/8Fr


 Voting will conclude in 72 hours.

 Hive PMC Members: Please test and vote.

 Thanks,

 Ashutosh (On behalf of Hive contributors who made 0.11 a possibility)