Re: (Self) Joins on NULLable columns takes forever

Ted Yu Mon, 23 Aug 2010 15:43:18 -0700

Was there a typo below (v1 -> e1) ?

event_log v1 JOIN event_log e2 ON


On Mon, Aug 23, 2010 at 1:36 PM, Rajappa Iyer <r...@mayin.org> wrote:

> Consider the following table (I've omitted things like additional columns
> and the serde specification since I think they are mostly irrelevant):
>
> CREATE TABLE event_log (visit_time bigint, visitor_id string, user_id
> string ...) PARTITIONED BY (dt string) ROW FORMAT ...;
>
> Where visitor_id is assigned to every visitor and user_id is only present
> for logged in users.  In other words, visitor_id is never NULL, but user_id
> can be.
>
> I want to find out, for a given day, how many visitors were also seen
> yesterday.  That is, I want the subset of visitor_id's which were also
> present in the previous day.
>
> Here's a query I wrote:
>
> SELECT count(1) FROM (
>       SELECT DISTINCT e1.visitor_id FROM
>             event_log v1 JOIN event_log e2 ON
>                     (e1.visitor_id = e2.visitor_id AND e1.dt='2010-08-22'
> AND e2.dt='2010-08-21')) event_log;
>
> The average number of total records for each partition is about 350K.
>
> This works, no problems.
>
> But if I issue the same query with visitor_id replaced by user_id, the
> Stage-1 reducer does not seem to make progress -- it remains stuck around
> say 68% or so in several runs.
>
> The last few lines from the log:
>
> 2010-08-23 13:13:27,036 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 
> forwarding 214000000 rows
> 2010-08-23 13:13:27,036 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 
> forwarding 214000000 rows
> 2010-08-23 13:13:29,230 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 
> forwarding 215000000 rows
> 2010-08-23 13:13:29,230 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 
> forwarding 215000000 rows
> 2010-08-23 13:13:31,489 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 
> forwarding 216000000 rows
> 2010-08-23 13:13:31,490 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 
> forwarding 216000000 rows
> 2010-08-23 13:13:33,680 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 
> forwarding 217000000 rows
> 2010-08-23 13:13:33,680 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 
> forwarding 217000000 rows
>
> The number of rows keep on increasing without bound it seems.
>
> Adding the condition "e1.user_id IS NOT NULL" to the join condition helps
> matters considerably in that the job finishes but it is still not as fast as
> the query for visitor_id.
>
> Any ideas on how to improve the speed of the query?
>
> This was on hive-0.5.0+20 from CDH3.
>
> Thanks,
> Raj
>
>

Re: (Self) Joins on NULLable columns takes forever

Reply via email to