Was there a typo below (v1 -> e1) ? event_log v1 JOIN event_log e2 ON
On Mon, Aug 23, 2010 at 1:36 PM, Rajappa Iyer <r...@mayin.org> wrote: > Consider the following table (I've omitted things like additional columns > and the serde specification since I think they are mostly irrelevant): > > CREATE TABLE event_log (visit_time bigint, visitor_id string, user_id > string ...) PARTITIONED BY (dt string) ROW FORMAT ...; > > Where visitor_id is assigned to every visitor and user_id is only present > for logged in users. In other words, visitor_id is never NULL, but user_id > can be. > > I want to find out, for a given day, how many visitors were also seen > yesterday. That is, I want the subset of visitor_id's which were also > present in the previous day. > > Here's a query I wrote: > > SELECT count(1) FROM ( > SELECT DISTINCT e1.visitor_id FROM > event_log v1 JOIN event_log e2 ON > (e1.visitor_id = e2.visitor_id AND e1.dt='2010-08-22' > AND e2.dt='2010-08-21')) event_log; > > The average number of total records for each partition is about 350K. > > This works, no problems. > > But if I issue the same query with visitor_id replaced by user_id, the > Stage-1 reducer does not seem to make progress -- it remains stuck around > say 68% or so in several runs. > > The last few lines from the log: > > 2010-08-23 13:13:27,036 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 > forwarding 214000000 rows > 2010-08-23 13:13:27,036 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 > forwarding 214000000 rows > 2010-08-23 13:13:29,230 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 > forwarding 215000000 rows > 2010-08-23 13:13:29,230 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 > forwarding 215000000 rows > 2010-08-23 13:13:31,489 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 > forwarding 216000000 rows > 2010-08-23 13:13:31,490 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 > forwarding 216000000 rows > 2010-08-23 13:13:33,680 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 > forwarding 217000000 rows > 2010-08-23 13:13:33,680 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 > forwarding 217000000 rows > > The number of rows keep on increasing without bound it seems. > > Adding the condition "e1.user_id IS NOT NULL" to the join condition helps > matters considerably in that the job finishes but it is still not as fast as > the query for visitor_id. > > Any ideas on how to improve the speed of the query? > > This was on hive-0.5.0+20 from CDH3. > > Thanks, > Raj > >