Re: OutOfMemory when doing map-side join

Min Zhou Sun, 14 Jun 2009 22:21:16 -0700

hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url
from application a join web_log w
where w.logdate='20090611' and w.url rlike a.url_pattern
and a.dt='20090609';
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a) (TOK_TABREF
web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
(TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST a)))
(TOK_SELEXPR (. (TOK_TABLE_OR_COL a) url_pattern)) (TOK_SELEXPR (.
(TOK_TABLE_OR_COL w) url))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w)
logdate) '20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (.
(TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a) dt)
'20090609')))))


STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        w
            Select Operator
              expressions:
                    expr: url
                    type: string
                    expr: logdate
                    type: string
              Common Join Operator
                condition map:
                     Inner Join 0 to 1
                condition expressions:
                  0 {0} {1}
                  1 {0} {1}
                keys:
                  0
                  1
                Position of Big Table: 1
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
      Local Work:
        Map Reduce Local Work
          Alias -> Map Local Tables:
            a
              Fetch Operator
                limit: -1
          Alias -> Map Local Operator Tree:
            a
                Select Operator
                  expressions:
                        expr: url_pattern
                        type: string
                        expr: dt
                        type: string
                  Common Join Operator
                    condition map:
                         Inner Join 0 to 1
                    condition expressions:
                      0 {0} {1}
                      1 {0} {1}
                    keys:
                      0
                      1
                    Position of Big Table: 1
                    File Output Operator
                      compressed: false
                      GlobalTableId: 0
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
          Select Operator
            Filter Operator
              predicate:
                  expr: (((3 = '20090611') and (2 regexp 0)) and (1 = '20090609'))
                  type: boolean
              Select Operator
                expressions:
                      expr: 0
                      type: string
                      expr: 2
                      type: string
                File Output Operator
                  compressed: true
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
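
Reading the plan above: the Common Join Operator in Stage-1 lists no join keys, and all three predicates, including the two partition filters, appear only in the Filter Operator of Stage-2. If that reading is right, Stage-1 produces the full cross product before anything is filtered. The Python sketch below only illustrates the difference in intermediate row counts; the sample rows and function names are made up for illustration, and this is not Hive code.

```python
import re
from itertools import product

# Hypothetical sample rows: (url_pattern, dt) and (url, logdate).
application = [("^/item/.*", "20090609"), ("^/shop/.*", "20090609"),
               ("^/old/.*", "20090608")]
web_log = [("/item/1", "20090611"), ("/shop/9", "20090611"),
           ("/item/2", "20090610")]


def cross_then_filter(a_rows, w_rows):
    """Mimic the plan shape: pair every row first, filter afterwards."""
    joined = list(product(a_rows, w_rows))  # no keys, no filter yet
    result = [(p, u) for (p, dt), (u, ld) in joined
              if ld == "20090611" and re.search(p, u) and dt == "20090609"]
    return len(joined), result


def filter_then_cross(a_rows, w_rows):
    """Apply the partition predicates first, then pair what is left."""
    a_kept = [p for p, dt in a_rows if dt == "20090609"]
    w_kept = [u for u, ld in w_rows if ld == "20090611"]
    pairs = 0
    result = []
    for p, u in product(a_kept, w_kept):
        pairs += 1
        if re.search(p, u):
            result.append((p, u))
    return pairs, result
```

Both functions return the same matches, but the first one materializes every (application, web_log) pair, which is the part that grows with the size of both inputs.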

On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain <[email protected]> wrote:

> I was looking at the code, and there may be a bug in the cartesian product
> code path for map-join.
>
> Can you run an explain plan and send it?
>
>
>
>
>
> On 6/14/09 10:06 PM, "Min Zhou" <[email protected]> wrote:
>
>
> 1. Tried setting hive.mapjoin.cache.numrows to 100; it failed with the
> same exception.
> 2. Actually, we used to do the same thing by loading small tables into the
> memory of each map node in a normal map-reduce job on the same cluster, so
> the heap size is guaranteed to be the same for the Hive map-side join and
> our map-reduce job. OOM exceptions never happened there: only 1MB was
> needed to load those 20k records, while mapred.child.java.opts was set to
> -Xmx200m.
>
> Here is the schema of our small table:
> > describe application;
> transaction_id  string
> subclass_id     string
> class_id        string
> memo            string
> url_alias       string
> url_pattern     string
> dt              string  (daily partitioned)
>
> Thanks,
> Min
> On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <[email protected]> wrote:
>
> 1. Can you reduce the number of cached rows and try?
>
> 2. Were you using the default memory settings of the mapper? If yes, can
> you increase them and try?
>
> It would be useful to try both of them independently; it would also give a
> good idea of the memory consumption of JDBM.
>
>
> Can you send the exact schema/data of the small table if possible? You can
> file a jira and load it there if it is not a security issue.
>
> Thanks,
> -namit
>
>
>
> On 6/14/09 9:23 PM, "Min Zhou" <[email protected]> wrote:
>
> 20k
>
>
>
>
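
For reference, the hand-written approach described in point 2 of the quoted message (load the small table into memory on each mapper, then match every log URL against the cached patterns) could look roughly like the Python sketch below. The function names and the row layout are assumptions made for illustration; the original job was a normal map-reduce job, not this code.

```python
import re


def load_patterns(rows, dt):
    """Compile url_pattern for the rows of one daily partition.

    rows is an iterable of (url_pattern, dt) tuples (assumed layout).
    """
    return [(p, re.compile(p)) for p, row_dt in rows if row_dt == dt]


def map_side_join(patterns, urls):
    """Yield (url_pattern, url) for each url that matches a cached pattern."""
    for url in urls:
        for text, compiled in patterns:
            if compiled.search(url):
                yield text, url
```

Only the compiled patterns stay resident, and the log is streamed one URL at a time. Taking the 1MB and 20k figures from the quoted message at face value, that is roughly 50 bytes per record.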


-- 
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
