hmm, that is 100KB per row by my math.

20K rows * 100KB/row = 2GB

-- amr

Ashish Thusoo wrote:
That does not sound right. Each row is 100MB? That sounds like too much...
Ashish

------------------------------------------------------------------------
*From:* Min Zhou [mailto:[email protected]]
*Sent:* Monday, June 15, 2009 7:16 PM
*To:* [email protected]
*Subject:* Re: OutOfMemory when doing map-side join

20k rows need 2GB of memory? That is terrible. My whole small table is less than 4MB; what about yours?

On Tue, Jun 16, 2009 at 6:59 AM, Namit Jain <[email protected] <mailto:[email protected]>> wrote:

    Set mapred.child.java.opts to increase mapper memory.
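
    For example, this can be set per-session from the Hive CLI (the heap value below is illustrative; adjust it for your cluster):

    ```sql
    -- Illustrative value: raise the map task JVM heap to 2GB for this session.
    SET mapred.child.java.opts=-Xmx2048m;
    ```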

    *From:* Namit Jain [mailto:[email protected]
    <mailto:[email protected]>]
    *Sent:* Monday, June 15, 2009 3:53 PM

    *To:* [email protected] <mailto:[email protected]>
    *Subject:* RE: OutOfMemory when doing map-side join

    There are multiple things going on.

    Column pruning is not working with map-joins. It is being tracked at:

    https://issues.apache.org/jira/browse/HIVE-560

    Also, since it is a Cartesian product, JDBM does not help,
    because a single key can be very large.

    For now, you can do the column pruning yourself -- create a new
    table with only the columns needed, and then join it with the
    bigger table.
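
    A sketch of that workaround, using the table and column names from the query elsewhere in this thread ('application', 'url_pattern', 'dt'); the pruned table name is made up:

    ```sql
    -- Hypothetical pruned copy: keep only the column the join needs.
    CREATE TABLE application_pruned (url_pattern STRING);
    INSERT OVERWRITE TABLE application_pruned
    SELECT x.url_pattern FROM application x WHERE x.dt = '20090609';

    -- Then map-join the small pruned table against the big table.
    SELECT /*+ MAPJOIN(a) */ a.url_pattern, w.url
    FROM application_pruned a JOIN web_log w
    WHERE w.logdate = '20090611' AND w.url RLIKE a.url_pattern;
    ```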

    You may still need to increase the mapper memory - I was able to
    load about 20k rows with a mapper heap of about 2GB.

    *From:* Min Zhou [mailto:[email protected]
    <mailto:[email protected]>]
    *Sent:* Sunday, June 14, 2009 11:02 PM
    *To:* [email protected] <mailto:[email protected]>
    *Subject:* Re: OutOfMemory when doing map-side join

    Btw, that small table 'application' has only one partition right
    now, with 20k rows.

    On Mon, Jun 15, 2009 at 1:59 PM, Min Zhou <[email protected]
    <mailto:[email protected]>> wrote:

    Failed with a null pointer exception:
    hive>select /*+ MAPJOIN(a) */ a.url_pattern, w.url from  (select
    x.url_pattern from application x where x.dt = '20090609') a join
    web_log w where w.logdate='20090611' and w.url rlike a.url_pattern;
    FAILED: Unknown exception : null


    $ cat /tmp/hive/hive.log | tail...

    2009-06-15 13:57:02,933 ERROR ql.Driver (SessionState.java:printError(279)) - FAILED: Unknown exception : null
    java.lang.NullPointerException
            at org.apache.hadoop.hive.ql.parse.QBMetaData.getTableForAlias(QBMetaData.java:76)
            at org.apache.hadoop.hive.ql.parse.PartitionPruner.getTableColumnDesc(PartitionPruner.java:284)
            at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:217)
            at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
            at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
            at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
            at org.apache.hadoop.hive.ql.parse.PartitionPruner.addExpression(PartitionPruner.java:377)
            at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPartitionPruners(SemanticAnalyzer.java:608)
            at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:3785)
            at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:76)
            at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:177)
            at org.apache.hadoop.hive.ql.Driver.run(Driver.java:209)
            at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
            at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:216)
            at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:309)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
            at java.lang.reflect.Method.invoke(Method.java:597)
            at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
            at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
            at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
            at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
            at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

    On Mon, Jun 15, 2009 at 1:52 PM, Namit Jain <[email protected]
    <mailto:[email protected]>> wrote:

    The problem seems to be in partition pruning -- the small table
    'application' is partitioned, and probably all 20k rows are in
    the partition 20090609.

    Due to a bug, the pruning is not happening, and all partitions of
    'application' are being loaded -- which may be too much for
    map-join to handle.
    This is a serious bug, but for now, can you put in a subquery and try:

    select /*+ MAPJOIN(a) */ a.url_pattern, w.url from  (select
    x.url_pattern from application x where x.dt = '20090609') a join
    web_log w where w.logdate='20090611' and w.url rlike a.url_pattern;


    Please file a JIRA for the above.





    On 6/14/09 10:20 PM, "Min Zhou" <[email protected]
    <mailto:[email protected]>> wrote:

        hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url
        from application a join web_log w where w.logdate='20090611'
        and w.url rlike a.url_pattern and a.dt='20090609';
        OK
        ABSTRACT SYNTAX TREE:
          (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a)
        (TOK_TABREF web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR
        TOK_TMP_FILE)) (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN
        (TOK_HINTARGLIST a))) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a)
        url_pattern)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL w) url)))
        (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w) logdate)
        '20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (.
        (TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a)
        dt) '20090609')))))

        STAGE DEPENDENCIES:
          Stage-1 is a root stage
          Stage-2 depends on stages: Stage-1
          Stage-0 is a root stage

        STAGE PLANS:
          Stage: Stage-1
            Map Reduce
              Alias -> Map Operator Tree:
                w
                    Select Operator
                      expressions:
                            expr: url
                            type: string
                            expr: logdate
                            type: string
                      Common Join Operator
                        condition map:
                             Inner Join 0 to 1
                        condition expressions:
                          0 {0} {1}
                          1 {0} {1}
                        keys:
                          0
                          1
                        Position of Big Table: 1
                        File Output Operator
                          compressed: false
                          GlobalTableId: 0
                          table:
                              input format:
        org.apache.hadoop.mapred.SequenceFileInputFormat
                              output format:
        org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
              Local Work:
                Map Reduce Local Work
                  Alias -> Map Local Tables:
                    a
                      Fetch Operator
                        limit: -1
                  Alias -> Map Local Operator Tree:
                    a
                        Select Operator
                          expressions:
                                expr: url_pattern
                                type: string
                                expr: dt
                                type: string
                          Common Join Operator
                            condition map:
                                 Inner Join 0 to 1
                            condition expressions:
                              0 {0} {1}
                              1 {0} {1}
                            keys:
                              0
                              1
                            Position of Big Table: 1
                            File Output Operator
                              compressed: false
                              GlobalTableId: 0
                              table:
                                  input format:
        org.apache.hadoop.mapred.SequenceFileInputFormat
                                  output format:
        org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

          Stage: Stage-2
            Map Reduce
              Alias -> Map Operator Tree:
                hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
                  Select Operator
                    Filter Operator
                      predicate:
                          expr: (((3 = '20090611') and (2 regexp 0))
        and (1 = '20090609'))
                          type: boolean
                      Select Operator
                        expressions:
                              expr: 0
                              type: string
                              expr: 2
                              type: string
                        File Output Operator
                          compressed: true
                          GlobalTableId: 0
                          table:
                              input format:
        org.apache.hadoop.mapred.TextInputFormat
                              output format:
        org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

          Stage: Stage-0
            Fetch Operator
              limit: -1

        On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain
        <[email protected] <mailto:[email protected]>> wrote:

        I was looking at the code -- and there may be a bug in the
        Cartesian product codepath for map-join.

        Can you do an explain plan and send it?






        On 6/14/09 10:06 PM, "Min Zhou" <[email protected]
        <mailto:[email protected]>> wrote:


        1. Tried setting hive.mapjoin.cache.numrows to 100; it failed
        with the same exception.
        2. Actually, we used to do the same thing by loading small
        tables into memory on each map node in plain map-reduce on the
        same cluster, where the same heap size was guaranteed for both
        the Hive map-side join and our map-reduce job. OOM exceptions
        never happened there: only about 1MB was spent loading those
        20k records while mapred.child.java.opts was set to -Xmx200m.

        here is the schema of our small table:
        > describe application;
        transaction_id  string
        subclass_id     string
        class_id        string
        memo            string
        url_alias       string
        url_pattern     string
        dt              string  (daily partitioned)

        Thanks,
        Min
        On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain
        <[email protected] <mailto:[email protected]>> wrote:

        1. Can you reduce the number of cached rows and try?

        2. Were you using the default memory settings for the mapper?
        If yes, can you increase them and try?

        It would be useful to try both of them independently -- it
        would give a good idea of the memory consumption of JDBM as well.
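
        For the first suggestion, the cached-row count can be lowered per-session (the value below is illustrative):

        ```sql
        -- Illustrative value: cache fewer small-table rows in memory per map-join.
        SET hive.mapjoin.cache.numrows=10000;
        ```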


        Can you send the exact schema/data of the small table if
        possible? You can file a JIRA and upload it there if it is not
        a security issue.

        Thanks,
        -namit



        On 6/14/09 9:23 PM, "Min Zhou" <[email protected]
        <mailto:[email protected]>> wrote:

        20k



--
My research interests are distributed systems, parallel computing
and bytecode based virtual machine.

My profile: http://www.linkedin.com/in/coderplay
My blog: http://coderplay.javaeye.com


