That does not sound right. Each row is 100MB - that sounds too much...

Ashish

________________________________
From: Min Zhou [mailto:[email protected]]
Sent: Monday, June 15, 2009 7:16 PM
To: [email protected]
Subject: Re: OutOfMemory when doing map-side join

20k rows need 2G memory?  so terrible.  The whole small table of mine is less 
than 4MB,   what about yours?

On Tue, Jun 16, 2009 at 6:59 AM, Namit Jain 
<[email protected]<mailto:[email protected]>> wrote:

Set  mapred.child.java.opts to increase mapper memory.









From: Namit Jain [mailto:[email protected]<mailto:[email protected]>]
Sent: Monday, June 15, 2009 3:53 PM

To: [email protected]<mailto:[email protected]>
Subject: RE: OutOfMemory when doing map-side join



There are multiple things going on.



Column pruning is not working with map-joins. It is being tracked at:



https://issues.apache.org/jira/browse/HIVE-560





Also, since it is a Cartesian product, jdbm does not help  - because a single 
key can be very large.





For now, you can do the column pruning yourself - create a new table with only 
the columns needed and then

join with the bigger table.



You may still need to increase the mapper memory -  I was able to load about 
20k rows with about 2G mapper.















From: Min Zhou [mailto:[email protected]<mailto:[email protected]>]
Sent: Sunday, June 14, 2009 11:02 PM
To: [email protected]<mailto:[email protected]>
Subject: Re: OutOfMemory when doing map-side join



btw, that small table 'application' has only one partition right now,  20k rows.

On Mon, Jun 15, 2009 at 1:59 PM, Min Zhou 
<[email protected]<mailto:[email protected]>> wrote:

failed with null pointer exception.
hive>select /*+ MAPJOIN(a) */ a.url_pattern, w.url from  (select x.url_pattern 
from application x where x.dt = '20090609') a join web_log w where 
w.logdate='20090611' and w.url rlike a.url_pattern;
FAILED: Unknown exception : null


$cat /tmp/hive/hive.log | tail...

2009-06-15 13:57:02,933 ERROR ql.Driver (SessionState.java:printError(279)) - 
FAILED: Unknown exception : null
java.lang.NullPointerException
        at 
org.apache.hadoop.hive.ql.parse.QBMetaData.getTableForAlias(QBMetaData.java:76)
        at 
org.apache.hadoop.hive.ql.parse.PartitionPruner.getTableColumnDesc(PartitionPruner.java:284)
        at 
org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:217)
        at 
org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
        at 
org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
        at 
org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
        at 
org.apache.hadoop.hive.ql.parse.PartitionPruner.addExpression(PartitionPruner.java:377)
        at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPartitionPruners(SemanticAnalyzer.java:608)
        at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:3785)
        at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:76)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:177)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:209)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:216)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:309)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)



On Mon, Jun 15, 2009 at 1:52 PM, Namit Jain 
<[email protected]<mailto:[email protected]>> wrote:

The problem seems to be in partition pruning - The small table 'application' is 
partitioned - and probably, there are 20k rows in the partition
20090609.

Due to a bug, the pruning is not happening, and all partitions of 'application' 
are being loaded - which may be too much for map-join to handle.
This is a serious bug, but for now can you put in a subquery and try -

select /*+ MAPJOIN(a) */ a.url_pattern, w.url from  (select x.url_pattern from 
application x where x.dt = '20090609') a join web_log w where 
w.logdate='20090611' and w.url rlike a.url_pattern;


Please file a JIRA for the above.




On 6/14/09 10:20 PM, "Min Zhou" 
<[email protected]<mailto:[email protected]>> wrote:

hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from application a 
join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern and 
a.dt='20090609';
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a) (TOK_TABREF web_log 
w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT 
(TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST a))) (TOK_SELEXPR (. 
(TOK_TABLE_OR_COL a) url_pattern)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL w) url))) 
(TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w) logdate) '20090611') (rlike (. 
(TOK_TABLE_OR_COL w) url) (. (TOK_TABLE_OR_COL a) url_pattern))) (= (. 
(TOK_TABLE_OR_COL a) dt) '20090609')))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        w
            Select Operator
              expressions:
                    expr: url
                    type: string
                    expr: logdate
                    type: string
              Common Join Operator
                condition map:
                     Inner Join 0 to 1
                condition expressions:
                  0 {0} {1}
                  1 {0} {1}
                keys:
                  0
                  1
                Position of Big Table: 1
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
      Local Work:
        Map Reduce Local Work
          Alias -> Map Local Tables:
            a
              Fetch Operator
                limit: -1
          Alias -> Map Local Operator Tree:
            a
                Select Operator
                  expressions:
                        expr: url_pattern
                        type: string
                        expr: dt
                        type: string
                  Common Join Operator
                    condition map:
                         Inner Join 0 to 1
                    condition expressions:
                      0 {0} {1}
                      1 {0} {1}
                    keys:
                      0
                      1
                    Position of Big Table: 1
                    File Output Operator
                      compressed: false
                      GlobalTableId: 0
                      table:
                          input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
          Select Operator
            Filter Operator
              predicate:
                  expr: (((3 = '20090611') and (2 regexp 0)) and (1 = 
'20090609'))
                  type: boolean
              Select Operator
                expressions:
                      expr: 0
                      type: string
                      expr: 2
                      type: string
                File Output Operator
                  compressed: true
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain 
<[email protected]<mailto:[email protected]>> wrote:

I was looking at the code - and there may be a bug in cartesian product 
codepath for map-join.

Can you do a explain plan and send it ?






On 6/14/09 10:06 PM, "Min Zhou" 
<[email protected]<mailto:[email protected]>> wrote:

1. tried setting hive.mapjoin.cache.numrows to be 100,  failed with the same 
exception.
2. Actually, we used to do the same thing by loading small tables into memory 
of each map node in normal map-reduce with the same cluster, where same heap 
size is guranteed between running hive map-side join and our map-reduce job.  
OOM exceptions never happened in that only 1MB would be spent to load those 20k 
pieces of records while mapred.child.java.opts was set to be -Xmx200m.

here is the schema of our small table:
> describe application;
transaction_id    string
subclass_id     string
class_id        string
memo string
url_alias    string
url_pattern     string
dt      string  (daily partitioned)

Thanks,
Min
On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain 
<[email protected]<mailto:[email protected]>> wrote:

1. Can you reduce the number of cached rows and try ?

2. Were you using default memory settings of the mapper ? If yes, can can 
increase it and try ?

It would be useful to try both of them independently - it would give a good 
idea of memory consumption of JDBM also.


Can you send the exact schema/data of the small table if possible ? You can 
file a jira and load it there if it not a security issue.

Thanks,
-namit



On 6/14/09 9:23 PM, "Min Zhou" 
<[email protected]<mailto:[email protected]>> wrote:

20k







--
My research interests are distributed systems, parallel computing and bytecode 
based virtual machine.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com



--
My research interests are distributed systems, parallel computing and bytecode 
based virtual machine.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com



--
My research interests are distributed systems, parallel computing and bytecode 
based virtual machine.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com

Reply via email to