Re: OutOfMemory when doing map-side join

Min Zhou Wed, 17 Jun 2009 18:25:42 -0700

what's your 100kb standed for?

On Thu, Jun 18, 2009 at 6:26 AM, Amr Awadallah <[email protected]> wrote:


>  hmm, that is a 100KB per my math.
>
> 20K * 100K = 2GB
>
> -- amr
>
>
> Ashish Thusoo wrote:
>
> That does not sound right. Each row is 100MB - that sounds too much...
>
> Ashish
>
>  ------------------------------
> *From:* Min Zhou [mailto:[email protected] <[email protected]>]
> *Sent:* Monday, June 15, 2009 7:16 PM
> *To:* [email protected]
> *Subject:* Re: OutOfMemory when doing map-side join
>
>  20k rows need 2G memory?  so terrible.  The whole small table of mine is
> less than 4MB,   what about yours?
>
> On Tue, Jun 16, 2009 at 6:59 AM, Namit Jain <[email protected]> wrote:
>
>>  Set  mapred.child.java.opts to increase mapper memory.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *From:* Namit Jain [mailto:[email protected]]
>> *Sent:* Monday, June 15, 2009 3:53 PM
>> *To:* [email protected]
>>  *Subject:* RE: OutOfMemory when doing map-side join
>>
>>
>>
>> There are multiple things going on.
>>
>>
>>
>> Column pruning is not working with map-joins. It is being tracked at:
>>
>>
>>
>> https://issues.apache.org/jira/browse/HIVE-560
>>
>>
>>
>>
>>
>> Also, since it is a Cartesian product, jdbm does not help  - because a
>> single key can be very large.
>>
>>
>>
>>
>>
>> For now, you can do the column pruning yourself – create a new table with
>> only the columns needed and then
>>
>> join with the bigger table.
>>
>>
>>
>> You may still need to increase the mapper memory -  I was able to load
>> about 20k rows with about 2G mapper.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *From:* Min Zhou [mailto:[email protected]]
>> *Sent:* Sunday, June 14, 2009 11:02 PM
>> *To:* [email protected]
>> *Subject:* Re: OutOfMemory when doing map-side join
>>
>>
>>
>> btw, that small table 'application' has only one partition right now,  20k
>> rows.
>>
>> On Mon, Jun 15, 2009 at 1:59 PM, Min Zhou <[email protected]> wrote:
>>
>> failed with null pointer exception.
>> hive>select /*+ MAPJOIN(a) */ a.url_pattern, w.url from  (select
>> x.url_pattern from application x where x.dt = '20090609') a join web_log w
>> where w.logdate='20090611' and w.url rlike a.url_pattern;
>> FAILED: Unknown exception : null
>>
>>
>> $cat /tmp/hive/hive.log | tail...
>>
>> 2009-06-15 13:57:02,933 ERROR ql.Driver
>> (SessionState.java:printError(279)) - FAILED: Unknown exception : null
>> java.lang.NullPointerException
>>         at
>> org.apache.hadoop.hive.ql.parse.QBMetaData.getTableForAlias(QBMetaData.java:76)
>>         at
>> org.apache.hadoop.hive.ql.parse.PartitionPruner.getTableColumnDesc(PartitionPruner.java:284)
>>         at
>> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:217)
>>         at
>> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
>>         at
>> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
>>         at
>> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
>>         at
>> org.apache.hadoop.hive.ql.parse.PartitionPruner.addExpression(PartitionPruner.java:377)
>>         at
>> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPartitionPruners(SemanticAnalyzer.java:608)
>>         at
>> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:3785)
>>         at
>> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:76)
>>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:177)
>>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:209)
>>         at
>> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
>>         at
>> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:216)
>>         at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:309)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>         at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
>>         at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>         at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>>
>>
>>
>> On Mon, Jun 15, 2009 at 1:52 PM, Namit Jain <[email protected]> wrote:
>>
>> The problem seems to be in partition pruning – The small table
>> ‘application’ is partitioned – and probably, there are 20k rows in the
>> partition
>> 20090609.
>>
>> Due to a bug, the pruning is not happening, and all partitions of
>> ‘application’ are being loaded – which may be too much for map-join to
>> handle.
>> This is a serious bug, but for now can you put in a subquery and try -
>>
>> select /*+ MAPJOIN(a) */ a.url_pattern, w.url from  (select x.url_pattern
>> from application x where x.dt = ‘20090609’) a join web_log w where
>> w.logdate='20090611' and w.url rlike a.url_pattern;
>>
>>
>> Please file a JIRA for the above.
>>
>>
>>
>>
>>
>> On 6/14/09 10:20 PM, "Min Zhou" <[email protected]> wrote:
>>
>> hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from
>> application a join web_log w where w.logdate='20090611' and w.url rlike
>> a.url_pattern and a.dt='20090609';
>> OK
>> ABSTRACT SYNTAX TREE:
>>   (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a) (TOK_TABREF
>> web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
>> (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST a)))
>> (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) url_pattern)) (TOK_SELEXPR (.
>> (TOK_TABLE_OR_COL w) url))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w)
>> logdate) '20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (.
>> (TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a) dt)
>> '20090609')))))
>>
>> STAGE DEPENDENCIES:
>>   Stage-1 is a root stage
>>   Stage-2 depends on stages: Stage-1
>>   Stage-0 is a root stage
>>
>> STAGE PLANS:
>>   Stage: Stage-1
>>     Map Reduce
>>       Alias -> Map Operator Tree:
>>         w
>>             Select Operator
>>               expressions:
>>                     expr: url
>>                     type: string
>>                     expr: logdate
>>                     type: string
>>               Common Join Operator
>>                 condition map:
>>                      Inner Join 0 to 1
>>                 condition expressions:
>>                   0 {0} {1}
>>                   1 {0} {1}
>>                 keys:
>>                   0
>>                   1
>>                 Position of Big Table: 1
>>                 File Output Operator
>>                   compressed: false
>>                   GlobalTableId: 0
>>                   table:
>>                       input format:
>> org.apache.hadoop.mapred.SequenceFileInputFormat
>>                       output format:
>> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>>       Local Work:
>>         Map Reduce Local Work
>>           Alias -> Map Local Tables:
>>             a
>>               Fetch Operator
>>                 limit: -1
>>           Alias -> Map Local Operator Tree:
>>             a
>>                 Select Operator
>>                   expressions:
>>                         expr: url_pattern
>>                         type: string
>>                         expr: dt
>>                         type: string
>>                   Common Join Operator
>>                     condition map:
>>                          Inner Join 0 to 1
>>                     condition expressions:
>>                       0 {0} {1}
>>                       1 {0} {1}
>>                     keys:
>>                       0
>>                       1
>>                     Position of Big Table: 1
>>                     File Output Operator
>>                       compressed: false
>>                       GlobalTableId: 0
>>                       table:
>>                           input format:
>> org.apache.hadoop.mapred.SequenceFileInputFormat
>>                           output format:
>> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>>
>>   Stage: Stage-2
>>     Map Reduce
>>       Alias -> Map Operator Tree:
>>         hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
>>           Select Operator
>>             Filter Operator
>>               predicate:
>>                   expr: (((3 = '20090611') and (2 regexp 0)) and (1 =
>> '20090609'))
>>                   type: boolean
>>               Select Operator
>>                 expressions:
>>                       expr: 0
>>                       type: string
>>                       expr: 2
>>                       type: string
>>                 File Output Operator
>>                   compressed: true
>>                   GlobalTableId: 0
>>                   table:
>>                       input format:
>> org.apache.hadoop.mapred.TextInputFormat
>>                       output format:
>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>
>>   Stage: Stage-0
>>     Fetch Operator
>>       limit: -1
>>
>> On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain <[email protected]> wrote:
>>
>> I was looking at the code – and there may be a bug in cartesian product
>> codepath for map-join.
>>
>> Can you do a explain plan and send it ?
>>
>>
>>
>>
>>
>>
>> On 6/14/09 10:06 PM, "Min Zhou" <[email protected]> wrote:
>>
>>
>> 1. tried setting hive.mapjoin.cache.numrows to be 100,  failed with the
>> same exception.
>> 2. Actually, we used to do the same thing by loading small tables into
>> memory of each map node in normal map-reduce with the same cluster, where
>> same heap size is guranteed between running hive map-side join and our
>> map-reduce job.  OOM exceptions never happened in that only 1MB would be
>> spent to load those 20k pieces of records while mapred.child.java.opts was
>> set to be -Xmx200m.
>>
>> here is the schema of our small table:
>> > describe application;
>> transaction_id    string
>> subclass_id     string
>> class_id        string
>> memo string
>> url_alias    string
>> url_pattern     string
>> dt      string  (daily partitioned)
>>
>> Thanks,
>> Min
>> On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <[email protected]> wrote:
>>
>> 1. Can you reduce the number of cached rows and try ?
>>
>> 2. Were you using default memory settings of the mapper ? If yes, can can
>> increase it and try ?
>>
>> It would be useful to try both of them independently – it would give a
>> good idea of memory consumption of JDBM also.
>>
>>
>> Can you send the exact schema/data of the small table if possible ? You
>> can file a jira and load it there if it not a security issue.
>>
>> Thanks,
>> -namit
>>
>>
>>
>> On 6/14/09 9:23 PM, "Min Zhou" <[email protected]> wrote:
>>
>> 20k
>>
>>
>>
>>
>>
>>
>>
>>    --
>> My research interests are distributed systems, parallel computing and
>> bytecode based virtual machine.
>>
>> My profile:
>> http://www.linkedin.com/in/coderplay
>> My blog:
>> http://coderplay.javaeye.com
>>
>>
>>
>>
>> --
>> My research interests are distributed systems, parallel computing and
>> bytecode based virtual machine.
>>
>> My profile:
>> http://www.linkedin.com/in/coderplay
>> My blog:
>> http://coderplay.javaeye.com
>>
>
>
>
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
>
>


-- 
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com

Re: OutOfMemory when doing map-side join

Reply via email to