I didn't even know what the 100K you two referred to meant.

On Thu, Jun 18, 2009 at 9:32 AM, Ashish Thusoo <[email protected]> wrote:

> My bad. Yes, it is 100K. That sounds like a bit too much as well.
>
> Ashish
>
>  ------------------------------
> *From:* Min Zhou [mailto:[email protected]]
> *Sent:* Wednesday, June 17, 2009 6:25 PM
>
> *To:* [email protected]
> *Subject:* Re: OutOfMemory when doing map-side join
>
> What does your 100KB stand for?
>
> On Thu, Jun 18, 2009 at 6:26 AM, Amr Awadallah <[email protected]> wrote:
>
>> hmm, that is 100KB per row by my math.
>>
>> 20K rows * 100KB = 2GB
>>
>> -- amr
>>
>>
>> Ashish Thusoo wrote:
>>
>> That does not sound right. Each row is 100MB? That sounds like too much...
>>
>> Ashish
>>
>>  ------------------------------
>> *From:* Min Zhou [mailto:[email protected]]
>> *Sent:* Monday, June 15, 2009 7:16 PM
>> *To:* [email protected]
>> *Subject:* Re: OutOfMemory when doing map-side join
>>
>> 20K rows need 2GB of memory? That's terrible. The whole small table of mine
>> is less than 4MB; what about yours?
>>
>> On Tue, Jun 16, 2009 at 6:59 AM, Namit Jain <[email protected]> wrote:
>>
>>> Set mapred.child.java.opts to increase the mapper memory.
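>>>
>>> For example, from the Hive CLI (the 2G heap below is only an illustrative
>>> value; size it to fit your small table):
>>>
>>> set mapred.child.java.opts=-Xmx2048m;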
>>>
>>> *From:* Namit Jain [mailto:[email protected]]
>>> *Sent:* Monday, June 15, 2009 3:53 PM
>>> *To:* [email protected]
>>> *Subject:* RE: OutOfMemory when doing map-side join
>>>
>>>
>>>
>>> There are multiple things going on.
>>>
>>>
>>>
>>> Column pruning is not working with map-joins. It is being tracked at:
>>>
>>>
>>>
>>> https://issues.apache.org/jira/browse/HIVE-560
>>>
>>> Also, since it is a Cartesian product, JDBM does not help, because a
>>> single key can be very large.
>>>
>>> For now, you can do the column pruning yourself: create a new table with
>>> only the columns needed and then join it with the bigger table.
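>>>
>>> A minimal sketch of that workaround (the table name application_small is
>>> just an illustration):
>>>
>>> create table application_small (url_pattern string);
>>>
>>> insert overwrite table application_small
>>> select x.url_pattern from application x where x.dt = '20090609';
>>>
>>> select /*+ MAPJOIN(a) */ a.url_pattern, w.url
>>> from application_small a join web_log w
>>> where w.logdate = '20090611' and w.url rlike a.url_pattern;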
>>>
>>>
>>>
>>> You may still need to increase the mapper memory; I was able to load
>>> about 20K rows with about a 2G mapper heap.
>>>
>>> *From:* Min Zhou [mailto:[email protected]]
>>> *Sent:* Sunday, June 14, 2009 11:02 PM
>>> *To:* [email protected]
>>> *Subject:* Re: OutOfMemory when doing map-side join
>>>
>>>
>>>
>>> BTW, that small table 'application' has only one partition right now,
>>> with 20K rows.
>>>
>>> On Mon, Jun 15, 2009 at 1:59 PM, Min Zhou <[email protected]> wrote:
>>>
>>> It failed with a null pointer exception:
>>>
>>> hive> select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select
>>> x.url_pattern from application x where x.dt = '20090609') a join web_log w
>>> where w.logdate='20090611' and w.url rlike a.url_pattern;
>>> FAILED: Unknown exception : null
>>>
>>>
>>> $ cat /tmp/hive/hive.log | tail ...
>>>
>>> 2009-06-15 13:57:02,933 ERROR ql.Driver
>>> (SessionState.java:printError(279)) - FAILED: Unknown exception : null
>>> java.lang.NullPointerException
>>>         at
>>> org.apache.hadoop.hive.ql.parse.QBMetaData.getTableForAlias(QBMetaData.java:76)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.PartitionPruner.getTableColumnDesc(PartitionPruner.java:284)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:217)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.PartitionPruner.addExpression(PartitionPruner.java:377)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPartitionPruners(SemanticAnalyzer.java:608)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:3785)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:76)
>>>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:177)
>>>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:209)
>>>         at
>>> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
>>>         at
>>> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:216)
>>>         at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:309)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>         at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>         at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
>>>         at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>         at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>>>
>>>
>>>
>>> On Mon, Jun 15, 2009 at 1:52 PM, Namit Jain <[email protected]> wrote:
>>>
>>> The problem seems to be in partition pruning. The small table
>>> 'application' is partitioned, and probably there are 20K rows in the
>>> partition 20090609.
>>>
>>> Due to a bug, the pruning is not happening, and all partitions of
>>> 'application' are being loaded, which may be too much for map-join to
>>> handle. This is a serious bug, but for now can you put in a subquery and
>>> try:
>>>
>>> select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select x.url_pattern
>>> from application x where x.dt = '20090609') a join web_log w where
>>> w.logdate='20090611' and w.url rlike a.url_pattern;
>>>
>>>
>>> Please file a JIRA for the above.
>>>
>>> On 6/14/09 10:20 PM, "Min Zhou" <[email protected]> wrote:
>>>
>>> hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from
>>> application a join web_log w where w.logdate='20090611' and w.url rlike
>>> a.url_pattern and a.dt='20090609';
>>> OK
>>> ABSTRACT SYNTAX TREE:
>>>   (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a) (TOK_TABREF
>>> web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
>>> (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST a)))
>>> (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) url_pattern)) (TOK_SELEXPR (.
>>> (TOK_TABLE_OR_COL w) url))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w)
>>> logdate) '20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (.
>>> (TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a) dt)
>>> '20090609')))))
>>>
>>> STAGE DEPENDENCIES:
>>>   Stage-1 is a root stage
>>>   Stage-2 depends on stages: Stage-1
>>>   Stage-0 is a root stage
>>>
>>> STAGE PLANS:
>>>   Stage: Stage-1
>>>     Map Reduce
>>>       Alias -> Map Operator Tree:
>>>         w
>>>             Select Operator
>>>               expressions:
>>>                     expr: url
>>>                     type: string
>>>                     expr: logdate
>>>                     type: string
>>>               Common Join Operator
>>>                 condition map:
>>>                      Inner Join 0 to 1
>>>                 condition expressions:
>>>                   0 {0} {1}
>>>                   1 {0} {1}
>>>                 keys:
>>>                   0
>>>                   1
>>>                 Position of Big Table: 1
>>>                 File Output Operator
>>>                   compressed: false
>>>                   GlobalTableId: 0
>>>                   table:
>>>                       input format:
>>> org.apache.hadoop.mapred.SequenceFileInputFormat
>>>                       output format:
>>> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>>>       Local Work:
>>>         Map Reduce Local Work
>>>           Alias -> Map Local Tables:
>>>             a
>>>               Fetch Operator
>>>                 limit: -1
>>>           Alias -> Map Local Operator Tree:
>>>             a
>>>                 Select Operator
>>>                   expressions:
>>>                         expr: url_pattern
>>>                         type: string
>>>                         expr: dt
>>>                         type: string
>>>                   Common Join Operator
>>>                     condition map:
>>>                          Inner Join 0 to 1
>>>                     condition expressions:
>>>                       0 {0} {1}
>>>                       1 {0} {1}
>>>                     keys:
>>>                       0
>>>                       1
>>>                     Position of Big Table: 1
>>>                     File Output Operator
>>>                       compressed: false
>>>                       GlobalTableId: 0
>>>                       table:
>>>                           input format:
>>> org.apache.hadoop.mapred.SequenceFileInputFormat
>>>                           output format:
>>> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>>>
>>>   Stage: Stage-2
>>>     Map Reduce
>>>       Alias -> Map Operator Tree:
>>>         hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
>>>           Select Operator
>>>             Filter Operator
>>>               predicate:
>>>                   expr: (((3 = '20090611') and (2 regexp 0)) and (1 =
>>> '20090609'))
>>>                   type: boolean
>>>               Select Operator
>>>                 expressions:
>>>                       expr: 0
>>>                       type: string
>>>                       expr: 2
>>>                       type: string
>>>                 File Output Operator
>>>                   compressed: true
>>>                   GlobalTableId: 0
>>>                   table:
>>>                       input format:
>>> org.apache.hadoop.mapred.TextInputFormat
>>>                       output format:
>>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>
>>>   Stage: Stage-0
>>>     Fetch Operator
>>>       limit: -1
>>>
>>> On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain <[email protected]> wrote:
>>>
>>> I was looking at the code, and there may be a bug in the Cartesian
>>> product codepath for map-join.
>>>
>>> Can you do an explain plan and send it?
>>>
>>> On 6/14/09 10:06 PM, "Min Zhou" <[email protected]> wrote:
>>>
>>>
>>> 1. I tried setting hive.mapjoin.cache.numrows to 100; it failed with the
>>> same exception.
>>> 2. Actually, we used to do the same thing by loading small tables into the
>>> memory of each map node in a plain map-reduce job on the same cluster, so
>>> the same heap size is guaranteed between running the Hive map-side join
>>> and our map-reduce job. OOM exceptions never happened there: only about
>>> 1MB was needed to load those 20K records while mapred.child.java.opts was
>>> set to -Xmx200m.
>>>
>>> Here is the schema of our small table:
>>>
>>> > describe application;
>>> transaction_id  string
>>> subclass_id     string
>>> class_id        string
>>> memo            string
>>> url_alias       string
>>> url_pattern     string
>>> dt              string  (daily partitioned)
>>>
>>> Thanks,
>>> Min
>>> On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <[email protected]> wrote:
>>>
>>> 1. Can you reduce the number of cached rows and try?
>>>
>>> 2. Were you using the default memory settings of the mapper? If yes, can
>>> you increase them and try?
>>>
>>> It would be useful to try both of them independently; that would also give
>>> a good idea of the memory consumption of JDBM.
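>>>
>>> For example, from the Hive CLI (both values below are only illustrative
>>> starting points, not recommendations):
>>>
>>> -- cache fewer rows in memory before spilling to JDBM
>>> set hive.mapjoin.cache.numrows=10000;
>>> -- give each mapper a larger heap
>>> set mapred.child.java.opts=-Xmx1024m;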
>>>
>>>
>>> Can you send the exact schema/data of the small table if possible? You
>>> can file a JIRA and attach it there if it is not a security issue.
>>>
>>> Thanks,
>>> -namit
>>>
>>>
>>>
>>> On 6/14/09 9:23 PM, "Min Zhou" <[email protected]> wrote:
>>>
>>> 20k
>>>
>>>  --
>>> My research interests are distributed systems, parallel computing and
>>> bytecode based virtual machine.
>>>
>>> My profile:
>>> http://www.linkedin.com/in/coderplay
>>> My blog:
>>> http://coderplay.javaeye.com
>>>
>>>
>>>
>>>
>>> --
>>> My research interests are distributed systems, parallel computing and
>>> bytecode based virtual machine.
>>>
>>> My profile:
>>> http://www.linkedin.com/in/coderplay
>>> My blog:
>>> http://coderplay.javaeye.com
>>>
>>
>>
>>
>> --
>> My research interests are distributed systems, parallel computing and
>> bytecode based virtual machine.
>>
>> My profile:
>> http://www.linkedin.com/in/coderplay
>> My blog:
>> http://coderplay.javaeye.com
>>
>>
>
>
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
>



-- 
My research interests are distributed systems, parallel computing, and
bytecode-based virtual machines.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
