That does not sound right. 2G for 20k rows is about 100KB per
row - that sounds like too much...
Ashish
------------------------------------------------------------------------
*From:* Min Zhou [mailto:[email protected]]
*Sent:* Monday, June 15, 2009 7:16 PM
*To:* [email protected]
*Subject:* Re: OutOfMemory when doing map-side join
20k rows need 2G of memory? That's terrible. The whole small
table of mine is less than 4MB - what about yours?
On Tue, Jun 16, 2009 at 6:59 AM, Namit Jain <[email protected]> wrote:
Set mapred.child.java.opts to increase mapper memory.
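For example, from the Hive CLI before running the query (a
sketch; the 2G heap below is only an illustrative value):

hive> set mapred.child.java.opts=-Xmx2048m;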
*From:* Namit Jain [mailto:[email protected]]
*Sent:* Monday, June 15, 2009 3:53 PM
*To:* [email protected] <mailto:[email protected]>
*Subject:* RE: OutOfMemory when doing map-side join
There are multiple things going on.
Column pruning is not working with map-joins. It is being tracked at:
https://issues.apache.org/jira/browse/HIVE-560
Also, since it is a Cartesian product (the rlike predicate leaves
no equi-join keys, so every row falls under a single empty join
key), jdbm does not help - because a single key can be very large.
For now, you can do the column pruning yourself -- create a new
table with only the columns needed, and then join it with the
bigger table.
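A rough sketch of that workaround (application_slim is a made-up
name here; the columns follow the schema Min posted further down
the thread):

hive> CREATE TABLE application_slim (url_pattern STRING)
      PARTITIONED BY (dt STRING);
hive> INSERT OVERWRITE TABLE application_slim
      PARTITION (dt='20090609')
      SELECT x.url_pattern FROM application x
      WHERE x.dt = '20090609';

Then use application_slim in place of application in the map-join
query.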
You may still need to increase the mapper memory - I was able to
load about 20k rows with about a 2G mapper heap.
*From:* Min Zhou [mailto:[email protected]]
*Sent:* Sunday, June 14, 2009 11:02 PM
*To:* [email protected] <mailto:[email protected]>
*Subject:* Re: OutOfMemory when doing map-side join
Btw, that small table 'application' has only one partition right
now, with 20k rows.
On Mon, Jun 15, 2009 at 1:59 PM, Min Zhou <[email protected]> wrote:
It failed with a null pointer exception.
hive>select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select
x.url_pattern from application x where x.dt = '20090609') a join
web_log w where w.logdate='20090611' and w.url rlike a.url_pattern;
FAILED: Unknown exception : null
$cat /tmp/hive/hive.log | tail...
2009-06-15 13:57:02,933 ERROR ql.Driver (SessionState.java:printError(279)) - FAILED: Unknown exception : null
java.lang.NullPointerException
        at org.apache.hadoop.hive.ql.parse.QBMetaData.getTableForAlias(QBMetaData.java:76)
        at org.apache.hadoop.hive.ql.parse.PartitionPruner.getTableColumnDesc(PartitionPruner.java:284)
        at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:217)
        at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
        at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
        at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
        at org.apache.hadoop.hive.ql.parse.PartitionPruner.addExpression(PartitionPruner.java:377)
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPartitionPruners(SemanticAnalyzer.java:608)
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:3785)
        at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:76)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:177)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:209)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:216)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:309)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
On Mon, Jun 15, 2009 at 1:52 PM, Namit Jain <[email protected]> wrote:
The problem seems to be in partition pruning -- the small table
'application' is partitioned, and there are probably 20k rows in
the partition 20090609.
Due to a bug, the pruning is not happening, and all partitions of
'application' are being loaded -- which may be too much for
map-join to handle.
This is a serious bug, but for now, can you put in a subquery and try:
select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select
x.url_pattern from application x where x.dt = '20090609') a join
web_log w where w.logdate='20090611' and w.url rlike a.url_pattern;
Please file a JIRA for the above.
On 6/14/09 10:20 PM, "Min Zhou" <[email protected]> wrote:
hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url
from application a join web_log w where w.logdate='20090611'
and w.url rlike a.url_pattern and a.dt='20090609';
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a)
(TOK_TABREF web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR
TOK_TMP_FILE)) (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN
(TOK_HINTARGLIST a))) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a)
url_pattern)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL w) url)))
(TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w) logdate)
'20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (.
(TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a)
dt) '20090609')))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
w
Select Operator
expressions:
expr: url
type: string
expr: logdate
type: string
Common Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {0} {1}
1 {0} {1}
keys:
0
1
Position of Big Table: 1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Local Work:
Map Reduce Local Work
Alias -> Map Local Tables:
a
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
a
Select Operator
expressions:
expr: url_pattern
type: string
expr: dt
type: string
Common Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {0} {1}
1 {0} {1}
keys:
0
1
Position of Big Table: 1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
Select Operator
Filter Operator
predicate:
expr: (((3 = '20090611') and (2 regexp 0))
and (1 = '20090609'))
type: boolean
Select Operator
expressions:
expr: 0
type: string
expr: 2
type: string
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format:
org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain <[email protected]> wrote:
I was looking at the code -- there may be a bug in the Cartesian
product codepath for map-join.
Can you do an explain plan and send it?
On 6/14/09 10:06 PM, "Min Zhou" <[email protected]> wrote:
1. Tried setting hive.mapjoin.cache.numrows to 100; it failed
with the same exception.
2. Actually, we used to do the same thing in plain map-reduce on
the same cluster, loading the small table into the memory of each
map node, with the same heap size guaranteed for both the Hive
map-side join and our map-reduce job. OOM exceptions never
happened there: only about 1MB was needed to load those 20k
records, with mapred.child.java.opts set to -Xmx200m.
Here is the schema of our small table:
> describe application;
transaction_id string
subclass_id string
class_id string
memo string
url_alias string
url_pattern string
dt string (daily partitioned)
Thanks,
Min
On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <[email protected]> wrote:
1. Can you reduce the number of cached rows and try?
2. Were you using the default memory settings for the mapper? If
yes, can you increase them and try?
It would be useful to try both of them independently -- it would
also give a good idea of JDBM's memory consumption.
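For suggestion 1, the row cache can be lowered from the Hive CLI
before running the query (a sketch; the value below is only an
illustration):

hive> set hive.mapjoin.cache.numrows=1000;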
Can you send the exact schema/data of the small table if
possible? You can file a JIRA and upload the data there if it is
not a security issue.
Thanks,
-namit
On 6/14/09 9:23 PM, "Min Zhou" <[email protected]> wrote:
20k
--
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com