Hi all,
we had a query joining two tables, one of which had about 1 billion records
while the other had fewer than 20k. Below is our query:

set hive.mapjoin.cache.numrows=20000;
select /*+ MAPJOIN(a) */ a.url_pattern, w.url from application a join
web_log w where w.url rlike a.url_pattern;

Here is the stack trace:

java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2786)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
        at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1838)
        at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1809)
        at java.io.ObjectOutputStream.write(ObjectOutputStream.java:681)
        at org.apache.hadoop.io.Text.write(Text.java:282)
        at org.apache.hadoop.hive.ql.exec.MapJoinObjectValue.writeExternal(MapJoinObjectValue.java:126)
        at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1421)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1390)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:326)
        at org.apache.hadoop.hive.ql.util.jdbm.htree.HashBucket.writeExternal(HashBucket.java:292)
        at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1421)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1390)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:326)
        at org.apache.hadoop.hive.ql.util.jdbm.helper.Serialization.serialize(Serialization.java:93)
        at org.apache.hadoop.hive.ql.util.jdbm.helper.DefaultSerializer.serialize(DefaultSerializer.java:101)
        at org.apache.hadoop.hive.ql.util.jdbm.recman.BaseRecordManager.insert(BaseRecordManager.java:242)
        at org.apache.hadoop.hive.ql.util.jdbm.recman.CacheRecordManager.insert(CacheRecordManager.java:176)
        at org.apache.hadoop.hive.ql.util.jdbm.recman.CacheRecordManager.insert(CacheRecordManager.java:159)
        at org.apache.hadoop.hive.ql.util.jdbm.htree.HashDirectory.put(HashDirectory.java:249)
        at org.apache.hadoop.hive.ql.util.jdbm.htree.HTree.put(HTree.java:147)
        at org.apache.hadoop.hive.ql.exec.MapJoinOperator.process(MapJoinOperator.java:305)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:492)
        at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:76)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:492)
        at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:49)
        at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:121)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)

java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOfRange(Arrays.java:3209)
        at java.lang.String.<init>(String.java:216)
        at java.lang.StringBuffer.toString(StringBuffer.java:585)
        at org.apache.log4j.PatternLayout.format(PatternLayout.java:505)
        at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:302)
        at org.apache.log4j.WriterAppender.append(WriterAppender.java:160)
        at org.apache.hadoop.mapred.TaskLogAppender.append(TaskLogAppender.java:55)
        at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
        at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
        at org.apache.log4j.Category.callAppenders(Category.java:206)
        at org.apache.log4j.Category.forcedLog(Category.java:391)
        at org.apache.log4j.Category.log(Category.java:856)
        at org.apache.commons.logging.impl.Log4JLogger.info(Log4JLogger.java:133)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:387)
        at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:173)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)


Actually, we used to do the same thing in a plain MapReduce job by loading
the small table into memory on each map node. OOM exceptions never happened
there, since only about 1 MB of heap was needed to hold those 20k records.
Can you give me an explanation of why the map join runs out of memory?
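
For reference, here is a minimal sketch of what that plain MapReduce version
looked like. The class name, the local file name "application.txt" and the
assumption that the small table is shipped via DistributedCache are
simplifications, not our exact code; it uses the old org.apache.hadoop.mapred
API that shows up in the stack traces above:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class UrlPatternJoinMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // ~20k compiled patterns; at roughly 50 bytes per pattern string the raw
  // data is on the order of 1 MB of heap, which is why we never saw OOM here.
  private final List<Pattern> patterns = new ArrayList<Pattern>();

  public void configure(JobConf job) {
    // "application.txt" is a hypothetical local copy of the small table,
    // e.g. shipped to every map node via DistributedCache.
    try {
      BufferedReader in = new BufferedReader(new FileReader("application.txt"));
      String line;
      while ((line = in.readLine()) != null) {
        patterns.add(Pattern.compile(line.trim()));
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("failed to load the small table", e);
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    // Each input line of web_log is a URL; emit (pattern, url) for every
    // pattern it matches, mirroring the rlike condition of the Hive query.
    String url = value.toString();
    for (Pattern p : patterns) {
      if (p.matcher(url).find()) {
        out.collect(new Text(p.pattern()), new Text(url));
      }
    }
  }
}

Since the ~20k patterns sit in a single in-memory list per map JVM, the
per-task memory cost stays close to the raw size of the small table.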

Thanks,
Min
-- 
My research interests are distributed systems, parallel computing and
bytecode-based virtual machines.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
