I still don't understand what you two meant by that 100K figure.
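A rough check of the arithmetic, taking Namit's data point below at face
value: 2GB of mapper heap for 20,000 rows works out to

    2GB / 20,000 rows = 100KB per row

Min's table, by contrast, is under 4MB raw, i.e. about 200 bytes per row,
which suggests most of the 2GB is per-row map-join overhead rather than the
data itself.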
On Thu, Jun 18, 2009 at 9:32 AM, Ashish Thusoo <[email protected]> wrote:

> My bad, yes, it is 100K. That sounds a bit too much as well.
>
> Ashish
>
> ------------------------------
> From: Min Zhou [mailto:[email protected]]
> Sent: Wednesday, June 17, 2009 6:25 PM
> To: [email protected]
> Subject: Re: OutOfMemory when doing map-side join
>
> What does your 100KB stand for?
>
> On Thu, Jun 18, 2009 at 6:26 AM, Amr Awadallah <[email protected]> wrote:
>
>> Hmm, that is 100KB per my math:
>>
>> 20K rows * 100KB = 2GB
>>
>> -- amr
>>
>> Ashish Thusoo wrote:
>>
>> That does not sound right. Each row is 100MB - that sounds like too
>> much...
>>
>> Ashish
>>
>> ------------------------------
>> From: Min Zhou [mailto:[email protected]]
>> Sent: Monday, June 15, 2009 7:16 PM
>> To: [email protected]
>> Subject: Re: OutOfMemory when doing map-side join
>>
>> 20K rows need 2GB of memory? That's terrible. My whole small table is
>> less than 4MB; what about yours?
>>
>> On Tue, Jun 16, 2009 at 6:59 AM, Namit Jain <[email protected]> wrote:
>>
>>> Set mapred.child.java.opts to increase mapper memory.
>>>
>>> From: Namit Jain [mailto:[email protected]]
>>> Sent: Monday, June 15, 2009 3:53 PM
>>> To: [email protected]
>>> Subject: RE: OutOfMemory when doing map-side join
>>>
>>> There are multiple things going on.
>>>
>>> Column pruning is not working with map-joins. It is being tracked at:
>>>
>>> https://issues.apache.org/jira/browse/HIVE-560
>>>
>>> Also, since it is a Cartesian product, jdbm does not help, because a
>>> single key can be very large.
>>>
>>> For now, you can do the column pruning yourself: create a new table
>>> with only the columns needed and then join it with the bigger table.
>>>
>>> You may still need to increase the mapper memory - I was able to load
>>> about 20K rows with about a 2GB mapper.
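A minimal sketch of that workaround, assuming Min's 'application' schema
shown further down; the table name application_pruned and the -Xmx value
are illustrative, not recommendations:

hive> create table application_pruned (url_pattern string);
hive> insert overwrite table application_pruned
    >   select x.url_pattern from application x where x.dt = '20090609';
hive> set mapred.child.java.opts=-Xmx2048m;
hive> select /*+ MAPJOIN(a) */ a.url_pattern, w.url
    >   from application_pruned a join web_log w
    >   where w.logdate = '20090611' and w.url rlike a.url_pattern;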
>>> From: Min Zhou [mailto:[email protected]]
>>> Sent: Sunday, June 14, 2009 11:02 PM
>>> To: [email protected]
>>> Subject: Re: OutOfMemory when doing map-side join
>>>
>>> Btw, that small table 'application' has only one partition right now,
>>> 20K rows.
>>>
>>> On Mon, Jun 15, 2009 at 1:59 PM, Min Zhou <[email protected]> wrote:
>>>
>>> It failed with a null pointer exception:
>>>
>>> hive> select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select
>>> x.url_pattern from application x where x.dt = '20090609') a join web_log w
>>> where w.logdate='20090611' and w.url rlike a.url_pattern;
>>> FAILED: Unknown exception : null
>>>
>>> $ cat /tmp/hive/hive.log | tail ...
>>>
>>> 2009-06-15 13:57:02,933 ERROR ql.Driver (SessionState.java:printError(279)) - FAILED: Unknown exception : null
>>> java.lang.NullPointerException
>>>     at org.apache.hadoop.hive.ql.parse.QBMetaData.getTableForAlias(QBMetaData.java:76)
>>>     at org.apache.hadoop.hive.ql.parse.PartitionPruner.getTableColumnDesc(PartitionPruner.java:284)
>>>     at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:217)
>>>     at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
>>>     at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
>>>     at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
>>>     at org.apache.hadoop.hive.ql.parse.PartitionPruner.addExpression(PartitionPruner.java:377)
>>>     at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPartitionPruners(SemanticAnalyzer.java:608)
>>>     at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:3785)
>>>     at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:76)
>>>     at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:177)
>>>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:209)
>>>     at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
>>>     at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:216)
>>>     at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:309)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
>>>     at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>     at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>>>
>>> On Mon, Jun 15, 2009 at 1:52 PM, Namit Jain <[email protected]> wrote:
>>>
>>> The problem seems to be in partition pruning - the small table
>>> 'application' is partitioned, and probably there are 20K rows in
>>> partition 20090609.
>>>
>>> Due to a bug, the pruning is not happening, and all partitions of
>>> 'application' are being loaded - which may be too much for map-join to
>>> handle. This is a serious bug, but for now, can you put in a subquery
>>> and try:
>>>
>>> select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select x.url_pattern
>>> from application x where x.dt = '20090609') a join web_log w where
>>> w.logdate='20090611' and w.url rlike a.url_pattern;
>>>
>>> Please file a JIRA for the above.
>>>
>>> On 6/14/09 10:20 PM, "Min Zhou" <[email protected]> wrote:
>>>
>>> hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from
>>> application a join web_log w where w.logdate='20090611' and w.url rlike
>>> a.url_pattern and a.dt='20090609';
>>> OK
>>> ABSTRACT SYNTAX TREE:
>>> (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a) (TOK_TABREF
>>> web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
>>> (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST a)))
>>> (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) url_pattern)) (TOK_SELEXPR (.
>>> (TOK_TABLE_OR_COL w) url))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w)
>>> logdate) '20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (.
>>> (TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a) dt)
>>> '20090609')))))
>>>
>>> STAGE DEPENDENCIES:
>>>   Stage-1 is a root stage
>>>   Stage-2 depends on stages: Stage-1
>>>   Stage-0 is a root stage
>>>
>>> STAGE PLANS:
>>>   Stage: Stage-1
>>>     Map Reduce
>>>       Alias -> Map Operator Tree:
>>>         w
>>>           Select Operator
>>>             expressions:
>>>                   expr: url
>>>                   type: string
>>>                   expr: logdate
>>>                   type: string
>>>             Common Join Operator
>>>               condition map:
>>>                    Inner Join 0 to 1
>>>               condition expressions:
>>>                 0 {0} {1}
>>>                 1 {0} {1}
>>>               keys:
>>>                 0
>>>                 1
>>>               Position of Big Table: 1
>>>               File Output Operator
>>>                 compressed: false
>>>                 GlobalTableId: 0
>>>                 table:
>>>                     input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>>>                     output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>>>       Local Work:
>>>         Map Reduce Local Work
>>>           Alias -> Map Local Tables:
>>>             a
>>>               Fetch Operator
>>>                 limit: -1
>>>           Alias -> Map Local Operator Tree:
>>>             a
>>>               Select Operator
>>>                 expressions:
>>>                       expr: url_pattern
>>>                       type: string
>>>                       expr: dt
>>>                       type: string
>>>                 Common Join Operator
>>>                   condition map:
>>>                        Inner Join 0 to 1
>>>                   condition expressions:
>>>                     0 {0} {1}
>>>                     1 {0} {1}
>>>                   keys:
>>>                     0
>>>                     1
>>>                   Position of Big Table: 1
>>>                   File Output Operator
>>>                     compressed: false
>>>                     GlobalTableId: 0
>>>                     table:
>>>                         input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>>>                         output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>>>
>>>   Stage: Stage-2
>>>     Map Reduce
>>>       Alias -> Map Operator Tree:
>>>         hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
>>>           Select Operator
>>>             Filter Operator
>>>               predicate:
>>>                   expr: (((3 = '20090611') and (2 regexp 0)) and (1 = '20090609'))
>>>                   type: boolean
>>>               Select Operator
>>>                 expressions:
>>>                       expr: 0
>>>                       type: string
>>>                       expr: 2
>>>                       type: string
>>>                 File Output Operator
>>>                   compressed: true
>>>                   GlobalTableId: 0
>>>                   table:
>>>                       input format: org.apache.hadoop.mapred.TextInputFormat
>>>                       output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>
>>>   Stage: Stage-0
>>>     Fetch Operator
>>>       limit: -1
>>>
>>> On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain <[email protected]> wrote:
>>>
>>> I was looking at the code, and there may be a bug in the Cartesian
>>> product codepath for map-join.
>>>
>>> Can you do an explain plan and send it?
>>>
>>> On 6/14/09 10:06 PM, "Min Zhou" <[email protected]> wrote:
>>>
>>> 1. I tried setting hive.mapjoin.cache.numrows to 100; it failed with
>>> the same exception.
>>> 2. Actually, we used to do the same thing by loading small tables into
>>> the memory of each map node in a plain map-reduce job on the same
>>> cluster, where the same heap size is guaranteed as when running the Hive
>>> map-side join. OOM exceptions never happened there: only about 1MB was
>>> needed to load those 20K records, with mapred.child.java.opts set to
>>> -Xmx200m.
>>>
>>> Here is the schema of our small table:
>>>
>>> > describe application;
>>> transaction_id    string
>>> subclass_id       string
>>> class_id          string
>>> memo              string
>>> url_alias         string
>>> url_pattern       string
>>> dt                string (daily partitioned)
>>>
>>> Thanks,
>>> Min
>>>
>>> On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <[email protected]> wrote:
>>>
>>> 1. Can you reduce the number of cached rows and try?
>>> 2. Were you using the default memory settings of the mapper? If yes,
>>> can you increase them and try?
>>>
>>> It would be useful to try both of them independently - that would also
>>> give a good idea of the memory consumption of JDBM.
>>>
>>> Can you send the exact schema/data of the small table if possible? You
>>> can file a JIRA and load it there if it is not a security issue.
>>>
>>> Thanks,
>>> -namit
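A sketch of trying Namit's two knobs independently from the Hive CLI; the
values below are illustrative, not recommendations:

-- (1) Shrink the in-memory row cache, keep the default mapper heap:
hive> set hive.mapjoin.cache.numrows=1000;
hive> select /*+ MAPJOIN(a) */ ... ;   -- the same join query as above

-- (2) Restore a larger cache (25000 has been the shipped default) and
--     enlarge the mapper heap instead:
hive> set hive.mapjoin.cache.numrows=25000;
hive> set mapred.child.java.opts=-Xmx1024m;
hive> select /*+ MAPJOIN(a) */ ... ;

If (1) alone avoids the OOM, the JDBM row cache is the main consumer; if
only (2) does, the per-row overhead simply needs a bigger heap.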
>>> On 6/14/09 9:23 PM, "Min Zhou" <[email protected]> wrote:
>>>
>>> 20k

--
My research interests are distributed systems, parallel computing, and
bytecode-based virtual machines.

My profile: http://www.linkedin.com/in/coderplay
My blog: http://coderplay.javaeye.com
