[
https://issues.apache.org/jira/browse/HIVE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13659934#comment-13659934
]
Hudson commented on HIVE-4440:
------------------------------
Integrated in Hive-trunk-hadoop2 #199 (See
[https://builds.apache.org/job/Hive-trunk-hadoop2/199/])
HIVE-4440 SMB Operator spills to disk like it's 1999 (Gunther Hagleitner via
omalley) (Revision 1483084)
Result = FAILURE
omalley :
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1483084
Files :
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/conf/hive-default.xml.template
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java
> SMB Operator spills to disk like it's 1999
> ------------------------------------------
>
> Key: HIVE-4440
> URL: https://issues.apache.org/jira/browse/HIVE-4440
> Project: Hive
> Issue Type: Bug
> Reporter: Gunther Hagleitner
> Assignee: Gunther Hagleitner
> Fix For: 0.12.0
>
> Attachments: HIVE-4440.1.patch, HIVE-4440.2.patch
>
>
> I was recently looking into a performance issue with a query that used SMB
> join and was running really slowly. It turns out that the SMB join by default
> caches only 100 values per key before spilling to disk. That seems overly
> conservative to me. Changing the parameter resulted in a ~5x speedup - quite
> significant.
> The parameter is: hive.mapjoin.bucket.cache.size
> which right now is only used by the SMB Operator as far as I can tell.
> The parameter was introduced originally (3 yrs ago) for the map join operator
> (looks like pre-SMB) and set to 100 to avoid OOM. That seems to have been in
> a different context though where you had to avoid running out of memory with
> the cached hash table in the same process, I think.
> Two things I'd like to propose:
> a) Rename it to what it does: hive.smbjoin.cache.rows
> b) Set it to something less restrictive: 10000
> If you string together a 5-table SMB join with a map join and a map-side
> group by aggregation you might still run out of memory, but the renamed
> parameter should be easier to find and reduce. For most queries, I would
> think that 10000 is still a reasonable number to cache (On the reduce side we
> use 25000 for shuffle joins).
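
For reference, a minimal sketch of how the row-cache threshold described in the issue could be adjusted per session from the Hive CLI. It assumes a build that includes this change, so that the property is exposed under the proposed name hive.smbjoin.cache.rows; on builds without the patch the setting would still go by hive.mapjoin.bucket.cache.size. The value 1000 in the second statement is purely illustrative and is not a recommendation from the issue.

```sql
-- Assumes the renamed property proposed in HIVE-4440; on builds without the
-- patch, the old name hive.mapjoin.bucket.cache.size would be used instead.
SET hive.smbjoin.cache.rows=10000;

-- Hypothetical lower value for a query that chains several SMB joins with a
-- map join and a map-side aggregation and runs short of memory.
SET hive.smbjoin.cache.rows=1000;
```

Issuing SET with the property name and no value (SET hive.smbjoin.cache.rows;) prints the setting currently in effect, which is a quick way to check which name a given build recognizes.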