[jira] [Commented] (HIVE-4440) SMB Operator spills to disk like it's 1999

2014-05-19 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001506#comment-14001506
 ] 

Lefty Leverenz commented on HIVE-4440:
--

Documented *hive.smbjoin.cache.rows* in the wiki, and revised 
*hive.mapjoin.bucket.cache.size*:

* [hive.smbjoin.cache.rows | 
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.smbjoin.cache.rows]
* [hive.mapjoin.bucket.cache.size | 
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.mapjoin.bucket.cache.size]

 SMB Operator spills to disk like it's 1999
 --

 Key: HIVE-4440
 URL: https://issues.apache.org/jira/browse/HIVE-4440
 Project: Hive
  Issue Type: Bug
Reporter: Gunther Hagleitner
Assignee: Gunther Hagleitner
 Fix For: 0.12.0

 Attachments: HIVE-4440.1.patch, HIVE-4440.2.patch


 I was recently looking into a performance issue with a query that used an SMB 
 join and was running really slowly. It turns out that the SMB join by default 
 caches only 100 values per key before spilling to disk. That seems overly 
 conservative to me. Changing the parameter resulted in a ~5x speedup - quite 
 significant.
 The parameter is: hive.mapjoin.bucket.cache.size
 which right now is only used by the SMB Operator, as far as I can tell.
 The parameter was introduced originally (3 yrs ago) for the map join operator 
 (looks like pre-SMB) and set to 100 to avoid OOM. That seems to have been in 
 a different context though, where you had to avoid running out of memory with 
 the cached hash table in the same process, I think.
 Two things I'd like to propose:
 a) Rename it to what it does: hive.smbjoin.cache.rows
 b) Set it to something less restrictive: 10000
 If you string together a 5 table smb join with a map join and a map-side 
 group by aggregation you might still run out of memory, but the renamed 
 parameter should be easier to find and reduce. For most queries, I would 
 think that 10000 is still a reasonable number to cache (on the reduce side we 
 use 25000 for shuffle joins).
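
 To make the effect of the per-key cache limit concrete, here is a rough 
 sketch. It is not Hive's actual row container logic, just a simplified model 
 that assumes every row beyond the limit for a given key goes to disk. The 
 function name, the key distribution, and the larger limit used for comparison 
 are all made up for illustration.

```python
def rows_spilled(rows_per_key, cache_rows):
    """Count rows that overflow a cache holding at most cache_rows rows per key.

    Simplified model: for each key, the first cache_rows rows stay in memory
    and every further row is assumed to be written to disk.
    """
    return sum(max(0, n - cache_rows) for n in rows_per_key.values())


# Hypothetical distribution: how many rows share each join key.
rows_per_key = {"a": 50, "b": 2_000, "c": 30_000}

spilled_small_cache = rows_spilled(rows_per_key, 100)     # limit of 100 from the description
spilled_large_cache = rows_spilled(rows_per_key, 10_000)  # assumed larger limit

print(spilled_small_cache, spilled_large_cache)
```

 In this toy model only very heavily repeated keys still spill once the limit 
 is raised, which matches the intuition behind the proposal; the real speedup 
 depends on the data and on how the operator actually spills.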



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-4440) SMB Operator spills to disk like it's 1999

2014-05-18 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001270#comment-14001270
 ] 

Lefty Leverenz commented on HIVE-4440:
--

Hive 0.13.0 did not remove *hive.mapjoin.bucket.cache.size*.  Also, the comment 
that says it should be removed has a typo in the name of the new parameter -- 
it should be *hive.smbjoin.cache.rows*, not hive.smbjoin.cache.row:

{quote}
+// hive.mapjoin.bucket.cache.size has been replaced by 
hive.smbjoin.cache.row,
+// need to remove by hive .13. Also, do not change default (see SMB 
operator)
{quote}

Instead of creating a new jira for this, I'll add a comment on HIVE-6586 (for 
HIVE-6037).



[jira] [Commented] (HIVE-4440) SMB Operator spills to disk like it's 1999

2014-05-18 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001280#comment-14001280
 ] 

Lefty Leverenz commented on HIVE-4440:
--

Here's the comment I added to HIVE-6586:

* [comment about hive.mapjoin.bucket.cache.size and hive.smbjoin.cache.rows | 
https://issues.apache.org/jira/browse/HIVE-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001274#comment-14001274]



[jira] [Commented] (HIVE-4440) SMB Operator spills to disk like it's 1999

2013-05-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13659362#comment-13659362
 ] 

Hudson commented on HIVE-4440:
--

Integrated in Hive-trunk-h0.21 #2105 (See 
[https://builds.apache.org/job/Hive-trunk-h0.21/2105/])
HIVE-4440 SMB Operator spills to disk like it's 1999 (Gunther Hagleitner via
omalley) (Revision 1483084)

 Result = FAILURE
omalley : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1483084
Files : 
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/conf/hive-default.xml.template
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-4440) SMB Operator spills to disk like it's 1999

2013-05-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13659934#comment-13659934
 ] 

Hudson commented on HIVE-4440:
--

Integrated in Hive-trunk-hadoop2 #199 (See 
[https://builds.apache.org/job/Hive-trunk-hadoop2/199/])
HIVE-4440 SMB Operator spills to disk like it's 1999 (Gunther Hagleitner via
omalley) (Revision 1483084)

 Result = FAILURE
omalley : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1483084
Files : 
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/conf/hive-default.xml.template
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java




[jira] [Commented] (HIVE-4440) SMB Operator spills to disk like it's 1999

2013-05-15 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658727#comment-13658727
 ] 

Gunther Hagleitner commented on HIVE-4440:
--

[~owen.omalley]: Ran all tests - no failures.

 SMB Operator spills to disk like it's 1999
 --

 Key: HIVE-4440
 URL: https://issues.apache.org/jira/browse/HIVE-4440
 Project: Hive
  Issue Type: Bug
Reporter: Gunther Hagleitner
Assignee: Gunther Hagleitner
 Attachments: HIVE-4440.1.patch, HIVE-4440.2.patch




[jira] [Commented] (HIVE-4440) SMB Operator spills to disk like it's 1999

2013-05-14 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13657645#comment-13657645
 ] 

Owen O'Malley commented on HIVE-4440:
-

Patch.2 looks good. I agree with Namit that supporting the old config makes 
sense. Can you run tests on it? We might consider warning in Hive 0.13 and 
removing in Hive 0.14. *smile*



[jira] [Commented] (HIVE-4440) SMB Operator spills to disk like it's 1999

2013-05-02 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647395#comment-13647395
 ] 

Gunther Hagleitner commented on HIVE-4440:
--

Thanks :-)

Patch .2 honors the old parameter unless it's at the default in which case it 
uses the new one. I also put documentation around it. 

You bring up a good point, but are you sure it's necessary to support both in 
this case? It's slightly ugly in the code and requires us to go in again to 
remove it later. My thinking is this: if you use the old parameter, it's 
probably because you needed to raise it to get better performance - in this 
case the new default should most likely be ok for you. Do you think there are 
going to be cases where this falls flat?
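
The fallback rule described above ("honor the old parameter unless it's at the 
default") could look roughly like the following. This is only a sketch of the 
rule as stated in this comment, not the code from the patch: the function name 
is hypothetical, a plain dict stands in for the Hive configuration, and the 
new default is passed in as an argument rather than assumed.

```python
OLD_KEY = "hive.mapjoin.bucket.cache.size"
NEW_KEY = "hive.smbjoin.cache.rows"
OLD_DEFAULT = 100  # old default, as stated in the issue description


def resolve_cache_rows(conf, new_default):
    """Pick the effective per-key row cache size from a dict-like config.

    If the old parameter was changed from its default, it wins; otherwise the
    new parameter (or new_default when it is unset) is used.
    """
    old_value = int(conf.get(OLD_KEY, OLD_DEFAULT))
    if old_value != OLD_DEFAULT:
        return old_value
    return int(conf.get(NEW_KEY, new_default))
```

One consequence of this rule is that a user who explicitly sets the old 
parameter to 100 cannot be told apart from a user who never set it, so in that 
case the new parameter takes over.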



[jira] [Commented] (HIVE-4440) SMB Operator spills to disk like it's 1999

2013-04-29 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645178#comment-13645178
 ] 

Namit Jain commented on HIVE-4440:
--

I really like the title of the jira.

Changing the parameter name is backward incompatible.
Can you support both the current parameter and the proposed parameter for now?
Document it clearly, and say that the current parameter 
hive.mapjoin.bucket.cache.size will no longer be supported
for this from 0.13 or something like that.

 SMB Operator spills to disk like it's 1999
 --

 Key: HIVE-4440
 URL: https://issues.apache.org/jira/browse/HIVE-4440
 Project: Hive
  Issue Type: Bug
Reporter: Gunther Hagleitner
Assignee: Gunther Hagleitner
 Attachments: HIVE-4440.1.patch

