[jira] [Commented] (HIVE-4440) SMB Operator spills to disk like it's 1999

2014-05-19 Thread Lefty Leverenz (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001506#comment-14001506
 ] 

Lefty Leverenz commented on HIVE-4440:
--

Documented *hive.smbjoin.cache.rows* in the wiki, and revised 
*hive.mapjoin.bucket.cache.size*:

* [hive.smbjoin.cache.rows | 
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.smbjoin.cache.rows]
* [hive.mapjoin.bucket.cache.size | 
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.mapjoin.bucket.cache.size]

 SMB Operator spills to disk like it's 1999
 --

 Key: HIVE-4440
 URL: https://issues.apache.org/jira/browse/HIVE-4440
 Project: Hive
  Issue Type: Bug
Reporter: Gunther Hagleitner
Assignee: Gunther Hagleitner
 Fix For: 0.12.0

 Attachments: HIVE-4440.1.patch, HIVE-4440.2.patch


 I was recently looking into a performance issue with a query that used SMB 
 join and was running really slowly. It turns out that the SMB join by default 
 caches only 100 values per key before spilling to disk. That seems overly 
 conservative to me. Changing the parameter resulted in a ~5x speedup - quite 
 significant.
 The parameter is: hive.mapjoin.bucket.cache.size
 Right now it is only used by the SMB Operator, as far as I can tell.
 The parameter was originally introduced (3 yrs ago) for the map join operator 
 (looks like pre-SMB) and set to 100 to avoid OOM. That seems to have been in 
 a different context though, where you had to avoid running out of memory with 
 the cached hash table in the same process, I think.
 Two things I'd like to propose:
 a) Rename it to what it does: hive.smbjoin.cache.rows
 b) Set it to something less restrictive: 10000
 If you string together a 5-table SMB join with a map join and a map-side 
 group-by aggregation you might still run out of memory, but the renamed 
 parameter should be easier to find and reduce. For most queries, I would 
 think that 10000 is still a reasonable number to cache (on the reduce side we 
 use 25000 for shuffle joins).
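
To make the effect of the per-key cap concrete, here is a rough Java model of a row buffer that is flushed whenever it reaches its cap. It is only a sketch and not Hive's actual row container code: the class SpillModel and the method countSpills are hypothetical names, and the 25,000-row key is an arbitrary example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not Hive code: counts how often a capped in-memory
// buffer has to be flushed while the rows of a single join key arrive.
final class SpillModel {

    static int countSpills(int rowsForKey, int cacheRows) {
        List<Integer> buffer = new ArrayList<>();
        int spills = 0;
        for (int row = 0; row < rowsForKey; row++) {
            if (buffer.size() == cacheRows) {
                // Buffer is full: pretend to write it to disk, then reuse it.
                spills++;
                buffer.clear();
            }
            buffer.add(row);
        }
        return spills;
    }

    public static void main(String[] args) {
        // One key with 25,000 rows under the old and the proposed cap.
        System.out.println(countSpills(25_000, 100));    // prints 249
        System.out.println(countSpills(25_000, 10_000)); // prints 2
    }
}
```

In this simplified model, the cap of 100 forces hundreds of flushes for a single large key, while a cap of 10000 needs only a couple. How much of the reported ~5x speedup comes from this is not something the model can tell.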



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (HIVE-4440) SMB Operator spills to disk like it's 1999

2014-05-18 Thread Lefty Leverenz (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001270#comment-14001270
]

Lefty Leverenz commented on HIVE-4440:
--

Hive 0.13.0 did not remove *hive.mapjoin.bucket.cache.size*. Also, the comment
that says it should be removed has a typo in the name of the new parameter --
it should be *hive.smbjoin.cache.rows*, not hive.smbjoin.cache.row:

{quote}
+// hive.mapjoin.bucket.cache.size has been replaced by hive.smbjoin.cache.row,
+// need to remove by hive .13. Also, do not change default (see SMB operator)
{quote}

Instead of creating a new jira for this, I'll add a comment on HIVE-6586 (for
HIVE-6037).


[jira] [Commented] (HIVE-4440) SMB Operator spills to disk like it's 1999

2014-05-18 Thread Lefty Leverenz (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001280#comment-14001280
 ] 

Lefty Leverenz commented on HIVE-4440:
--

Here's the comment I added to HIVE-6586:

* [comment about hive.mapjoin.bucket.cache.size and hive.smbjoin.cache.rows | 
https://issues.apache.org/jira/browse/HIVE-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001274#comment-14001274]


[jira] [Commented] (HIVE-4440) SMB Operator spills to disk like it's 1999

2013-05-16 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13659362#comment-13659362
 ] 

Hudson commented on HIVE-4440:
--

Integrated in Hive-trunk-h0.21 #2105 (See 
[https://builds.apache.org/job/Hive-trunk-h0.21/2105/])
HIVE-4440 SMB Operator spills to disk like it's 1999 (Gunther Hagleitner via
omalley) (Revision 1483084)

 Result = FAILURE
omalley : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1483084
Files : 
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/conf/hive-default.xml.template
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-4440) SMB Operator spills to disk like it's 1999

2013-05-16 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13659934#comment-13659934
 ] 

Hudson commented on HIVE-4440:
--

Integrated in Hive-trunk-hadoop2 #199 (See 
[https://builds.apache.org/job/Hive-trunk-hadoop2/199/])
HIVE-4440 SMB Operator spills to disk like it's 1999 (Gunther Hagleitner via
omalley) (Revision 1483084)

 Result = FAILURE
omalley : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1483084
Files : 
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/conf/hive-default.xml.template
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java



[jira] [Commented] (HIVE-4440) SMB Operator spills to disk like it's 1999

2013-05-15 Thread Gunther Hagleitner (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658727#comment-13658727
]

Gunther Hagleitner commented on HIVE-4440:
--

[~owen.omalley]: Ran all tests - no failures.

SMB Operator spills to disk like it's 1999
--

Key: HIVE-4440
URL: https://issues.apache.org/jira/browse/HIVE-4440
Project: Hive
Issue Type: Bug
Reporter: Gunther Hagleitner
Assignee: Gunther Hagleitner
Attachments: HIVE-4440.1.patch, HIVE-4440.2.patch


[jira] [Commented] (HIVE-4440) SMB Operator spills to disk like it's 1999

2013-05-14 Thread Owen O'Malley (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13657645#comment-13657645
]

Owen O'Malley commented on HIVE-4440:
-

Patch.2 looks good. I agree with Namit that supporting the old config makes
sense. Can you run tests on it? We might consider warning in Hive 0.13 and
removing in Hive 0.14. *smile*


[jira] [Commented] (HIVE-4440) SMB Operator spills to disk like it's 1999

2013-05-02 Thread Gunther Hagleitner (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647395#comment-13647395
]

Gunther Hagleitner commented on HIVE-4440:
--

Thanks :-)

Patch .2 honors the old parameter unless it's at the default in which case it
uses the new one. I also put documentation around it.
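
The fallback rule described here could be sketched as follows. This is an illustration only, not the code from HIVE-4440.2.patch: the class SmbCacheRows, the method resolve, and the plain Map standing in for HiveConf are hypothetical, and the new default of 10000 is an assumption (the old default of 100 comes from the issue description).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "honor the old parameter unless it is at its default".
final class SmbCacheRows {
    static final String OLD_KEY = "hive.mapjoin.bucket.cache.size";
    static final String NEW_KEY = "hive.smbjoin.cache.rows";
    static final int OLD_DEFAULT = 100;     // from the issue description
    static final int NEW_DEFAULT = 10_000;  // assumed default for the new parameter

    static int resolve(Map<String, String> conf) {
        int oldValue = Integer.parseInt(
                conf.getOrDefault(OLD_KEY, String.valueOf(OLD_DEFAULT)));
        if (oldValue != OLD_DEFAULT) {
            // The user tuned the old parameter, so keep respecting it.
            return oldValue;
        }
        // Old parameter untouched (or set to its default): use the new one.
        return Integer.parseInt(
                conf.getOrDefault(NEW_KEY, String.valueOf(NEW_DEFAULT)));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(resolve(conf)); // prints 10000
        conf.put(OLD_KEY, "500");
        System.out.println(resolve(conf)); // prints 500
    }
}
```

One consequence of comparing against the default is that the rule cannot tell "never set" apart from "explicitly set to 100", so in this sketch the old parameter's default value has to stay fixed for the check to keep working.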

You bring up a good point, but are you sure it's necessary to support both in
this case? It's slightly ugly in the code and requires us to go in again later
to remove it. My thinking is this: if you use the old parameter, it's probably
because you needed to raise it to get better performance - in which case the
new default should most likely be OK for you. Do you think there are going to
be cases where this falls flat?


[jira] [Commented] (HIVE-4440) SMB Operator spills to disk like it's 1999

2013-04-29 Thread Namit Jain (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645178#comment-13645178
]

Namit Jain commented on HIVE-4440:
--

I really like the title of the jira.

Changing the parameter name is backward incompatible.
Can you support both the current parameter and the proposed parameter for now?
Document it clearly, and say that the current parameter
hive.mapjoin.bucket.cache.size will not be supported
for this from 0.13 or something like that.

SMB Operator spills to disk like it's 1999
--

Key: HIVE-4440
URL: https://issues.apache.org/jira/browse/HIVE-4440
Project: Hive
Issue Type: Bug
Reporter: Gunther Hagleitner
Assignee: Gunther Hagleitner
Attachments: HIVE-4440.1.patch
