[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-08-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742863#comment-13742863
 ] 

Hudson commented on HIVE-4838:
--

ABORTED: Integrated in Hive-trunk-hadoop2 #365 (See 
[https://builds.apache.org/job/Hive-trunk-hadoop2/365/])
HIVE-4838 : Refactor MapJoin HashMap code to improve testability and 
readability (Brock Noland via Ashutosh Chauhan) (hashutosh: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1514760)
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/AbstractMapJoinOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/JoinUtil.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapJoinMetaData.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapJoinOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/mapjoin
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/mapjoin/MapJoinMemoryExhaustionException.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/mapjoin/MapJoinMemoryExhaustionHandler.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/MapredLocalTask.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/AbstractMapJoinKey.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/AbstractMapJoinTableContainer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/AbstractRowContainer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/DCLLItem.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/HashMapWrapper.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MRU.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinDoubleKeys.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinKey.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinObjectKey.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinObjectSerDeContext.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinObjectValue.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinRowContainer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinSingleKey.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinTableContainer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinTableContainerSerDe.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/RowContainer.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestHashMapWrapper.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/mapjoin
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/mapjoin/TestMapJoinMemoryExhaustionHandler.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/persistence/TestMapJoinEqualityTableContainer.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/persistence/TestMapJoinKey.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/persistence/TestMapJoinKeys.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/persistence/TestMapJoinRowContainer.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/persistence/TestMapJoinTableContainer.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/persistence/Utilities.java


 Refactor MapJoin HashMap code to improve testability and readability

 Key: HIVE-4838
 URL: https://issues.apache.org/jira/browse/HIVE-4838
 Project: Hive
 Issue Type: Bug
 Reporter: Brock Noland
 Assignee: Brock Noland
 Fix For: 0.12.0

 Attachments: HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch, 
 HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch


 MapJoin is an essential component for high-performance joins in Hive, and the 
 current code has done great service for many years. However, the code is 
 showing its age and currently suffers from the following issues:
 * It uses static state, via the MapJoinMetaData class, to pass serialization 
 metadata to the Key and Row classes.
 * The API of a logical Table Container is not defined, so it is unclear which 
 APIs HashMapWrapper needs to expose. Additionally, HashMapWrapper has many 
 unused public methods.
 * HashMapWrapper contains logic to serialize, test memory bounds, and 
 implement the table container. Ideally these logical units could be 

[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-08-16 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742184#comment-13742184
 ] 

Brock Noland commented on HIVE-4838:


Done, looks like the last build had a connection error to source control.

 Refactor MapJoin HashMap code to improve testability and readability

 Key: HIVE-4838
 URL: https://issues.apache.org/jira/browse/HIVE-4838
 Project: Hive
 Issue Type: Bug
 Reporter: Brock Noland
 Assignee: Brock Noland
 Attachments: HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch, 
 HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch


 MapJoin is an essential component for high-performance joins in Hive, and the 
 current code has done great service for many years. However, the code is 
 showing its age and currently suffers from the following issues:
 * It uses static state, via the MapJoinMetaData class, to pass serialization 
 metadata to the Key and Row classes.
 * The API of a logical Table Container is not defined, so it is unclear which 
 APIs HashMapWrapper needs to expose. Additionally, HashMapWrapper has many 
 unused public methods.
 * HashMapWrapper contains logic to serialize, test memory bounds, and 
 implement the table container. Ideally these logical units would be separated.
 * HashTableSinkObjectCtx has unused fields and unused methods.
 * CommonJoinOperator and its children declare variables as ArrayList on the 
 left-hand side when only List is required.
 * There are unused classes (MRU, DCLLItem) and classes that duplicate 
 functionality (MapJoinSingleKey and MapJoinDoubleKeys).
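
 The issues listed above could be illustrated with a small sketch. The code 
 below is not taken from the HIVE-4838 patch: every type name in it 
 (MapJoinSketch, SerDeContext, TableContainer, HashMapTableContainer) is 
 hypothetical and only shows the general idea of a narrow container API, 
 serialization metadata passed in explicitly rather than through static 
 state, and declarations against List/Map instead of concrete classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these types come from the actual Hive patch.
final class MapJoinSketch {

    /** Serialization metadata handed over explicitly instead of read from static state. */
    static final class SerDeContext {
        final String keySerDeName;
        final String valueSerDeName;

        SerDeContext(String keySerDeName, String valueSerDeName) {
            this.keySerDeName = keySerDeName;
            this.valueSerDeName = valueSerDeName;
        }
    }

    /** A narrow, explicit API for a logical table container. */
    interface TableContainer {
        void put(List<Object> key, List<Object> row);
        List<List<Object>> get(List<Object> key);
        int size();
    }

    /** In-memory container only; serialization and memory checks would live elsewhere. */
    static final class HashMapTableContainer implements TableContainer {
        private final SerDeContext context;
        // Declared as Map/List so callers depend on the interfaces, not on HashMap/ArrayList.
        private final Map<List<Object>, List<List<Object>>> rows = new HashMap<>();

        HashMapTableContainer(SerDeContext context) {
            this.context = context;
        }

        SerDeContext context() {
            return context;
        }

        @Override
        public void put(List<Object> key, List<Object> row) {
            // Rows sharing a join key are appended to the same bucket.
            rows.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        }

        @Override
        public List<List<Object>> get(List<Object> key) {
            return rows.getOrDefault(key, Collections.emptyList());
        }

        @Override
        public int size() {
            // Number of distinct join keys, not number of rows.
            return rows.size();
        }
    }

    public static void main(String[] args) {
        TableContainer table =
            new HashMapTableContainer(new SerDeContext("keySerDe", "valueSerDe"));
        List<Object> key = Arrays.asList(42);
        table.put(key, Arrays.asList("a", 1));
        table.put(key, Arrays.asList("b", 2));
        System.out.println(table.get(key).size()); // prints 2
    }
}
```

 Because the context is a constructor argument, a unit test can build a 
 container with whatever metadata it needs without touching global state.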

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-08-16 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742310#comment-13742310
 ] 

Hive QA commented on HIVE-4838:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12598209/HIVE-4838.patch

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 2884 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_udtf_not_supported2
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/463/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/463/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests failed with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-08-16 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742316#comment-13742316
 ] 

Brock Noland commented on HIVE-4838:


That test has been failing since commit. I believe Gunther asked someone to 
look at it.



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-08-16 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742343#comment-13742343
 ] 

Brock Noland commented on HIVE-4838:


Thanks!! I have opened HIVE-5110 to look at the memory consumption stuff we 
discussed.



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-08-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742550#comment-13742550
 ] 

Hudson commented on HIVE-4838:
--

FAILURE: Integrated in Hive-trunk-hadoop2-ptest #61 (See 
[https://builds.apache.org/job/Hive-trunk-hadoop2-ptest/61/])
HIVE-4838 : Refactor MapJoin HashMap code to improve testability and 
readability (Brock Noland via Ashutosh Chauhan) (hashutosh: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1514760)

[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-08-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742622#comment-13742622
 ] 

Hudson commented on HIVE-4838:
--

FAILURE: Integrated in Hive-trunk-hadoop1-ptest #130 (See 
[https://builds.apache.org/job/Hive-trunk-hadoop1-ptest/130/])
HIVE-4838 : Refactor MapJoin HashMap code to improve testability and 
readability (Brock Noland via Ashutosh Chauhan) (hashutosh: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1514760)

[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-08-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742851#comment-13742851
 ] 

Hudson commented on HIVE-4838:
--

FAILURE: Integrated in Hive-trunk-h0.21 #2273 (See 
[https://builds.apache.org/job/Hive-trunk-h0.21/2273/])
HIVE-4838 : Refactor MapJoin HashMap code to improve testability and 
readability (Brock Noland via Ashutosh Chauhan) (hashutosh: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1514760)

[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-08-15 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13741882#comment-13741882
 ] 

Ashutosh Chauhan commented on HIVE-4838:


[~brocknoland] Can you trigger a Hive QA run for this?



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-08-12 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737471#comment-13737471
 ] 

Ashutosh Chauhan commented on HIVE-4838:


Good work, Brock. I left some comments on Phabricator. Another question: it 
seems there are a few file moves (mvs). To preserve history, how shall we 
proceed with applying this patch on trunk?

 Refactor MapJoin HashMap code to improve testability and readability
 

 Key: HIVE-4838
 URL: https://issues.apache.org/jira/browse/HIVE-4838
 Project: Hive
  Issue Type: Bug
Reporter: Brock Noland
Assignee: Brock Noland
 Attachments: HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch, 
 HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch


 MapJoin is an essential component for high performance joins in Hive and the 
 current code has done great service for many years. However, the code is 
 showing it's age and currently suffers  from the following issues:
 * Uses static state via the MapJoinMetaData class to pass serialization 
 metadata to the Key, Row classes.
 * The api of a logical Table Container is not defined and therefore it's 
 unclear what apis HashMapWrapper 
 needs to publicize. Additionally HashMapWrapper has many used public methods.
 * HashMapWrapper contains logic to serialize, test memory bounds, and 
 implement the table container. Ideally these logical units could be seperated
 * HashTableSinkObjectCtx has unused fields and unused methods
 * CommonJoinOperator and children use ArrayList on left hand side when only 
 List is required
 * There are unused classes MRU, DCLLItemm and classes which duplicate 
 functionality MapJoinSingleKey and MapJoinDoubleKeys



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-08-12 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737482#comment-13737482
 ] 

Brock Noland commented on HIVE-4838:


Sounds good, I will address them. In regards to the moves, I don't believe 
there are any true mv's. MapJoinObjectKey -> MapJoinKey is kind of a move, but 
I'd say it's more of a complete re-implementation.



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-08-12 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737492#comment-13737492
 ] 

Ashutosh Chauhan commented on HIVE-4838:


Ok. Sounds good.



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-08-10 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13735785#comment-13735785
 ] 

Ashutosh Chauhan commented on HIVE-4838:


[~brocknoland] Let's get this in before the patch gets stale.



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-08-10 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13735950#comment-13735950
 ] 

Brock Noland commented on HIVE-4838:


Good call, I will make the change tonight and upload a new patch.



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-08-10 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736168#comment-13736168
 ] 

Hive QA commented on HIVE-4838:
---



{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12597317/HIVE-4838.patch

{color:green}SUCCESS:{color} +1 2779 tests passed

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/387/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/387/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-08-07 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13732293#comment-13732293
 ] 

Ashutosh Chauhan commented on HIVE-4838:


Actually, the memory monitoring I was talking about concerns the local task 
that generates the hashtable, which happens locally on the client. To generate 
a hashtable (which is then shipped to task nodes) we launch a local job on the 
client in a separate process. The memory management logic for this local task 
is convoluted (not that of the MR job which actually does the join in the 
mapper). This local task monitors its own memory, but it seems like 
MapredLocalTask is catching the OOM exception anyway. One of these is not 
required. My thinking is that there shouldn't be any memory monitoring and we 
should just catch the OOM exception when it fails. In any case, a join is 
converted into a mapjoin only when the small table is small enough (governed by 
a config knob), so this OOM should be very rare. So my suggestion is to remove 
MemoryHandler altogether.

The ORC memory manager won't be a problem here, since ORC makes use of the 
memory manager only while writing data, and here we are dumping the hashtable 
in Java serialized format, so that won't be relevant. For a similar reason 
(that this is a local task), java.opts and io.sort.mb aren't relevant either.



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-08-07 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13732571#comment-13732571
 ] 

Brock Noland commented on HIVE-4838:


What I was saying is that the local task JVM could be of a different size than 
the mapred.child.java.opts on the server. I haven't heard of people hitting this 
much so it must not be too much of an issue. Good to know the ORC stuff is only 
used on write so it won't be an issue.

I am fine with removing the memory handling and using OOM. I think that I will 
allocate a buffer of say 1MB and then when the OOM is hit free that buffer so 
we can cleanly exit and log.
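
The reserve-buffer idea described above could look roughly like the following. This is only a minimal sketch under stated assumptions: `ReserveBufferSketch` and `runGuarded` are hypothetical names that do not come from the patch, and the 1MB size is simply the figure mentioned in the comment.

```java
// Hypothetical helper, not a class from the Hive patch.
final class ReserveBufferSketch {
    // Size of the reserve; 1MB is the figure floated in the discussion.
    private static final int RESERVE_BYTES = 1024 * 1024;

    // Memory held back so there is headroom for cleanup after an OutOfMemoryError.
    private static byte[] reserve = new byte[RESERVE_BYTES];

    // Runs the task; returns true on success, false if it ran out of memory.
    static boolean runGuarded(Runnable task) {
        try {
            task.run();
            return true;
        } catch (OutOfMemoryError oom) {
            // Drop the reserve first so the logging below has room to allocate.
            reserve = null;
            System.err.println("Hash table generation ran out of memory: " + oom.getMessage());
            return false;
        }
    }
}
```

A caller would wrap the hash table generation in `runGuarded` and exit with a failure status when it returns false. Whether freeing the reserve is actually enough to log and exit cleanly depends on the JVM and the heap state at the time, so this should be treated as best effort.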



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-08-07 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13732597#comment-13732597
 ] 

Ashutosh Chauhan commented on HIVE-4838:


bq. I am fine with removing the memory handling and using OOM. I think that I 
will allocate a buffer of say 1MB and then when the OOM is hit free that buffer 
so we can cleanly exit and log.

Sounds good. Let's proceed with that. Though, I believe 256KB should be more 
than sufficient to generate the exception and cleanly exit.



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-08-01 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726497#comment-13726497
 ] 

Ashutosh Chauhan commented on HIVE-4838:


bq. The current code is using this static code because by using java 
serialization there is no way to pass any context information down to the 
class when the read/write methods are being called. In the new patch I define 
my own read/write methods 

By tracking metadata info per key, is it going to increase the size of the 
hashtable? 
Earlier, the metadata info was passed as one blob and loaded statically, where 
it could be looked up by every key. Agreed, it is not a clean way of doing it, 
but now this patch is storing metadata info per key, which looks like it will 
increase the size of the hashtable.



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-08-01 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726507#comment-13726507
 ] 

Brock Noland commented on HIVE-4838:


Hey, can you explain a little bit more? We aren't writing out the metadata 
per-key or anything like that; we are passing the metadata down into new 
read/write methods. AFAICT the current approach did the static stuff because 
it was using the Externalizable interface, which did not allow any 
push-down of metadata during serialization or deserialization. If you look at 
MapJoinTableContainer read and write in the new patch you'll see us pushing the 
metadata (called context) down into the *new* read/write methods, and the 
corresponding read/write methods are not serializing that metadata.



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-08-01 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726791#comment-13726791
 ] 

Ashutosh Chauhan commented on HIVE-4838:


Yeah, I misunderstood that piece. Another question: I see that you have 
improved memory handling. But I am confused about why we need to monitor memory 
usage here at all. This predates your patch, so the question really is whether 
we need the memory handler here. It seems it was put in place so that we can 
proactively kill the local task before it throws OOM. But since MapRedLocalTask 
catches the OOM exception anyway, it seems like even if the local task didn't 
kill itself before OOM'ing, we should be fine, since MapRedLocalTask will take 
care of the OOM exception.



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-08-01 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13727102#comment-13727102
 ] 

Brock Noland commented on HIVE-4838:


I guess we could go that route. My thought was that the memory consumption was 
monitored to be conservative. I've always wondered about this. I mean, if an 
admin sets mapred.child.java.opts and io.sort.mb final on the cluster, the 
settings we are using from a client perspective could be completely different, 
so it's possible it works locally but fails on the cluster. Another 
question I had about this is that ORC has a memory manager that assumes it can 
use a certain percentage of RAM, but could that conflict with our work here? 
That is, could the ORC memory manager use memory while creating the hash table 
that we won't use when reading the hash table?

Additionally I thought it might make sense to only store offsets into a side 
file in the hash map to reduce memory consumption and then throw say a 25MB LRU 
cache on lookups into the file. Since the file is small it should be in OS 
buffer cache when not in the LRU cache.
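
The LRU part of that idea can be sketched with `java.util.LinkedHashMap` in access order. This is a hedged illustration only: `OffsetLruCacheSketch` is a hypothetical name, the side file itself is not modeled, and the cache is bounded by entry count rather than by the 25MB byte budget mentioned above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: an LRU cache keyed by side-file offset.
// Bounded by entry count here; a byte-size bound would need extra accounting.
final class OffsetLruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    OffsetLruCacheSketch(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most recently accessed.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; evicts the least recently accessed entry when over capacity.
        return size() > maxEntries;
    }
}
```

On a cache miss the caller would seek to the stored offset in the side file, deserialize the row, and `put` it back into the cache.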

Maybe we should take up memory management during map joins in another jira?



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-07-29 Thread Edward Capriolo (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723403#comment-13723403
 ] 

Edward Capriolo commented on HIVE-4838:
---

Hey, I think I may have mistakenly come to the conclusion that 
https://issues.apache.org/jira/browse/HIVE-2906
passed tests when it did not. We might be best off reverting 2906 if it is a 
problem.




[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-07-23 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13717703#comment-13717703
 ] 

Ashutosh Chauhan commented on HIVE-4838:


[~brocknoland] One of the items listed in the description is:
* Uses static state via the MapJoinMetaData class to pass serialization 
metadata to the Key, Row classes.

Have you attacked this in this patch? If yes, how did you fix it? I haven't 
dived into the patch to figure that out yet.



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-07-23 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13717716#comment-13717716
 ] 

Brock Noland commented on HIVE-4838:


Hey,

Yes, I have. I'll upload an updated patch here in a few minutes. The current 
code uses this static state because, with Java serialization, there is no 
way to pass any context information down to the class when the read/write 
methods are being called. In the new patch I define my own read/write methods 
(example below).

{noformat}
public void read(MapJoinObjectSerDeContext context, ObjectInputStream in,
    Writable container)
    throws IOException, SerDeException {
{noformat}

and use those to serialize/deserialize the objects. Specifically, in the new 
patch, MapJoinRowContainer.read/write, MapJoinTableContainerSerDe.load/persist, 
and MapJoinKey.read/write will be interesting.
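
The idea of passing serialization context as an explicit argument, rather than reading it from static state, can be sketched roughly as follows. This is a minimal illustration and not the code from the patch: ContextSerDeSketch, SerDeContext, Key, and roundTrip are hypothetical names, and a character set stands in for the real serialization metadata.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; none of these names are Hive classes. */
final class ContextSerDeSketch {

  /** Hypothetical stand-in for per-table serialization metadata. */
  static final class SerDeContext {
    final Charset charset;

    SerDeContext(Charset charset) {
      this.charset = charset;
    }
  }

  /** Hypothetical key whose read/write methods receive the context as a parameter. */
  static final class Key {
    String value;

    void write(SerDeContext context, ObjectOutputStream out) throws IOException {
      byte[] bytes = value.getBytes(context.charset);
      out.writeInt(bytes.length);
      out.write(bytes);
    }

    void read(SerDeContext context, ObjectInputStream in) throws IOException {
      byte[] bytes = new byte[in.readInt()];
      in.readFully(bytes);
      value = new String(bytes, context.charset);
    }
  }

  /** Writes a key and reads it back, handing the same context to both calls. */
  static String roundTrip(SerDeContext context, String value) {
    try {
      Key original = new Key();
      original.value = value;
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
        original.write(context, out);
      }
      Key copy = new Key();
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
        copy.read(context, in);
      }
      return copy.value;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    SerDeContext context = new SerDeContext(StandardCharsets.UTF_8);
    System.out.println(roundTrip(context, "148"));
  }
}
```

Because the context is an ordinary argument, a unit test can build one and call read/write directly, with no shared static state to set up or reset between tests.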



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-07-23 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13717719#comment-13717719
 ] 

Brock Noland commented on HIVE-4838:


Updated review https://reviews.facebook.net/D11679

 Refactor MapJoin HashMap code to improve testability and readability
 

 Key: HIVE-4838
 URL: https://issues.apache.org/jira/browse/HIVE-4838
 Project: Hive
  Issue Type: Bug
Reporter: Brock Noland
Assignee: Brock Noland
 Attachments: HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch




[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-07-13 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707751#comment-13707751
 ] 

Brock Noland commented on HIVE-4838:


Correct, I believe this only affects the null-safe operator.



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-07-12 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707154#comment-13707154
 ] 

Ashutosh Chauhan commented on HIVE-4838:


I see there is an update to a .q.out file. Does that mean there is a correctness 
issue in the existing code?



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-07-12 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707181#comment-13707181
 ] 

Brock Noland commented on HIVE-4838:


Hi,

Correct, there is. It's related to the snippet of code I posted earlier. 
Basically, the equals implementation of MapJoinDoubleKeys (and MapJoinObjectKey) 
is incorrect, resulting in different results for the following query depending 
on how it is executed (map-side vs. reduce-side):

{noformat}
SELECT /*+ MAPJOIN(a) */ * FROM smb_input1 a JOIN smb_input1 b
  ON a.key <=> b.key AND a.value <=> b.value
  ORDER BY a.key, a.value, b.key, b.value;
{noformat}

Brock



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-07-12 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707205#comment-13707205
 ] 

Ashutosh Chauhan commented on HIVE-4838:


Interesting. Let's tease that part out of the refactoring, then. We need to fix 
the correctness issue first. Can you create a separate JIRA for this issue and 
submit a minimal patch which fixes it?



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-07-12 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707213#comment-13707213
 ] 

Brock Noland commented on HIVE-4838:


Fair enough, I'll have a patch for HIVE-4845 shortly.



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-07-12 Thread Edward Capriolo (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707239#comment-13707239
 ] 

Edward Capriolo commented on HIVE-4838:
---

So which version is correct: the map join or the map-reduce join? Or were both 
producing the wrong results?



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-07-12 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707269#comment-13707269
 ] 

Brock Noland commented on HIVE-4838:


Map-side is wrong and reduce-side is correct. For that query, on the map side, 
rows which should be joined are not. For example, the reduce side outputs this 
row:

{noformat}
a.key   a.value   b.key   b.value
148     NULL      148     NULL
{noformat}

which makes sense since a.key is equal to b.key and a.value is equal to b.value, 
but the current map-side code omits this row. The reason is that 
MapJoinDoubleKeys, which is used for the map-side join, doesn't properly compare 
null values.
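
For illustration, a two-column key that treats a pair of nulls as equal, which is what a null-safe join needs from its hash table lookup, could look like the sketch below. It is not the Hive fix and does not reproduce MapJoinDoubleKeys; NullSafeKeySketch, DoubleKey, and matches are hypothetical names, and the sketch only applies to null-safe comparison, since a plain SQL = never matches NULL.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical sketch; DoubleKey is not Hive's MapJoinDoubleKeys. */
final class NullSafeKeySketch {

  /** Two-column join key whose equals and hashCode tolerate null columns. */
  static final class DoubleKey {
    private final Object first;
    private final Object second;

    DoubleKey(Object first, Object second) {
      this.first = first;
      this.second = second;
    }

    @Override
    public boolean equals(Object other) {
      if (this == other) {
        return true;
      }
      if (!(other instanceof DoubleKey)) {
        return false;
      }
      DoubleKey that = (DoubleKey) other;
      // Objects.equals returns true for two nulls and never throws on null.
      return Objects.equals(first, that.first) && Objects.equals(second, that.second);
    }

    @Override
    public int hashCode() {
      // Objects.hash accepts null elements, so equal keys hash identically.
      return Objects.hash(first, second);
    }
  }

  /** Probes a one-entry hash table built from the build-side key. */
  static boolean matches(DoubleKey build, DoubleKey probe) {
    Map<DoubleKey, String> table = new HashMap<>();
    table.put(build, "row");
    return table.containsKey(probe);
  }

  public static void main(String[] args) {
    System.out.println(matches(new DoubleKey(148, null), new DoubleKey(148, null)));
    System.out.println(matches(new DoubleKey(148, null), new DoubleKey(148, "v")));
  }
}
```

Keeping hashCode consistent with equals matters as much as the null handling: a hash table lookup compares hashes before it ever calls equals.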



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-07-12 Thread Edward Capriolo (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707318#comment-13707318
 ] 

Edward Capriolo commented on HIVE-4838:
---

This is pretty sad news. How long has map-side join been broken for?



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-07-12 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707332#comment-13707332
 ] 

Brock Noland commented on HIVE-4838:


I think the equals method has been broken since HIVE-1754 but as far as I can 
tell it only affects joins with nulls in the join keys.



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-07-12 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707592#comment-13707592
 ] 

Yin Huai commented on HIVE-4838:


Hi Brock, I have a question. Does this correctness issue only affect joins with 
the <=> operator? Or does it also affect the = operator?



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-07-12 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707658#comment-13707658
 ] 

Yin Huai commented on HIVE-4838:


From the code, it seems this issue only affects the <=> operator.



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-07-10 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704832#comment-13704832
 ] 

Ashutosh Chauhan commented on HIVE-4838:


I am glad you are taking a stab at this, Brock. I looked at it a couple of days 
ago and immediately felt the need for a refactor. I was looking at it from a 
performance point of view. There are a couple of things which are worth 
considering in this refactor:
* We are using Java serialization to serialize the hash table. If we use some 
custom serialization, we can possibly increase both memory efficiency and 
speed for this piece of code.
* Keys and values of the map are wrapper Java objects; if we can use better data 
structures, that will be a further win.

I am just putting these up as thoughts which came to my mind in a 15-minute 
perusal of that class. Feel free to ignore them for now; we can take these up 
later once this basic cleanup is in.
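
To make the first point concrete, the sketch below compares the encoded size of a small map under default Java serialization with a hand-rolled encoding. It is only a toy under stated assumptions: SerializationSizeSketch and its methods are hypothetical, the Integer-to-String map is a stand-in for the real hash table, and the encoding shown is not one proposed in this thread. It measures bytes written, not speed or heap usage.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; the map and encoding are illustrative only. */
final class SerializationSizeSketch {

  /** Bytes produced by default Java serialization of the whole map. */
  static int javaSerializedSize(HashMap<Integer, String> table) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
      out.writeObject(table);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.size();
  }

  /** Bytes produced by a custom layout: entry count, then int key and UTF value. */
  static int customSerializedSize(Map<Integer, String> table) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buffer)) {
      out.writeInt(table.size());
      for (Map.Entry<Integer, String> entry : table.entrySet()) {
        out.writeInt(entry.getKey());
        out.writeUTF(entry.getValue());
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.size();
  }

  /** Builds a toy table with the given number of rows. */
  static HashMap<Integer, String> sampleTable(int rows) {
    HashMap<Integer, String> table = new HashMap<>();
    for (int i = 0; i < rows; i++) {
      table.put(i, "value-" + i);
    }
    return table;
  }

  public static void main(String[] args) {
    HashMap<Integer, String> table = sampleTable(1000);
    System.out.println("java serialization bytes: " + javaSerializedSize(table));
    System.out.println("custom encoding bytes:    " + customSerializedSize(table));
  }
}
```

For this toy map the custom layout comes out smaller because it writes each key as a raw int instead of a boxed object with per-object stream overhead. Whether the same holds for real join tables would need measuring.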

 Refactor MapJoin HashMap code to improve testability and readability
 

 Key: HIVE-4838
 URL: https://issues.apache.org/jira/browse/HIVE-4838
 Project: Hive
  Issue Type: Bug
Reporter: Brock Noland
Assignee: Brock Noland

 MapJoin is an essential component for high performance joins in Hive and the 
 current code has done great service for many years. However, the code is 
 showing its age and currently suffers from the following issues:
 * Uses static state via the MapJoinMetaData class to pass serialization 
 metadata to the Key, Row classes.
 * The API of a logical Table Container is not defined and therefore it's 
 unclear what APIs HashMapWrapper needs to publicize. Additionally, 
 HashMapWrapper has many unused public methods.
 * HashMapWrapper contains logic to serialize, test memory bounds, and 
 implement the table container. Ideally these logical units could be separated.
 * HashTableSinkObjectCtx has unused fields and unused methods.
 * CommonJoinOperator and children use ArrayList on left hand side when only 
 List is required.
 * There are unused classes MRU, DCLLItem, MapJoinSingleKey, and 
 MapJoinDoubleKeys.



[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

2013-07-10 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13705401#comment-13705401
 ] 

Brock Noland commented on HIVE-4838:


Hey, thanks for the feedback!

Yes, I thought about those items as well. I have a patch just about ready, which 
I'd like to get in before the optimizations since it fixes some correctness 
bugs, but I'd love to pursue those two items in a follow-up JIRA. For example, 
the following code produces unexpected results :)

{noformat}
  public static void main(String[] args) {
    // Two-column keys built from the same values, including a null column.
    MapJoinDoubleKeys left = new MapJoinDoubleKeys(148, null);
    MapJoinDoubleKeys right = new MapJoinDoubleKeys(148, null);
    System.out.println(left.equals(right));
    // Object-array keys built the same way, each with a null element.
    // Renamed so they no longer clash with left/right declared above.
    MapJoinObjectKey leftObj = new MapJoinObjectKey(new Object[]{null, left});
    MapJoinObjectKey rightObj = new MapJoinObjectKey(new Object[]{null, right});
    System.out.println(leftObj.equals(rightObj));
  }
{noformat}
