[jira] [Commented] (HIVE-7685) Parquet memory manager
[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266632#comment-14266632 ]

Brock Noland commented on HIVE-7685:
------------------------------------

bq. But should it be documented in Hive's wiki even though it's a Parquet parameter, since it's in HiveConf.java?

Yes. This was implemented specifically for Hive users, who cannot easily control the number of partitions being written, so I think it makes sense to document it in the Hive Parquet docs.

Parquet memory manager
----------------------

                 Key: HIVE-7685
                 URL: https://issues.apache.org/jira/browse/HIVE-7685
             Project: Hive
          Issue Type: Improvement
          Components: Serializers/Deserializers
            Reporter: Brock Noland
            Assignee: Dong Chen
             Fix For: 0.15.0
         Attachments: HIVE-7685.1.patch, HIVE-7685.1.patch.ready, HIVE-7685.patch, HIVE-7685.patch.ready

Similar to HIVE-4248, Parquet tries to write very large row groups. This causes Hive to run out of memory during dynamic-partition inserts, when a reducer may have many Parquet files open at a given time. As such, we should implement a memory manager which ensures that we don't run out of memory due to writing too many row groups within a single JVM.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
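To make the failure mode in the issue description concrete, here is a hypothetical back-of-envelope sketch (not Hive or Parquet code; all numbers, names, and the fair-share policy are illustrative assumptions) of why buffered row groups scale with the number of open writers, and how a heap-ratio pool like the one behind *parquet.memory.pool.ratio* caps them:

```java
// Hypothetical sketch, not Hive code: why dynamic-partition inserts can OOM
// with Parquet, and how a memory-pool ratio bounds writer buffers.
public class ParquetMemorySketch {

    // Each open Parquet writer buffers up to one row group in memory before
    // flushing, so worst-case heap use grows linearly with open writers.
    static long worstCaseBufferedBytes(int openWriters, long rowGroupSize) {
        return (long) openWriters * rowGroupSize;
    }

    // A memory manager instead scales each writer's allocation down so the
    // total stays within poolRatio * maxHeap (illustrative fair-share policy).
    static long perWriterAllocation(int openWriters, long rowGroupSize,
                                    long maxHeap, double poolRatio) {
        long pool = (long) (maxHeap * poolRatio);
        long fairShare = pool / openWriters;
        return Math.min(rowGroupSize, fairShare);
    }

    public static void main(String[] args) {
        long rowGroup = 128L << 20;  // assumed 128 MB row group
        long heap = 1L << 30;        // assumed 1 GB reducer heap
        int writers = 20;            // assumed 20 partitions open at once

        // Unmanaged: 20 writers x 128 MB buffered exceeds the 1 GB heap.
        System.out.println(worstCaseBufferedBytes(writers, rowGroup));

        // Managed with ratio 0.5: each writer is capped at its share of 512 MB.
        System.out.println(perWriterAllocation(writers, rowGroup, heap, 0.5));
    }
}
```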
[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265725#comment-14265725 ]

Lefty Leverenz commented on HIVE-7685:
--------------------------------------

Doc note: This adds *parquet.memory.pool.ratio* to HiveConf.java.

bq. This config parameter is defined in Parquet, so that it does not start with 'hive.'

But should it be documented in Hive's wiki even though it's a Parquet parameter, since it's in HiveConf.java?
[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260929#comment-14260929 ]

Dong Chen commented on HIVE-7685:
---------------------------------

Verified: the value is correctly passed down.
[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261295#comment-14261295 ]

Brock Noland commented on HIVE-7685:
------------------------------------

+1
[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260742#comment-14260742 ]

Brock Noland commented on HIVE-7685:
------------------------------------

Thank you [~dongc]! I have not checked; are we sure that the values in HiveConf are correctly passed down to the Parquet writer?
[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260763#comment-14260763 ]

Dong Chen commented on HIVE-7685:
---------------------------------

Brock, thanks for your quick feedback! Yes, it is passed down, and I verified it in the Hive + Parquet integration environment:
1. Checked the value in the log.
2. Inserted about 2 GB of data into 5 partitions. It works fine with this change and OOMs without it.

Since that check was done several days ago, I will double-check it today and see the result.
[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260822#comment-14260822 ]

Hive QA commented on HIVE-7685:
-------------------------------

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12689425/HIVE-7685.1.patch

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 6723 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_optimize_nullscan
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_list_bucket_dml_10
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2217/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2217/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2217/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12689425 - PreCommit-HIVE-TRUNK-Build
[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239367#comment-14239367 ]

Hive QA commented on HIVE-7685:
-------------------------------

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12685972/HIVE-7685.patch

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 6699 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_optimize_nullscan
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_ql_rewrite_gbtoidx_cbo_1
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2006/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2006/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2006/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12685972 - PreCommit-HIVE-TRUNK-Build
[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207758#comment-14207758 ]

Ferdinand Xu commented on HIVE-7685:
------------------------------------

Hi, I am afraid the .ready file name extension may not trigger the Hive QA CI test. Better to rename it to *.patch.
[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207765#comment-14207765 ]

Dong Chen commented on HIVE-7685:
---------------------------------

Thanks for the reminder. :) Since this patch cannot build until PARQUET-108 is resolved, I will rename it to trigger the test later.
[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151496#comment-14151496 ]

Dong Chen commented on HIVE-7685:
---------------------------------

Hi Brock, I think a brief design for this memory manager is: every new writer registers itself with the manager, so the manager has an overall view of all the writers. When a condition is met (such as every 1000 rows), it notifies the writers to check their memory usage and flush if necessary.

However, a Parquet-specific problem is that Hive only has a wrapper around ParquetRecordWriter, and ParquetRecordWriter in turn wraps the real writer (InternalParquetRecordWriter) in the Parquet project. Since the behaviors of measuring the dynamic buffer size and flushing are private to the real writer, I think we also have to add code in InternalParquetRecordWriter to implement the memory manager functionality. It seems changing only Hive code cannot fix this JIRA.

Not sure whether we should move this problem to the Parquet project and fix it there, if it is generic enough and not Hive specific? Any other ideas?

Best Regards,
Dong
[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151792#comment-14151792 ]

Brock Noland commented on HIVE-7685:
------------------------------------

Hi Dong, thank you for the investigation. I think we can either put the Parquet memory manager in Parquet, or add APIs that expose the information required to implement the memory manager in Hive. Either approach is fine by me; we can take this work up in PARQUET-108.

Brock
[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152708#comment-14152708 ]

Dong Chen commented on HIVE-7685:
---------------------------------

Sure, I will take PARQUET-108 and put the manager in Parquet.