[jira] [Commented] (HIVE-7685) Parquet memory manager

2015-01-06 Thread Brock Noland (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266632#comment-14266632 ]

Brock Noland commented on HIVE-7685:


bq. But should it be documented in Hive's wiki even though it's a Parquet 
parameter, since it's in HiveConf.java?

Yes, this was implemented specifically for Hive users, who cannot easily control 
the number of partitions being written, so I think it makes sense to document it 
in the hive-parquet docs...

 Parquet memory manager
 --

 Key: HIVE-7685
 URL: https://issues.apache.org/jira/browse/HIVE-7685
 Project: Hive
  Issue Type: Improvement
  Components: Serializers/Deserializers
Reporter: Brock Noland
Assignee: Dong Chen
 Fix For: 0.15.0

 Attachments: HIVE-7685.1.patch, HIVE-7685.1.patch.ready, 
 HIVE-7685.patch, HIVE-7685.patch.ready


 Similar to HIVE-4248, Parquet tries to write very large row groups. This 
 causes Hive to run out of memory during dynamic partition inserts, when a 
 reducer may have many Parquet files open at a given time.
 As such, we should implement a memory manager which ensures that we don't run 
 out of memory due to writing too many row groups within a single JVM.
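For intuition, the arithmetic behind this failure mode can be sketched as follows. This is a hedged illustration: the partition count is made up, and 128 MB is simply Parquet's customary default row group size.

```java
// Back-of-the-envelope arithmetic for the failure mode above. The partition
// count is illustrative; 128 MB is Parquet's customary default row group size.
public class RowGroupMath {
    public static long requiredBytes(int openWriters, long rowGroupBytes) {
        return openWriters * rowGroupBytes;
    }

    public static void main(String[] args) {
        long rowGroup = 128L * 1024 * 1024; // 128 MB buffered per open writer
        int partitions = 50;                // one open writer per dynamic partition
        long needed = requiredBytes(partitions, rowGroup);
        // 50 writers x 128 MB = 6400 MB buffered, easily exceeding a typical heap.
        System.out.println(needed / (1024 * 1024) + " MB");
    }
}
```

Because each open writer buffers a whole row group before flushing, memory use grows linearly with the number of dynamic partitions a single reducer writes.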



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7685) Parquet memory manager

2015-01-05 Thread Lefty Leverenz (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265725#comment-14265725 ]

Lefty Leverenz commented on HIVE-7685:
--

Doc note:  This adds *parquet.memory.pool.ratio* to HiveConf.java.

bq.  This config parameter is defined in Parquet, so that it does not start 
with 'hive.'

But should it be documented in Hive's wiki even though it's a Parquet 
parameter, since it's in HiveConf.java?
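For context, a ratio-style setting like this is typically applied against the JVM max heap. The sketch below is an assumption-laden illustration, not the actual Parquet code: the property name comes from this thread, but the default value and the surrounding class are invented for the example.

```java
import java.util.Properties;

// Illustrative only: how a memory pool sized by parquet.memory.pool.ratio
// could be computed from the JVM heap. The 0.95 default is an assumption.
public class PoolRatioSketch {
    static final String RATIO_KEY = "parquet.memory.pool.ratio";

    static long poolSize(Properties conf, long maxHeapBytes) {
        double ratio = Double.parseDouble(conf.getProperty(RATIO_KEY, "0.95"));
        return (long) (maxHeapBytes * ratio);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(RATIO_KEY, "0.5");
        // With a 4 GB heap and ratio 0.5, the pool would be 2 GB.
        System.out.println(poolSize(conf, 4L * 1024 * 1024 * 1024));
    }
}
```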



[jira] [Commented] (HIVE-7685) Parquet memory manager

2014-12-30 Thread Dong Chen (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260929#comment-14260929 ]

Dong Chen commented on HIVE-7685:
-

Verified that the value is correctly passed down.

 Parquet memory manager
 --

 Key: HIVE-7685
 URL: https://issues.apache.org/jira/browse/HIVE-7685
 Project: Hive
  Issue Type: Improvement
  Components: Serializers/Deserializers
Reporter: Brock Noland
Assignee: Dong Chen
 Attachments: HIVE-7685.1.patch, HIVE-7685.1.patch.ready, 
 HIVE-7685.patch, HIVE-7685.patch.ready


 Similar to HIVE-4248, Parquet tries to write large very large row groups. 
 This causes Hive to run out of memory during dynamic partitions when a 
 reducer may have many Parquet files open at a given time.
 As such, we should implement a memory manager which ensures that we don't run 
 out of memory due to writing too many row groups within a single JVM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7685) Parquet memory manager

2014-12-30 Thread Brock Noland (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261295#comment-14261295 ]

Brock Noland commented on HIVE-7685:


+1



[jira] [Commented] (HIVE-7685) Parquet memory manager

2014-12-29 Thread Brock Noland (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260742#comment-14260742 ]

Brock Noland commented on HIVE-7685:


Thank you [~dongc]! I have not checked; are we sure that the values in 
HiveConf are correctly passed down to the Parquet writer?



[jira] [Commented] (HIVE-7685) Parquet memory manager

2014-12-29 Thread Dong Chen (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260763#comment-14260763 ]

Dong Chen commented on HIVE-7685:
-

Brock, thanks for your quick feedback! 
Yes, it is passed down, and I verified it in the Hive + Parquet integration env: 
1. checked the value in the log; 
2. inserted about 2 GB of data into 5 partitions; it works fine with this change 
and OOMs without it.

Since that check was done several days ago, I will double-check it today and 
report the result.



[jira] [Commented] (HIVE-7685) Parquet memory manager

2014-12-29 Thread Hive QA (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260822#comment-14260822 ]

Hive QA commented on HIVE-7685:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12689425/HIVE-7685.1.patch

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 6723 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_optimize_nullscan
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_list_bucket_dml_10
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2217/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2217/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2217/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12689425 - PreCommit-HIVE-TRUNK-Build



[jira] [Commented] (HIVE-7685) Parquet memory manager

2014-12-09 Thread Hive QA (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239367#comment-14239367 ]

Hive QA commented on HIVE-7685:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12685972/HIVE-7685.patch

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 6699 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_optimize_nullscan
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_ql_rewrite_gbtoidx_cbo_1
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2006/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2006/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2006/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12685972 - PreCommit-HIVE-TRUNK-Build



[jira] [Commented] (HIVE-7685) Parquet memory manager

2014-11-11 Thread Ferdinand Xu (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207758#comment-14207758 ]

Ferdinand Xu commented on HIVE-7685:


Hi,
I am afraid the .ready file name extension may not trigger the Hive QA CI 
test. Better to rename it to *.patch.



[jira] [Commented] (HIVE-7685) Parquet memory manager

2014-11-11 Thread Dong Chen (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207765#comment-14207765 ]

Dong Chen commented on HIVE-7685:
-

Thanks for the reminder. :)
Since this patch cannot build right now because it depends on PARQUET-108 
being resolved, I will rename it to trigger the test later.



[jira] [Commented] (HIVE-7685) Parquet memory manager

2014-09-29 Thread Dong Chen (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151496#comment-14151496 ]

Dong Chen commented on HIVE-7685:
-

Hi Brock,

I think a brief design for this memory manager is:
Every new writer registers itself with the manager, so the manager has an overall 
view of all the writers. When a trigger condition is met (such as every 1000 
rows), it notifies the writers to check memory usage and flush if necessary.
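A minimal Java sketch of that registration/check/flush cycle (class and method names here are hypothetical, not the actual Hive or Parquet types):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the design described above: writers register with a
// shared manager, which checks total buffered memory every N rows and asks
// writers to flush a row group when the budget is exceeded.
class MemoryManagerSketch {
    interface FlushableWriter {
        long bufferedBytes();
        void flushRowGroup();
    }

    private final List<FlushableWriter> writers = new ArrayList<>();
    private final long budgetBytes;
    private final long checkEveryRows;
    private long rowsSinceCheck = 0;

    MemoryManagerSketch(long budgetBytes, long checkEveryRows) {
        this.budgetBytes = budgetBytes;
        this.checkEveryRows = checkEveryRows;
    }

    void register(FlushableWriter w) { writers.add(w); }

    // Writers call this once per row; the manager checks memory periodically.
    void onRowWritten() {
        if (++rowsSinceCheck >= checkEveryRows) {
            rowsSinceCheck = 0;
            long total = 0;
            for (FlushableWriter w : writers) total += w.bufferedBytes();
            if (total > budgetBytes) {
                for (FlushableWriter w : writers) w.flushRowGroup();
            }
        }
    }
}
```

A real implementation might scale each writer's row group size proportionally rather than flushing everyone at once, but the register/check/flush cycle above is the core of the scheme.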

However, a Parquet-specific problem is that Hive only has a wrapper around 
ParquetRecordWriter, and ParquetRecordWriter in turn wraps the real writer 
(InternalParquetRecordWriter) in the Parquet project. Since measuring the 
dynamic buffer size and flushing are private to the real writer, I think we 
also have to add code to InternalParquetRecordWriter to implement the memory 
manager functionality.

It seems changing only Hive code cannot fix this JIRA. 
I am not sure whether we should move this problem to the Parquet project and 
fix it there, since it may be generic enough and not Hive-specific.

Any other ideas?

Best Regards,
Dong



[jira] [Commented] (HIVE-7685) Parquet memory manager

2014-09-29 Thread Brock Noland (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151792#comment-14151792 ]

Brock Noland commented on HIVE-7685:


Hi Dong,

Ok, thank you for the investigation. I think we can either put the Parquet 
memory manager in Parquet, or add APIs to expose the information required to 
implement the memory manager in Hive. Either approach is fine by me; we can 
take this work up in PARQUET-108.

Brock



[jira] [Commented] (HIVE-7685) Parquet memory manager

2014-09-29 Thread Dong Chen (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152708#comment-14152708 ]

Dong Chen commented on HIVE-7685:
-

Sure, I will take PARQUET-108 and put the manager in Parquet.
