[
https://issues.apache.org/jira/browse/HIVE-10149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387501#comment-14387501
]
Sergio Peña commented on HIVE-10149:
------------------------------------
HIVE-7685 is adding a Parquet memory manager to avoid this issue, but the OOM
still happens on the current 1.2.0 (devel) branch.
> Shuffle Hive data before storing in Parquet
> -------------------------------------------
>
> Key: HIVE-10149
> URL: https://issues.apache.org/jira/browse/HIVE-10149
> Project: Hive
> Issue Type: Improvement
> Affects Versions: 1.1.0
> Reporter: Sergio Peña
> Attachments: data.txt
>
>
> Hive can run into OOM (Out Of Memory) exceptions when writing many dynamic
> partitions to parquet because it creates too many open files at once and
> Parquet buffers an entire row group of data in memory for each open file. To
> avoid this in ORC, HIVE-6455 shuffles data for each partition so only one
> file is open at a time. We need to extend this support to Parquet and
> possibly the MR and Spark planners.
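> A possible stopgap, sketched here under the assumption that the HIVE-6455
> optimization is not ORC-specific, is to enable sorted dynamic partitioning so
> rows arrive at each writer grouped by partition key and only one file stays
> open at a time:
> {code}
> hive> -- assumed workaround, not yet verified for Parquet:
> hive> set hive.optimize.sort.dynamic.partition=true;
> {code}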
> Steps to reproduce:
> 1. Create a staging table and load data that contains very many partitions
> (file {{data.txt}} attached to this ticket).
> {code}
> hive> create table t1_stage(id bigint, rdate string) row format delimited
> fields terminated by ' ';
> hive> load data local inpath 'data.txt' into table t1_stage;
> {code}
> 2. Create a Parquet table, and insert partitioned data from the t1_stage
> table.
> {noformat}
> hive> set hive.exec.dynamic.partition.mode=nonstrict;
> hive> create table t1_part(id bigint) partitioned by (rdate string) stored as
> parquet;
> hive> insert overwrite table t1_part partition(rdate) select * from t1_stage;
> Query ID = sergio_20150330163713_db3afe74-d1c7-4f0d-a8f1-f2137ddb64a4
> Total jobs = 3
> Launching Job 1 out of 3
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_1427748520315_0006, Tracking URL =
> http://victory:8088/proxy/application_1427748520315_0006/
> Kill Command = /opt/local/hadoop/bin/hadoop job -kill job_1427748520315_0006
> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
> 2015-03-30 16:37:19,065 Stage-1 map = 0%, reduce = 0%
> 2015-03-30 16:37:43,947 Stage-1 map = 100%, reduce = 0%
> Ended Job = job_1427748520315_0006 with errors
> Error during job, obtaining debugging information...
> Examining task ID: task_1427748520315_0006_m_000000 (and more) from job
> job_1427748520315_0006
> Task with the most failures(4):
> -----
> Task ID:
> task_1427748520315_0006_m_000000
> URL:
>
> http://0.0.0.0:8088/taskdetails.jsp?jobid=job_1427748520315_0006&tipid=task_1427748520315_0006_m_000000
> -----
> Diagnostic Messages for this Task:
> Error: Java heap space
> FAILED: Execution Error, return code 2 from
> org.apache.hadoop.hive.ql.exec.mr.MapRedTask
> MapReduce Jobs Launched:
> Stage-Stage-1: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
> Total MapReduce CPU Time Spent: 0 msec
> {noformat}
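> Until the shuffle support lands, another hedged mitigation is to shrink the
> Parquet row group size, since a full row group is buffered in memory per open
> file; this trades write efficiency for a smaller heap footprint. The 8 MB
> value below is illustrative, not a recommendation:
> {code}
> hive> -- parquet.block.size is the row-group size in bytes (parquet-mr default: 128 MB)
> hive> set parquet.block.size=8388608;
> hive> insert overwrite table t1_part partition(rdate) select * from t1_stage;
> {code}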
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)