[
https://issues.apache.org/jira/browse/HIVE-10149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387501#comment-14387501
]
Sergio Peña commented on HIVE-10149:
------------------------------------
HIVE-7685 is adding a Parquet memory manager to avoid this issue, but the OOM
still happens on the current 1.2.0 (devel) branch.
> Shuffle Hive data before storing in Parquet
> -------------------------------------------
>
> Key: HIVE-10149
> URL: https://issues.apache.org/jira/browse/HIVE-10149
> Project: Hive
> Issue Type: Improvement
> Affects Versions: 1.1.0
> Reporter: Sergio Peña
> Attachments: data.txt
>
>
> Hive can run into OOM (Out Of Memory) exceptions when writing many dynamic
> partitions to parquet because it creates too many open files at once and
> Parquet buffers an entire row group of data in memory for each open file. To
> avoid this in ORC, HIVE-6455 shuffles data for each partition so only one
> file is open at a time. We need to extend this support to Parquet and
> possibly the MR and Spark planners.
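> A possible stopgap, sketched here under the assumption that the HIVE-6455
> optimization is not ORC-specific, is to enable sorted dynamic partitioning so
> rows arrive at each writer grouped by partition key and only one file stays
> open at a time:
> {code}
> hive> -- assumed workaround, not yet verified for Parquet:
> hive> set hive.optimize.sort.dynamic.partition=true;
> {code}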
> Steps to reproduce:
> 1. Create a staging table and load data that contains very many partitions
> (file {{data.txt}} attached to this ticket).
> {code}
> hive> create table t1_stage(id bigint, rdate string) row format delimited
> fields terminated by ' ';
> hive> load data local inpath 'data.txt' into table t1_stage;
> {code}
> 2. Create a Parquet table, and insert partitioned data from the t1_stage
> table.
> {noformat}
> hive> set hive.exec.dynamic.partition.mode=nonstrict;
> hive> create table t1_part(id bigint) partitioned by (rdate string) stored as
> parquet;
> hive> insert overwrite table t1_part partition(rdate) select * from t1_stage;
> Query ID = sergio_20150330163713_db3afe74-d1c7-4f0d-a8f1-f2137ddb64a4
> Total jobs = 3
> Launching Job 1 out of 3
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_1427748520315_0006, Tracking URL =
> http://victory:8088/proxy/application_1427748520315_0006/
> Kill Command = /opt/local/hadoop/bin/hadoop job -kill job_1427748520315_0006
> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
> 2015-03-30 16:37:19,065 Stage-1 map = 0%, reduce = 0%
> 2015-03-30 16:37:43,947 Stage-1 map = 100%, reduce = 0%
> Ended Job = job_1427748520315_0006 with errors
> Error during job, obtaining debugging information...
> Examining task ID: task_1427748520315_0006_m_000000 (and more) from job
> job_1427748520315_0006
> Task with the most failures(4):
> -----
> Task ID:
> task_1427748520315_0006_m_000000
> URL:
>
> http://0.0.0.0:8088/taskdetails.jsp?jobid=job_1427748520315_0006&tipid=task_1427748520315_0006_m_000000
> -----
> Diagnostic Messages for this Task:
> Error: Java heap space
> FAILED: Execution Error, return code 2 from
> org.apache.hadoop.hive.ql.exec.mr.MapRedTask
> MapReduce Jobs Launched:
> Stage-Stage-1: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
> Total MapReduce CPU Time Spent: 0 msec
> {noformat}
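> Until the shuffle support lands, another hedged mitigation is to shrink the
> Parquet row group size, since a full row group is buffered in memory per open
> file; this trades write efficiency for a smaller heap footprint. The 8 MB
> value below is illustrative, not a recommendation:
> {code}
> hive> -- parquet.block.size is the row-group size in bytes (parquet-mr default: 128 MB)
> hive> set parquet.block.size=8388608;
> hive> insert overwrite table t1_part partition(rdate) select * from t1_stage;
> {code}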
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)