[
https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
zenglinxi updated SPARK-14974:
------------------------------
Description:
Recently, we have often run into problems when using Spark SQL to insert data
into a partitioned Hive table (e.g.: insert overwrite table $output_table
partition(dt) select xxx from tmp_table).
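For context, a minimal sketch of the kind of job we run (assuming Spark 1.5.x
with a HiveContext; the table and column names are placeholders, not our real
schema):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("dynamic-partition-insert"))
    val hiveContext = new HiveContext(sc)

    // Enable dynamic partitioning, then run the insert described above.
    // Each shuffle partition can write one file into every Hive partition
    // it touches, which is where the file explosion comes from.
    hiveContext.sql("SET hive.exec.dynamic.partition=true")
    hiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    hiveContext.sql(
      "INSERT OVERWRITE TABLE output_table PARTITION (dt) " +
      "SELECT col1, col2, dt FROM tmp_table")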
After the Spark job starts running on YARN, the application can create a huge
number of files (e.g. 2,000,000, or even 10,000,000), which puts HDFS under
enormous pressure.
We found that the number of files created by the Spark job depends on the
number of partitions of the Hive table being written and the number of Spark
SQL shuffle partitions:
files_num = hive_table_partitions_num * spark_sql_partitions_num
We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >= 1000,
and hive_table_partitions_num is usually small under normal circumstances, but
it can exceed 2000 when we accidentally use a wrong field as the partition
field, which makes files_num >= 1000 * 2000 = 2,000,000.
Hive has a configuration parameter, hive.exec.max.dynamic.partitions.pernode,
that limits the maximum number of dynamic partitions allowed to be created by
each mapper/reducer, but this parameter does not take effect when we use
hiveContext.
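For example, setting the limit through the HiveContext looks roughly like this
(a sketch of what we tried; it did not cap the number of files written):

    // Attempted to apply Hive's per-node dynamic-partition limit via the
    // HiveContext; the insert above still created far more files per task.
    hiveContext.setConf("hive.exec.max.dynamic.partitions.pernode", "100")
    // Equivalent attempt through SQL:
    hiveContext.sql("SET hive.exec.max.dynamic.partitions.pernode=100")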
Reducing spark_sql_partitions_num (spark.sql.shuffle.partitions) makes
files_num smaller, but it also reduces concurrency.
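The only knob available to us today is roughly the following (a sketch; 200 is
just an illustrative value):

    // Lowering the shuffle partition count shrinks
    // files_num = hive_table_partitions_num * spark_sql_partitions_num,
    // but it also lowers the parallelism of the shuffle stage.
    hiveContext.setConf("spark.sql.shuffle.partitions", "200")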
Could we add a configuration parameter that limits the maximum number of files
each task is allowed to create, or that limits spark_sql_partitions_num without
affecting concurrency?
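Purely as an illustration of the request, such a cap might be set like this
(the parameter name below is hypothetical and does not exist today):

    // Hypothetical setting (not an existing Spark conf): cap how many output
    // files a single task may create during a dynamic-partition insert.
    hiveContext.setConf("spark.sql.insert.maxCreatedFilesPerTask", "100")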
> Spark SQL job creates too many files in HDFS when doing insert overwrite into
> a Hive table
> ----------------------------------------------------------------------------------
>
> Key: SPARK-14974
> URL: https://issues.apache.org/jira/browse/SPARK-14974
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 1.5.2
> Reporter: zenglinxi
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]