[
https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
zenglinxi updated SPARK-14974:
------------------------------
Description:
Recently, we have often run into problems when using Spark SQL to insert data
into a partitioned Hive table (e.g.: insert overwrite table $output_table
partition(dt) select xxx from tmp_table).
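For context, a minimal sketch of the kind of job we run (assuming Spark 1.5.x
with a HiveContext; the table and column names are placeholders, not our real
schema):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("dynamic-partition-insert"))
    val hiveContext = new HiveContext(sc)

    // Enable dynamic partitioning, then run the insert described above.
    // Each shuffle partition can write one file into every Hive partition
    // it touches, which is where the file explosion comes from.
    hiveContext.sql("SET hive.exec.dynamic.partition=true")
    hiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    hiveContext.sql(
      "INSERT OVERWRITE TABLE output_table PARTITION (dt) " +
      "SELECT col1, col2, dt FROM tmp_table")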
After the Spark job starts running on YARN, the application can create a huge
number of files (e.g. 2,000,000, or even 10,000,000), which puts HDFS under
enormous pressure.
We found that the number of files created by the Spark job depends on the
number of partitions of the Hive table being written and the number of Spark
SQL shuffle partitions:
files_num = hive_table_partitions_num * spark_sql_partitions_num
We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >= 1000,
and hive_table_partitions_num is usually small under normal circumstances, but
it can exceed 2000 when we accidentally use a wrong field as the partition
field, which makes files_num >= 1000 * 2000 = 2,000,000.
Hive has a configuration parameter, hive.exec.max.dynamic.partitions.pernode,
that limits the maximum number of dynamic partitions allowed to be created by
each mapper/reducer, but this parameter does not take effect when we use
hiveContext.
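For example, setting the limit through the HiveContext looks roughly like this
(a sketch of what we tried; it did not cap the number of files written):

    // Attempted to apply Hive's per-node dynamic-partition limit via the
    // HiveContext; the insert above still created far more files per task.
    hiveContext.setConf("hive.exec.max.dynamic.partitions.pernode", "100")
    // Equivalent attempt through SQL:
    hiveContext.sql("SET hive.exec.max.dynamic.partitions.pernode=100")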
Reducing spark_sql_partitions_num (spark.sql.shuffle.partitions) makes
files_num smaller, but it also reduces concurrency.
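The only knob available to us today is roughly the following (a sketch; 200 is
just an illustrative value):

    // Lowering the shuffle partition count shrinks
    // files_num = hive_table_partitions_num * spark_sql_partitions_num,
    // but it also lowers the parallelism of the shuffle stage.
    hiveContext.setConf("spark.sql.shuffle.partitions", "200")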
Could we add a configuration parameter that limits the maximum number of files
each task is allowed to create, or that limits spark_sql_partitions_num without
affecting concurrency?
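Purely as an illustration of the request, such a cap might be set like this
(the parameter name below is hypothetical and does not exist today):

    // Hypothetical setting (not an existing Spark conf): cap how many output
    // files a single task may create during a dynamic-partition insert.
    hiveContext.setConf("spark.sql.insert.maxCreatedFilesPerTask", "100")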
> Spark SQL job creates too many files in HDFS when doing insert overwrite into
> a Hive table
> ----------------------------------------------------------------------------------
>
> Key: SPARK-14974
> URL: https://issues.apache.org/jira/browse/SPARK-14974
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 1.5.2
> Reporter: zenglinxi
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]