[
https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
zenglinxi updated SPARK-14974:
------------------------------
Summary: spark sql job create too many files in HDFS when doing insert
overwrite hive table (was: spark sql insert hive table write too many files)
> spark sql job create too many files in HDFS when doing insert overwrite hive
> table
> ----------------------------------------------------------------------------------
>
> Key: SPARK-14974
> URL: https://issues.apache.org/jira/browse/SPARK-14974
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 1.5.2
> Reporter: zenglinxi
>
> Recently, we often encounter problems using spark sql for inserting data into
> a partition table (ex.: insert overwrite table $output_table partition(dt)
> select xxx from tmp_table).
> After the spark job start running on yarn, the app will create too many files
> (ex. 200w+, or even 1000w+), which will make HDFS under enormous pressure.
> We found that the num of files created by spark job is depending on the
> partition num of hive table that will be inserted and the num of spark sql
> partitions.
> files_num = hive_table_partions_num * spark_sql_partitions_num.
> We often make the spark_sql_partitions_num(spark.sql.shuffle.partitions) >=
> 1000, and the hive_table_partions_num is very small under normal
> circumstances, but it will turn out to be more than 2000 when we input a
> wrong field as the partion field unconsciously, which will make the files_num
> >= 1000 * 2000 = 200w.
> There is a configuration parameter in hive that can limit the maximum number
> of dynamic partitions allowed to be created in each mapper/reducer named
> hive.exec.max.dynamic.partitions.pernode, but this conf parameter did't work
> when we use hiveContext.
> Reducing spark_sql_partitions_num(spark.sql.shuffle.partitions) can make the
> files_num be smaller, but it will affect the concurrency.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]