[jira] [Comment Edited] (SPARK-14974) spark sql job create too many files in HDFS when doing insert overwrite hive table

Yiting Shan (JIRA) Wed, 01 Nov 2017 17:57:26 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235027#comment-16235027
 ]


Yiting Shan edited comment on SPARK-14974 at 11/2/17 12:56 AM:
---------------------------------------------------------------

I am seeing similar issue. Insert overwrite to dynamic partitioned hive table 
through spark SQL is creating tons of small files which is extremely slow. 
Would echo to make a fix to this issue.

The SQL we are using is like below which cannot finish in 12 hours and 
generated > 10,000 small intermediate partition folders

{code:sql}
insert overwrite table mydb.final_table
partition
(p_consumerid)
select
consumerid STRING,
p_consumerid INT
from tmp_table
{code}

But using DataFrame API to overwrite Hive table can complete in half an hour 
and should only generate 16 partitions in the end:

{code:scala}
df.write
          .format("parquet")
          .option("compression", "snappy")
          .mode(SaveMode.Overwrite)
          .saveAsTable("final_table")
{code}

PS. Spark version I am using is 2.1.1


was (Author: ussraf):
I am seeing similar issue. Insert overwrite to dynamic partitioned hive table 
through spark SQL is creating tons of small files which is extremely slow. 
Would echo to make a fix to this issue.

The SQL we are using is like below which cannot finish in 12 hours and 
generated > 10,000 small intermediate partition folders

{code:sql}
insert overwrite table mydb.final_table
partition
(p_consumerid)
select
consumerid STRING,
p_consumerid INT
from tmp_table
{code}

But using DataFrame API to overwrite Hive table can complete in half an hour 
and should only generate 16 partitions in the end:

{code:scala}
df.write
          .format("parquet")
          .option("compression", "snappy")
          .mode(SaveMode.Overwrite)
          .saveAsTable("final_table")
{code}


> spark sql job create too many files in HDFS when doing insert overwrite hive 
> table
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-14974
>                 URL: https://issues.apache.org/jira/browse/SPARK-14974
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: zenglinxi
>            Priority: Minor
>
> Recently, we often encounter problems using spark sql for inserting data into 
> a partition table (ex.: insert overwrite table $output_table partition(dt) 
> select xxx from tmp_table).  
> After the spark job start running on yarn, the app will create too many files 
> (ex. 2,000,000, or even 10,000,000), which will make HDFS under enormous 
> pressure.
> We found that the num of files created by spark job is depending on the 
> partition num of hive table that will be inserted and the num of spark sql 
> partitions. 
> files_num = hive_table_partions_num *  spark_sql_partitions_num.
> We often make the spark_sql_partitions_num(spark.sql.shuffle.partitions) >= 
> 1000, and the hive_table_partions_num is very small under normal 
> circumstances, but it will turn out to be more than 2000 when we input a 
> wrong field as the partion field unconsciously, which will make the files_num 
> >= 1000 * 2000 = 2,000,000.
> There is a configuration parameter in hive that can limit the maximum number 
> of dynamic partitions allowed to be created in each mapper/reducer named 
> hive.exec.max.dynamic.partitions.pernode, but this conf parameter did't work 
> when we use hiveContext.
> Reducing spark_sql_partitions_num(spark.sql.shuffle.partitions) can make the 
> files_num be smaller, but it will affect the concurrency.
> Can we create configuration parameters to  limit the maximum number of files 
> allowed to be create by each task or limit the spark_sql_partitions_num 
> without affect the concurrency?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (SPARK-14974) spark sql job create too many files in HDFS when doing insert overwrite hive table

Reply via email to