[DISCUSS][SQL] Control the number of output files

John Zhuge Wed, 25 Jul 2018 11:50:35 -0700

Hi all,

Many Spark users at my company are asking for a way to control the number
of output files in Spark SQL. Some use cases call for reducing the number,
others for increasing it. The users prefer not to use the functions
*repartition*(n) or *coalesce*(n, shuffle), which require them to write and
deploy Scala/Java/Python code.


Could we introduce a query hint for this purpose (similar to Broadcast Join
Hints)?

    /*+ COALESCE(n, shuffle) */
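
To make the idea concrete, here is a minimal sketch of how a hinted query
might be assembled. It assumes the hint would sit right after SELECT, the
way Broadcast Join hints do, and that `shuffle` is an optional second
argument. The helper name `with_coalesce_hint`, the table `events`, and the
column `ds` are hypothetical and only there for illustration. The hint
itself is the proposal above, not something this code checks against Spark.

```python
import re


def with_coalesce_hint(query, n, shuffle=None):
    """Insert the proposed COALESCE hint after the leading SELECT keyword.

    Hypothetical helper: the hint name and its arguments follow the
    proposal in this message. Nothing here talks to Spark.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")

    # Render "n" alone, or "n, true/false" when shuffle is given.
    args = str(n) if shuffle is None else f"{n}, {str(bool(shuffle)).lower()}"
    hint = f"/*+ COALESCE({args}) */"

    # Only the leading SELECT is touched, so subqueries stay as written.
    hinted, count = re.subn(
        r"^(\s*SELECT)\b",
        lambda m: f"{m.group(1)} {hint}",
        query,
        count=1,
        flags=re.IGNORECASE,
    )
    if count == 0:
        raise ValueError("query must start with SELECT")
    return hinted


# Example query; the table and column names are made up.
print(with_coalesce_hint("SELECT * FROM events WHERE ds = '2018-07-25'", 10))
```

With such a hint, users would only edit the SQL text and would not need to
deploy any Scala/Java/Python code.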

In general, is a query hint the best way to bring DF functionality to SQL
without extending SQL syntax? Any suggestions are highly appreciated.

This requirement is not the same as SPARK-6221, which asked for auto-merging
output files.

Thanks,
John Zhuge
