See some of the related discussion under https://github.com/apache/spark/pull/21589
It feels to me like we need some kind of user-code mechanism to signal policy preferences to Spark. This could also include ways to signal scheduling policy, such as scheduling pool and/or barrier scheduling. Some of those scheduling policies currently operate at inherently different levels -- e.g. scheduling pools at the Job level (really, the thread-local level in the current implementation) and barrier scheduling at the Stage level -- so it is not completely obvious how to unify all of these policy options/preferences/mechanisms, or whether that is even possible. Still, I think it is worth considering such things at a fairly high level of abstraction, and trying to unify and simplify before making things more complex with multiple policy mechanisms.

On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin <r...@databricks.com> wrote:

> Seems like a good idea in general. Do other systems have similar concepts?
> In general it'd be easier if we can follow an existing convention if there
> is any.
>
>
> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge <jzh...@apache.org> wrote:
>
>> Hi all,
>>
>> Many Spark users in my company are asking for a way to control the number
>> of output files in Spark SQL. There are use cases to either reduce or
>> increase the number. The users prefer not to use the functions
>> *repartition*(n) or *coalesce*(n, shuffle), which require them to write
>> and deploy Scala/Java/Python code.
>>
>> Could we introduce a query hint for this purpose (similar to Broadcast
>> Join Hints)?
>>
>> /*+ *COALESCE*(n, shuffle) */
>>
>> In general, is a query hint the best way to bring DataFrame functionality
>> to SQL without extending SQL syntax? Any suggestion is highly appreciated.
>>
>> This requirement is not the same as SPARK-6221, which asked for
>> auto-merging output files.
>>
>> Thanks,
>> John Zhuge
>>
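
[Editorial sketch of the two approaches being compared in this thread. The DataFrame calls are the existing Spark API; the SQL hint syntax is only the proposal under discussion, not a shipped feature at the time of the thread, and the hint name, arguments, and table names here are illustrative.]

```sql
-- Existing approach: control output file count from deployed code
-- (Scala shown in comments; Python/Java are analogous):
--   df.coalesce(10).write.parquet("/out")       -- reduce file count, no shuffle
--   df.repartition(200).write.parquet("/out")   -- change file count, full shuffle

-- Proposed approach: express the same preference directly in SQL via a
-- query hint, modeled on the existing broadcast-join hints:
SELECT /*+ COALESCE(10) */ *
FROM sales
WHERE year = 2018;
```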