Hi,

Yes, this feature is planned: Spark should soon be able to repartition output by size.

Lukas
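The size-based repartitioning mentioned above (and the Hive merge-small-files behavior described further down in the thread) boils down to one piece of arithmetic: pick a partition count from the estimated total output size and a target file size. A minimal sketch, with an assumed function name and an assumed 128 MiB default target:

```python
import math

def target_partitions(total_output_bytes: int,
                      target_file_bytes: int = 128 * 1024 * 1024) -> int:
    """Pick a partition count so each output file is roughly target_file_bytes.

    Illustrative only: the name and the 128 MiB default are assumptions,
    not an actual Spark or Hive API.
    """
    # Round up so files stay at or under the target; never return fewer
    # than one partition.
    return max(1, math.ceil(total_output_bytes / target_file_bytes))

# e.g. ~10 GiB of output at a 128 MiB target -> 80 output files
print(target_partitions(10 * 1024**3))  # 80
```

A size-aware implementation would feed this count into a map-only coalesce before the final commit, which is exactly the extra stage Forest describes below.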
On Wed, Jul 25, 2018 at 11:26 PM Forest Fang <forest.f...@outlook.com> wrote:

> Has there been any discussion to simply support Hive's merge-small-files
> configuration? It simply adds one additional stage that inspects the size
> of each output file, recomputes the desired parallelism to reach a target
> size, and runs a map-only coalesce before committing the final files.
> Since, AFAIK, Spark SQL already stages the final output commit, it seems
> feasible to respect this Hive config.
>
> https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
>
> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra <m...@clearstorydata.com> wrote:
>
>> See some of the related discussion under
>> https://github.com/apache/spark/pull/21589
>>
>> It feels to me like we need some kind of user-code mechanism to signal
>> policy preferences to Spark. This could also include ways to signal
>> scheduling policy, which could include things like scheduling pools and/or
>> barrier scheduling. Some of those scheduling policies currently operate at
>> inherently different levels -- e.g. scheduling pools at the Job level
>> (really, the thread-local level in the current implementation) and barrier
>> scheduling at the Stage level -- so it is not completely obvious how to
>> unify all of these policy options/preferences/mechanisms, or whether it is
>> even possible. But I think it is worth considering such things at a fairly
>> high level of abstraction, and trying to unify and simplify before making
>> things more complex with multiple policy mechanisms.
>>
>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin <r...@databricks.com> wrote:
>>
>>> Seems like a good idea in general. Do other systems have similar
>>> concepts? In general it'd be easier if we could follow an existing
>>> convention, if there is any.
>>>
>>> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge <jzh...@apache.org> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Many Spark users in my company are asking for a way to control the
>>>> number of output files in Spark SQL. There are use cases to either
>>>> reduce or increase the number. The users prefer not to use the
>>>> functions *repartition*(n) or *coalesce*(n, shuffle), which require
>>>> them to write and deploy Scala/Java/Python code.
>>>>
>>>> Could we introduce a query hint for this purpose (similar to the
>>>> Broadcast Join Hints)?
>>>>
>>>> /*+ *COALESCE*(n, shuffle) */
>>>>
>>>> In general, is a query hint the best way to bring DataFrame
>>>> functionality to SQL without extending SQL syntax? Any suggestion is
>>>> highly appreciated.
>>>>
>>>> This requirement is not the same as SPARK-6221, which asked for
>>>> auto-merging of output files.
>>>>
>>>> Thanks,
>>>> John Zhuge
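For context, a sketch of how the proposed hint might sit next to the broadcast join hints John references. The table names here are hypothetical, and the COALESCE hint syntax is only what this thread proposes, not an existing feature at the time of writing:

```sql
-- Existing broadcast join hint, for comparison:
SELECT /*+ BROADCAST(dim) */ *
FROM events JOIN dim ON events.key = dim.key;

-- Proposed hint from this thread: coalesce the output to n files,
-- optionally with a shuffle (i.e. repartition) instead of a narrow coalesce.
INSERT OVERWRITE TABLE compacted
SELECT /*+ COALESCE(10, true) */ * FROM events;
```

This would let SQL-only users control output file counts without writing and deploying Scala/Java/Python code, which is the gap John describes.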