lukas, what is the jira ticket for this? i would like to follow its activity. thanks! koert
On Wed, Jul 25, 2018 at 5:32 PM, lukas nalezenec <lu...@apache.org> wrote:
> Hi,
> Yes, this feature is planned - Spark should soon be able to repartition
> output by size.
> Lukas
>
> On Wed, Jul 25, 2018 at 23:26, Forest Fang <forest.f...@outlook.com> wrote:
>> Has there been any discussion to simply support Hive's merge-small-files
>> configuration? It simply adds one additional stage that inspects the size
>> of each output file, recomputes the desired parallelism to reach a target
>> size, and runs a map-only coalesce before committing the final files.
>> Since AFAIK Spark SQL already stages the final output commit, it seems
>> feasible to respect this Hive config.
>>
>> https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
>>
>> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra <m...@clearstorydata.com> wrote:
>>> See some of the related discussion under
>>> https://github.com/apache/spark/pull/21589
>>>
>>> It feels to me like we need some kind of user-code mechanism to signal
>>> policy preferences to Spark. This could also include ways to signal
>>> scheduling policy, which could include things like scheduling pool and/or
>>> barrier scheduling. Some of those scheduling policies operate at
>>> inherently different levels currently -- e.g. scheduling pools at the Job
>>> level (really, the thread-local level in the current implementation) and
>>> barrier scheduling at the Stage level -- so it is not completely obvious
>>> how to unify all of these policy options/preferences/mechanisms, or
>>> whether it is possible, but I think it is worth considering such things
>>> at a fairly high level of abstraction and trying to unify and simplify
>>> before making things more complex with multiple policy mechanisms.
>>>
>>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin <r...@databricks.com> wrote:
>>>> Seems like a good idea in general. Do other systems have similar
>>>> concepts? In general it'd be easier if we can follow an existing
>>>> convention, if there is any.
>>>>
>>>> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge <jzh...@apache.org> wrote:
>>>>> Hi all,
>>>>>
>>>>> Many Spark users in my company are asking for a way to control the
>>>>> number of output files in Spark SQL. There are use cases to either
>>>>> reduce or increase the number. The users prefer not to use the
>>>>> functions *repartition*(n) or *coalesce*(n, shuffle), which require
>>>>> them to write and deploy Scala/Java/Python code.
>>>>>
>>>>> Could we introduce a query hint for this purpose (similar to Broadcast
>>>>> Join Hints)?
>>>>>
>>>>> /*+ *COALESCE*(n, shuffle) */
>>>>>
>>>>> In general, is a query hint the best way to bring DataFrame
>>>>> functionality to SQL without extending SQL syntax? Any suggestion is
>>>>> highly appreciated.
>>>>>
>>>>> This requirement is not the same as SPARK-6221, which asked for
>>>>> auto-merging of output files.
>>>>>
>>>>> Thanks,
>>>>> John Zhuge
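[Editor's note: Forest Fang's description of Hive's merge-small-files behavior boils down to a simple size-based computation: sum the observed output file sizes and divide by the target file size to get the coalesce parallelism. A minimal Python sketch of that computation follows; the function name and signature are illustrative, not Hive's or Spark's actual code.]

```python
import math

def target_partition_count(file_sizes_bytes, target_file_bytes):
    """Given the sizes of the files an output stage produced, compute the
    number of partitions for a map-only coalesce so that each resulting
    file lands near the target size. Always returns at least 1 so a tiny
    (or empty) output still produces one file."""
    total = sum(file_sizes_bytes)
    return max(1, math.ceil(total / target_file_bytes))

# e.g. 100 files of 10 MiB with a 128 MiB target -> coalesce to 8 partitions
print(target_partition_count([10 * 2**20] * 100, 128 * 2**20))
```

The extra inspection stage Forest describes would gather `file_sizes_bytes` from the staged (not yet committed) output, then rewrite it with a coalesce to this count before the final commit.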
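[Editor's note: to make John's proposal concrete, here is a hypothetical helper that prepends the proposed hint to a SQL string. The `/*+ COALESCE(n, shuffle) */` syntax is exactly what the thread proposes; the helper itself, its name, and its parameters are my own illustration, not part of Spark at the time of this thread.]

```python
def with_coalesce_hint(query, n, shuffle=False):
    """Prepend the COALESCE hint proposed in this thread to a SQL query.
    With shuffle=False this mirrors DataFrame coalesce(n); with
    shuffle=True it mirrors a shuffling repartition to n partitions."""
    flag = ", shuffle" if shuffle else ""
    return "/*+ COALESCE({}{}) */ {}".format(n, flag, query)

# e.g. "/*+ COALESCE(10) */ SELECT * FROM t"
print(with_coalesce_hint("SELECT * FROM t", 10))
```

The appeal for SQL-only users is that the hint rides inside an ordinary comment, so no Scala/Java/Python code needs to be written or deployed, and engines that do not recognize the hint can ignore it.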