i went through the jiras targeting 2.4.0 trying to find a feature where spark would coalesce/repartition by size (so merge small files automatically), but didn't find it. can someone point me to it?

thank you.
best, koert
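In the meantime, the size-based coalesce koert is asking about can be approximated by hand: estimate the total output size, derive a partition count from a target file size, and coalesce to that count before writing. A minimal sketch of the arithmetic (the helper name and the 128 MB target are illustrative, not a Spark API; `total_bytes` would come from summing the input's file sizes, e.g. via the Hadoop FileSystem API):

```python
import math

def partitions_for_size(total_bytes, target_bytes=128 * 1024 * 1024):
    """Pick a partition count so each output file lands near target_bytes.

    Illustrative helper, not part of Spark: 128 MB is an arbitrary
    target chosen to match a common HDFS block size.
    """
    return max(1, math.ceil(total_bytes / target_bytes))

# e.g. ~10 GB of data at a 128 MB target -> 80 output files, then:
# df.coalesce(partitions_for_size(total_bytes)).write.parquet(path)
```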
On Sun, Aug 5, 2018 at 9:06 PM, Koert Kuipers <ko...@tresata.com> wrote:

> lukas,
> what is the jira ticket for this? i would like to follow its activity.
> thanks!
> koert
>
> On Wed, Jul 25, 2018 at 5:32 PM, lukas nalezenec <lu...@apache.org> wrote:
>
>> Hi,
>> Yes, this feature is planned - Spark should soon be able to repartition
>> output by size.
>> Lukas
>>
>> On Wed, Jul 25, 2018 at 11:26 PM, Forest Fang <forest.f...@outlook.com>
>> wrote:
>>
>>> Has there been any discussion of simply supporting Hive's merge-small-files
>>> configuration? It adds one additional stage that inspects the size of each
>>> output file, recomputes the desired parallelism to reach a target size, and
>>> runs a map-only coalesce before committing the final files. Since AFAIK
>>> Spark SQL already stages the final output commit, it seems feasible to
>>> respect this Hive config.
>>>
>>> https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
>>>
>>> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra <m...@clearstorydata.com>
>>> wrote:
>>>
>>>> See some of the related discussion under
>>>> https://github.com/apache/spark/pull/21589
>>>>
>>>> It feels to me like we need some kind of user-code mechanism for
>>>> signaling policy preferences to Spark. This could also include ways to
>>>> signal scheduling policy, such as scheduling pools and/or barrier
>>>> scheduling. Some of those scheduling policies operate at inherently
>>>> different levels today -- e.g. scheduling pools at the Job level
>>>> (really, the thread-local level in the current implementation) and
>>>> barrier scheduling at the Stage level -- so it is not completely obvious
>>>> how to unify all of these policy options/preferences/mechanisms, or
>>>> whether it is even possible. But I think it is worth considering such
>>>> things at a fairly high level of abstraction, and trying to unify and
>>>> simplify before making things more complex with multiple policy
>>>> mechanisms.
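Forest's description of Hive's merge step boils down to a small decision: if the average output file is below a threshold, merge everything down to files of a target size. A simplified reconstruction of that decision (the config names it mirrors, hive.merge.smallfiles.avgsize and hive.merge.size.per.task, are real Hive settings, but the threshold values shown here are illustrative, not Hive's exact defaults):

```python
import math

# Illustrative thresholds, loosely mirroring Hive's merge settings.
AVG_SIZE_THRESHOLD = 16 * 1024 * 1024   # merge only if avg file is smaller
MERGE_TARGET = 256 * 1024 * 1024        # desired size of each merged file

def merge_plan(file_sizes):
    """Return the merged-file count, or None when no merge is needed.

    file_sizes: byte sizes of the files a stage just committed,
    as gathered by the extra file-inspection stage Forest describes.
    """
    total = sum(file_sizes)
    if total / len(file_sizes) >= AVG_SIZE_THRESHOLD:
        return None  # files are already big enough; skip the merge stage
    return max(1, math.ceil(total / MERGE_TARGET))
```

A map-only coalesce to `merge_plan(...)` partitions before the final commit would then produce the merged output.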
>>>>
>>>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>>> Seems like a good idea in general. Do other systems have similar
>>>>> concepts? In general it'd be easier if we can follow an existing
>>>>> convention, if there is one.
>>>>>
>>>>> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge <jzh...@apache.org> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Many Spark users in my company are asking for a way to control the
>>>>>> number of output files in Spark SQL. There are use cases to either
>>>>>> reduce or increase the number. The users prefer not to use
>>>>>> repartition(n) or coalesce(n, shuffle), which require them to write
>>>>>> and deploy Scala/Java/Python code.
>>>>>>
>>>>>> Could we introduce a query hint for this purpose (similar to the
>>>>>> broadcast join hints)?
>>>>>>
>>>>>> /*+ COALESCE(n, shuffle) */
>>>>>>
>>>>>> In general, is a query hint the best way to bring DataFrame
>>>>>> functionality to SQL without extending SQL syntax? Any suggestion is
>>>>>> highly appreciated.
>>>>>>
>>>>>> This requirement is not the same as SPARK-6221, which asked for
>>>>>> auto-merging of output files.
>>>>>>
>>>>>> Thanks,
>>>>>> John Zhuge
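To make John's proposed syntax concrete, here is a toy parser showing what a planner would extract from such a hint comment (illustrative only; a real implementation belongs in Spark's SQL parser, and the optional shuffle argument is omitted for brevity):

```python
import re

# Toy recognizer for the proposed hint syntax, e.g.
#   SELECT /*+ COALESCE(5) */ * FROM events
HINT_RE = re.compile(
    r"/\*\+\s*(COALESCE|REPARTITION)\s*\(\s*(\d+)\s*\)\s*\*/",
    re.IGNORECASE,
)

def extract_hint(sql):
    """Return (hint_name, n) if the query carries a coalesce/repartition
    hint, else None."""
    m = HINT_RE.search(sql)
    return (m.group(1).upper(), int(m.group(2))) if m else None
```

Queries without the hint fall through untouched, which is the appeal of the hint approach: no new SQL syntax, and existing queries keep working.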