i went through the jiras targeting 2.4.0 trying to find a feature where spark would coalesce/repartition by size (so merge small files automatically), but didn't find it. can someone point me to it?

thank you.
best, koert
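In the meantime, the size-based coalesce koert is asking about can be approximated by hand: estimate the total output size, derive a partition count from a target file size, and coalesce to that count before writing. A minimal sketch of the arithmetic (the helper name and the 128 MB target are illustrative, not a Spark API; `total_bytes` would come from summing the input's file sizes, e.g. via the Hadoop FileSystem API):

```python
import math

def partitions_for_size(total_bytes, target_bytes=128 * 1024 * 1024):
    """Pick a partition count so each output file lands near target_bytes.

    Illustrative helper, not part of Spark: 128 MB is an arbitrary
    target chosen to match a common HDFS block size.
    """
    return max(1, math.ceil(total_bytes / target_bytes))

# e.g. ~10 GB of data at a 128 MB target -> 80 output files, then:
# df.coalesce(partitions_for_size(total_bytes)).write.parquet(path)
```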
On Sun, Aug 5, 2018 at 9:06 PM, Koert Kuipers <ko...@tresata.com> wrote:

> lukas,
> what is the jira ticket for this? i would like to follow its activity.
> thanks!
> koert
>
> On Wed, Jul 25, 2018 at 5:32 PM, lukas nalezenec <lu...@apache.org> wrote:
>
>> Hi,
>> Yes, this feature is planned - Spark should soon be able to repartition
>> output by size.
>> Lukas
>>
>> On Wed, Jul 25, 2018 at 11:26 PM, Forest Fang <forest.f...@outlook.com>
>> wrote:
>>
>>> Has there been any discussion of simply supporting Hive's merge-small-files
>>> configuration? It adds one additional stage that inspects the size of each
>>> output file, recomputes the desired parallelism to reach a target size, and
>>> runs a map-only coalesce before committing the final files. Since AFAIK
>>> Spark SQL already stages the final output commit, it seems feasible to
>>> respect this Hive config.
>>>
>>> https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
>>>
>>> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra <m...@clearstorydata.com>
>>> wrote:
>>>
>>>> See some of the related discussion under
>>>> https://github.com/apache/spark/pull/21589
>>>>
>>>> It feels to me like we need some kind of user-code mechanism for
>>>> signaling policy preferences to Spark. This could also include ways to
>>>> signal scheduling policy, such as scheduling pools and/or barrier
>>>> scheduling. Some of those scheduling policies operate at inherently
>>>> different levels today -- e.g. scheduling pools at the Job level
>>>> (really, the thread-local level in the current implementation) and
>>>> barrier scheduling at the Stage level -- so it is not completely obvious
>>>> how to unify all of these policy options/preferences/mechanisms, or
>>>> whether it is even possible. But I think it is worth considering such
>>>> things at a fairly high level of abstraction, and trying to unify and
>>>> simplify before making things more complex with multiple policy
>>>> mechanisms.
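Forest's description of Hive's merge step boils down to a small decision: if the average output file is below a threshold, merge everything down to files of a target size. A simplified reconstruction of that decision (the config names it mirrors, hive.merge.smallfiles.avgsize and hive.merge.size.per.task, are real Hive settings, but the threshold values shown here are illustrative, not Hive's exact defaults):

```python
import math

# Illustrative thresholds, loosely mirroring Hive's merge settings.
AVG_SIZE_THRESHOLD = 16 * 1024 * 1024   # merge only if avg file is smaller
MERGE_TARGET = 256 * 1024 * 1024        # desired size of each merged file

def merge_plan(file_sizes):
    """Return the merged-file count, or None when no merge is needed.

    file_sizes: byte sizes of the files a stage just committed,
    as gathered by the extra file-inspection stage Forest describes.
    """
    total = sum(file_sizes)
    if total / len(file_sizes) >= AVG_SIZE_THRESHOLD:
        return None  # files are already big enough; skip the merge stage
    return max(1, math.ceil(total / MERGE_TARGET))
```

A map-only coalesce to `merge_plan(...)` partitions before the final commit would then produce the merged output.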
>>>>
>>>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>>> Seems like a good idea in general. Do other systems have similar
>>>>> concepts? In general it'd be easier if we can follow an existing
>>>>> convention, if there is one.
>>>>>
>>>>> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge <jzh...@apache.org> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Many Spark users in my company are asking for a way to control the
>>>>>> number of output files in Spark SQL. There are use cases to either
>>>>>> reduce or increase the number. The users prefer not to use
>>>>>> repartition(n) or coalesce(n, shuffle), which require them to write
>>>>>> and deploy Scala/Java/Python code.
>>>>>>
>>>>>> Could we introduce a query hint for this purpose (similar to the
>>>>>> broadcast join hints)?
>>>>>>
>>>>>> /*+ COALESCE(n, shuffle) */
>>>>>>
>>>>>> In general, is a query hint the best way to bring DataFrame
>>>>>> functionality to SQL without extending SQL syntax? Any suggestion is
>>>>>> highly appreciated.
>>>>>>
>>>>>> This requirement is not the same as SPARK-6221, which asked for
>>>>>> auto-merging of output files.
>>>>>>
>>>>>> Thanks,
>>>>>> John Zhuge
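To make John's proposed syntax concrete, here is a toy parser showing what a planner would extract from such a hint comment (illustrative only; a real implementation belongs in Spark's SQL parser, and the optional shuffle argument is omitted for brevity):

```python
import re

# Toy recognizer for the proposed hint syntax, e.g.
#   SELECT /*+ COALESCE(5) */ * FROM events
HINT_RE = re.compile(
    r"/\*\+\s*(COALESCE|REPARTITION)\s*\(\s*(\d+)\s*\)\s*\*/",
    re.IGNORECASE,
)

def extract_hint(sql):
    """Return (hint_name, n) if the query carries a coalesce/repartition
    hint, else None."""
    m = HINT_RE.search(sql)
    return (m.group(1).upper(), int(m.group(2))) if m else None
```

Queries without the hint fall through untouched, which is the appeal of the hint approach: no new SQL syntax, and existing queries keep working.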