lukas, what is the jira ticket for this? i would like to follow its activity. thanks! koert
On Wed, Jul 25, 2018 at 5:32 PM, lukas nalezenec <lu...@apache.org> wrote:
> Hi,
> Yes, this feature is planned - Spark should soon be able to repartition
> output by size.
> Lukas
>
> On Wed, Jul 25, 2018 at 23:26, Forest Fang <forest.f...@outlook.com> wrote:
>> Has there been any discussion to simply support Hive's merge-small-files
>> configuration? It simply adds one additional stage that inspects the size
>> of each output file, recomputes the desired parallelism to reach a target
>> size, and runs a map-only coalesce before committing the final files.
>> Since AFAIK Spark SQL already stages the final output commit, it seems
>> feasible to respect this Hive config.
>>
>> https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
>>
>> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra <m...@clearstorydata.com> wrote:
>>> See some of the related discussion under
>>> https://github.com/apache/spark/pull/21589
>>>
>>> It feels to me like we need some kind of user-code mechanism to signal
>>> policy preferences to Spark. This could also include ways to signal
>>> scheduling policy, which could include things like scheduling pool and/or
>>> barrier scheduling. Some of those scheduling policies operate at
>>> inherently different levels currently -- e.g. scheduling pools at the Job
>>> level (really, the thread-local level in the current implementation) and
>>> barrier scheduling at the Stage level -- so it is not completely obvious
>>> how to unify all of these policy options/preferences/mechanisms, or
>>> whether it is possible, but I think it is worth considering such things
>>> at a fairly high level of abstraction and trying to unify and simplify
>>> before making things more complex with multiple policy mechanisms.
>>>
>>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin <r...@databricks.com> wrote:
>>>> Seems like a good idea in general. Do other systems have similar
>>>> concepts? In general it'd be easier if we can follow an existing
>>>> convention, if there is any.
>>>>
>>>> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge <jzh...@apache.org> wrote:
>>>>> Hi all,
>>>>>
>>>>> Many Spark users in my company are asking for a way to control the
>>>>> number of output files in Spark SQL. There are use cases to either
>>>>> reduce or increase the number. The users prefer not to use the
>>>>> functions *repartition*(n) or *coalesce*(n, shuffle), which require
>>>>> them to write and deploy Scala/Java/Python code.
>>>>>
>>>>> Could we introduce a query hint for this purpose (similar to Broadcast
>>>>> Join Hints)?
>>>>>
>>>>> /*+ *COALESCE*(n, shuffle) */
>>>>>
>>>>> In general, is a query hint the best way to bring DataFrame
>>>>> functionality to SQL without extending SQL syntax? Any suggestion is
>>>>> highly appreciated.
>>>>>
>>>>> This requirement is not the same as SPARK-6221, which asked for
>>>>> auto-merging of output files.
>>>>>
>>>>> Thanks,
>>>>> John Zhuge
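[Editor's note: Forest Fang's description of Hive's merge-small-files behavior boils down to a simple size-based computation: sum the observed output file sizes and divide by the target file size to get the coalesce parallelism. A minimal Python sketch of that computation follows; the function name and signature are illustrative, not Hive's or Spark's actual code.]

```python
import math

def target_partition_count(file_sizes_bytes, target_file_bytes):
    """Given the sizes of the files an output stage produced, compute the
    number of partitions for a map-only coalesce so that each resulting
    file lands near the target size. Always returns at least 1 so a tiny
    (or empty) output still produces one file."""
    total = sum(file_sizes_bytes)
    return max(1, math.ceil(total / target_file_bytes))

# e.g. 100 files of 10 MiB with a 128 MiB target -> coalesce to 8 partitions
print(target_partition_count([10 * 2**20] * 100, 128 * 2**20))
```

The extra inspection stage Forest describes would gather `file_sizes_bytes` from the staged (not yet committed) output, then rewrite it with a coalesce to this count before the final commit.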
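[Editor's note: to make John's proposal concrete, here is a hypothetical helper that prepends the proposed hint to a SQL string. The `/*+ COALESCE(n, shuffle) */` syntax is exactly what the thread proposes; the helper itself, its name, and its parameters are my own illustration, not part of Spark at the time of this thread.]

```python
def with_coalesce_hint(query, n, shuffle=False):
    """Prepend the COALESCE hint proposed in this thread to a SQL query.
    With shuffle=False this mirrors DataFrame coalesce(n); with
    shuffle=True it mirrors a shuffling repartition to n partitions."""
    flag = ", shuffle" if shuffle else ""
    return "/*+ COALESCE({}{}) */ {}".format(n, flag, query)

# e.g. "/*+ COALESCE(10) */ SELECT * FROM t"
print(with_coalesce_hint("SELECT * FROM t", 10))
```

The appeal for SQL-only users is that the hint rides inside an ordinary comment, so no Scala/Java/Python code needs to be written or deployed, and engines that do not recognize the hint can ignore it.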