Hi,

Yes, this feature is planned: Spark should soon be able to repartition output by size.

Lukas
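The size-based repartitioning mentioned above (and the Hive merge-small-files behavior described further down in the thread) boils down to one piece of arithmetic: pick a partition count from the estimated total output size and a target file size. A minimal sketch, with an assumed function name and an assumed 128 MiB default target:

```python
import math

def target_partitions(total_output_bytes: int,
                      target_file_bytes: int = 128 * 1024 * 1024) -> int:
    """Pick a partition count so each output file is roughly target_file_bytes.

    Illustrative only: the name and the 128 MiB default are assumptions,
    not an actual Spark or Hive API.
    """
    # Round up so files stay at or under the target; never return fewer
    # than one partition.
    return max(1, math.ceil(total_output_bytes / target_file_bytes))

# e.g. ~10 GiB of output at a 128 MiB target -> 80 output files
print(target_partitions(10 * 1024**3))  # 80
```

A size-aware implementation would feed this count into a map-only coalesce before the final commit, which is exactly the extra stage Forest describes below.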
On Wed, Jul 25, 2018 at 11:26 PM Forest Fang <forest.f...@outlook.com> wrote:

> Has there been any discussion to simply support Hive's merge-small-files
> configuration? It simply adds one additional stage that inspects the size
> of each output file, recomputes the desired parallelism to reach a target
> size, and runs a map-only coalesce before committing the final files.
> Since, AFAIK, Spark SQL already stages the final output commit, it seems
> feasible to respect this Hive config.
>
> https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
>
> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra <m...@clearstorydata.com> wrote:
>
>> See some of the related discussion under
>> https://github.com/apache/spark/pull/21589
>>
>> It feels to me like we need some kind of user-code mechanism to signal
>> policy preferences to Spark. This could also include ways to signal
>> scheduling policy, which could include things like scheduling pools and/or
>> barrier scheduling. Some of those scheduling policies currently operate at
>> inherently different levels -- e.g. scheduling pools at the Job level
>> (really, the thread-local level in the current implementation) and barrier
>> scheduling at the Stage level -- so it is not completely obvious how to
>> unify all of these policy options/preferences/mechanisms, or whether it is
>> even possible. But I think it is worth considering such things at a fairly
>> high level of abstraction, and trying to unify and simplify before making
>> things more complex with multiple policy mechanisms.
>>
>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin <r...@databricks.com> wrote:
>>
>>> Seems like a good idea in general. Do other systems have similar
>>> concepts? In general it'd be easier if we could follow an existing
>>> convention, if there is any.
>>>
>>> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge <jzh...@apache.org> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Many Spark users in my company are asking for a way to control the
>>>> number of output files in Spark SQL. There are use cases to either
>>>> reduce or increase the number. The users prefer not to use the
>>>> functions *repartition*(n) or *coalesce*(n, shuffle), which require
>>>> them to write and deploy Scala/Java/Python code.
>>>>
>>>> Could we introduce a query hint for this purpose (similar to the
>>>> Broadcast Join Hints)?
>>>>
>>>> /*+ *COALESCE*(n, shuffle) */
>>>>
>>>> In general, is a query hint the best way to bring DataFrame
>>>> functionality to SQL without extending SQL syntax? Any suggestion is
>>>> highly appreciated.
>>>>
>>>> This requirement is not the same as SPARK-6221, which asked for
>>>> auto-merging of output files.
>>>>
>>>> Thanks,
>>>> John Zhuge
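For context, a sketch of how the proposed hint might sit next to the broadcast join hints John references. The table names here are hypothetical, and the COALESCE hint syntax is only what this thread proposes, not an existing feature at the time of writing:

```sql
-- Existing broadcast join hint, for comparison:
SELECT /*+ BROADCAST(dim) */ *
FROM events JOIN dim ON events.key = dim.key;

-- Proposed hint from this thread: coalesce the output to n files,
-- optionally with a shuffle (i.e. repartition) instead of a narrow coalesce.
INSERT OVERWRITE TABLE compacted
SELECT /*+ COALESCE(10, true) */ * FROM events;
```

This would let SQL-only users control output file counts without writing and deploying Scala/Java/Python code, which is the gap John describes.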