Great help from the community!

On Sun, Aug 5, 2018 at 6:17 PM Xiao Li <gatorsm...@gmail.com> wrote:

> FYI, the new hints have been merged. They will be available in the upcoming release (Spark 2.4).
>
> *John Zhuge*, thanks for your work! Really appreciate it! Please submit more PRs and help the community improve Spark. : )
>
> Xiao
>
> 2018-08-05 21:06 GMT-04:00 Koert Kuipers <ko...@tresata.com>:
>
>> lukas,
>> what is the jira ticket for this? i would like to follow its activity.
>> thanks!
>> koert
>>
>> On Wed, Jul 25, 2018 at 5:32 PM, lukas nalezenec <lu...@apache.org> wrote:
>>
>>> Hi,
>>> Yes, this feature is planned - Spark should soon be able to repartition output by size.
>>> Lukas
>>>
>>> Dne st 25. 7. 2018 23:26 uživatel Forest Fang <forest.f...@outlook.com> napsal:
>>>
>>>> Has there been any discussion of simply supporting Hive's merge-small-files configuration? It adds one additional stage that inspects the size of each output file, recomputes the desired parallelism to reach a target size, and runs a map-only coalesce before committing the final files. Since AFAIK Spark SQL already stages the final output commit, it seems feasible to respect this Hive config.
>>>>
>>>> https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
>>>>
>>>> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>
>>>>> See some of the related discussion under https://github.com/apache/spark/pull/21589
>>>>>
>>>>> It feels to me like we need some kind of user code mechanism to signal policy preferences to Spark. This could also include ways to signal scheduling policy, which could include things like scheduling pool and/or barrier scheduling. Some of those scheduling policies operate at inherently different levels currently -- e.g. scheduling pools at the Job level (really, the thread-local level in the current implementation) and barrier scheduling at the Stage level -- so it is not completely obvious how to unify all of these policy options/preferences/mechanisms, or whether it is possible, but I think it is worth considering such things at a fairly high level of abstraction and trying to unify and simplify before making things more complex with multiple policy mechanisms.
>>>>>
>>>>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin <r...@databricks.com> wrote:
>>>>>
>>>>>> Seems like a good idea in general. Do other systems have similar concepts? In general it'd be easier if we can follow existing convention if there is any.
>>>>>>
>>>>>> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge <jzh...@apache.org> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Many Spark users in my company are asking for a way to control the number of output files in Spark SQL. There are use cases to either reduce or increase the number. The users prefer not to use the functions *repartition*(n) or *coalesce*(n, shuffle), which require them to write and deploy Scala/Java/Python code.
>>>>>>>
>>>>>>> Could we introduce a query hint for this purpose (similar to Broadcast Join Hints)?
>>>>>>>
>>>>>>> /*+ *COALESCE*(n, shuffle) */
>>>>>>>
>>>>>>> In general, is a query hint the best way to bring DataFrame functionality to SQL without extending SQL syntax? Any suggestion is highly appreciated.
>>>>>>>
>>>>>>> This requirement is not the same as SPARK-6221, which asked for auto-merging of output files.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> John Zhuge

--
John Zhuge
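For readers following the thread: the hints that were merged for Spark 2.4 ended up as two separate hints rather than the proposed `COALESCE(n, shuffle)` form: `COALESCE(n)` (reduce partitions without a shuffle) and `REPARTITION(n)` (force a full shuffle). A minimal sketch of how they might be used; the table names here are hypothetical, not from the thread:

```sql
-- Reduce the number of output files without triggering a shuffle,
-- equivalent to DataFrame coalesce(4):
INSERT OVERWRITE TABLE daily_report
SELECT /*+ COALESCE(4) */ * FROM staging_events;

-- Increase (or evenly rebalance) output files via a full shuffle,
-- equivalent to DataFrame repartition(100):
INSERT OVERWRITE TABLE daily_report
SELECT /*+ REPARTITION(100) */ * FROM staging_events;
```

This gives SQL-only users the same control over output file counts that `repartition(n)`/`coalesce(n)` give DataFrame users, without writing or deploying Scala/Java/Python code, which was the original ask in the thread.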