Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21498#discussion_r193618338 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -1099,6 +1099,17 @@ object SQLConf { .intConf .createWithDefault(SHUFFLE_SPILL_NUM_ELEMENTS_FORCE_SPILL_THRESHOLD.defaultValue.get) + val UNION_IN_SAME_PARTITION = + buildConf("spark.sql.unionInSamePartition") + .internal() + .doc("When true, Union operator will union children results in the same corresponding " + + "partitions if they have same partitioning. This eliminates unnecessary shuffle in later " + + "operators like aggregation. Note that because non-deterministic functions such as " + + "monotonically_increasing_id are depended on partition id. By doing this, the values of " + --- End diff -- I'm a bit not convinced by the reason and behavior of keeping the value of non-deterministic functions after an union. Like in the following queries: ```scala val df1 = spark.range(10).select(monotonically_increasing_id()) val df2 = spark.range(10).select(monotonically_increasing_id()) val union = df1.union(df2) ``` Now we keep the values of `monotonically_increasing_id` returned by `df1`, `df2` and `union` are the same. However, as non-deterministic functions, the values changing by data layout/sequence sounds still reasonable. Anyway that is current behavior and this config need users to enable this feature explicitly.
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org