[spark] branch branch-3.0 updated: [SPARK-34922][SQL][3.0] Use a relative cost comparison function in the CBO
This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new b9ee41f  [SPARK-34922][SQL][3.0] Use a relative cost comparison function in the CBO
b9ee41f is described below

commit b9ee41fa9957631ca0f859ee928358c108fbd9a9
Author: Tanel Kiis
AuthorDate: Thu Apr 8 11:03:59 2021 +0900

[SPARK-34922][SQL][3.0] Use a relative cost comparison function in the CBO

### What changes were proposed in this pull request?

Changed the cost comparison function of the CBO to use the ratios of row counts and sizes in bytes.

### Why are the changes needed?

In #30965 we changed the CBO cost comparison function so that it would be "symmetric": `A.betterThan(B)` now implies that `!B.betterThan(A)`. With that we caused performance regressions in some queries - TPCDS q19 for example.

The original cost comparison function used the ratios `relativeRows = A.rowCount / B.rowCount` and `relativeSize = A.size / B.size`. The changed function compared "absolute" cost values `costA = w*A.rowCount + (1-w)*A.size` and `costB = w*B.rowCount + (1-w)*B.size`.

Given the input from wzhfy we decided to go back to the relative values, because otherwise one (size) may overwhelm the other (rowCount). But this time we avoid adding up the ratios.

Originally `A.betterThan(B) => w*relativeRows + (1-w)*relativeSize < 1` was used. Besides being "non-symmetric", this can also let one ratio overwhelm the other. For `w=0.5`, if the size (in bytes) of `A` is at least 2x that of `B`, then no matter how many times more rows the `B` plan has, `B` will always be considered better - `0.5*2 + 0.5*0.01 > 1`.

When working with ratios, it is better to multiply them. The proposed cost comparison function is: `A.betterThan(B) => relativeRows^w * relativeSize^(1-w) < 1`.

### Does this PR introduce _any_ user-facing change?
Comparison of the changed TPCDS v1.4 query execution times at sf=10:

| query | absolute | multiplicative | change | additive | change |
| -- | -- | -- | -- | -- | -- |
| q12 | 145 | 137 | -5.52% | 141 | -2.76% |
| q13 | 264 | 271 | 2.65% | 271 | 2.65% |
| q17 | 4521 | 4243 | -6.15% | 4348 | -3.83% |
| q18 | 758 | 466 | -38.52% | 480 | -36.68% |
| q19 | 38503 | 2167 | -94.37% | 2176 | -94.35% |
| q20 | 119 | 120 | 0.84% | 126 | 5.88% |
| q24a | 16429 | 16838 | 2.49% | 17103 | 4.10% |
| q24b | 16592 | 16999 | 2.45% | 17268 | 4.07% |
| q25 | 3558 | 3556 | -0.06% | 3675 | 3.29% |
| q33 | 362 | 361 | -0.28% | 380 | 4.97% |
| q52 | 1020 | 1032 | 1.18% | 1052 | 3.14% |
| q55 | 927 | 938 | 1.19% | 961 | 3.67% |
| q72 | 24169 | 13377 | -44.65% | 24306 | 0.57% |
| q81 | 1285 | 1185 | -7.78% | 1168 | -9.11% |
| q91 | 324 | 336 | 3.70% | 337 | 4.01% |
| q98 | 126 | 129 | 2.38% | 131 | 3.97% |

All times are in ms; the change is relative to the situation on the master branch (absolute). The proposed cost function (multiplicative) significantly improves the performance on q18, q19 and q72. The original cost function (additive) has similar improvements on q18 and q19. All other changes are within the error bars and I would ignore them - perhaps q81 has also improved.

### How was this patch tested?

PlanStabilitySuite

Closes #32076 from tanelk/SPARK-34922_cbo_better_cost_function_3.0.
Lead-authored-by: Tanel Kiis Co-authored-by: tanel.k...@gmail.com Signed-off-by: Takeshi Yamamuro --- .../catalyst/optimizer/CostBasedJoinReorder.scala | 28 ++ .../org/apache/spark/sql/internal/SQLConf.scala| 6 +++-- .../sql/catalyst/optimizer/JoinReorderSuite.scala | 3 --- .../optimizer/StarJoinCostBasedReorderSuite.scala | 9 +++ 4 files changed, 32 insertions(+), 14 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala index 93c608dc..ed7d92e 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala @@ -343,12 +343,30 @@ object JoinReorderDP extends PredicateHelper with Logging { } } +/** + * To identify the plan with smaller computational cost, + * we use the weighted geometric mean of ratio of rows and the ratio of sizes in bytes. + * + * There are other ways to combine these values as a cost comparison function. + * Some of these, that we have experimented with, but have gotten worse result, + * than with the current one: + * 1) Weighted arithmetic mean of these two ratios - adding up
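The difference between the additive and the proposed multiplicative comparison functions described above can be sketched in a few lines of Python. This is illustrative only; the actual implementation lives in `CostBasedJoinReorder.scala`, and the function names here are ours.

```python
def better_than_multiplicative(rows_a, size_a, rows_b, size_b, w=0.5):
    # Proposed: weighted geometric mean of the two ratios.
    relative_rows = rows_a / rows_b
    relative_size = size_a / size_b
    return relative_rows ** w * relative_size ** (1 - w) < 1

def better_than_additive(rows_a, size_a, rows_b, size_b, w=0.5):
    # Original: weighted arithmetic mean of the two ratios.
    return w * (rows_a / rows_b) + (1 - w) * (size_a / size_b) < 1

# The pathological case from the description: plan A has 100x fewer rows but
# is 2x larger in bytes (relativeRows = 0.01, relativeSize = 2).
print(better_than_additive(1, 2, 100, 1))        # False: 0.5*0.01 + 0.5*2 > 1
print(better_than_multiplicative(1, 2, 100, 1))  # True: 0.01**0.5 * 2**0.5 ≈ 0.14 < 1
```

With the multiplicative form a large advantage in one ratio can still outweigh a modest disadvantage in the other, which is exactly what the additive form failed to allow.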
[spark] branch branch-3.1 updated (f6b5c6f -> 84d96e8)
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a change to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git. from f6b5c6f [SPARK-34970][SQL][SERCURITY][3.1] Redact map-type options in the output of explain() add 84d96e8 [SPARK-34922][SQL][3.1] Use a relative cost comparison function in the CBO No new revisions were added by this update. Summary of changes: .../catalyst/optimizer/CostBasedJoinReorder.scala | 28 +- .../org/apache/spark/sql/internal/SQLConf.scala| 6 +- .../optimizer/joinReorder/JoinReorderSuite.scala | 3 - .../StarJoinCostBasedReorderSuite.scala| 9 +- .../approved-plans-modified/q73.sf100/explain.txt | 8 +- .../approved-plans-v1_4/q12.sf100/explain.txt | 174 ++--- .../approved-plans-v1_4/q12.sf100/simplified.txt | 52 +- .../approved-plans-v1_4/q13.sf100/explain.txt | 138 ++-- .../approved-plans-v1_4/q13.sf100/simplified.txt | 34 +- .../approved-plans-v1_4/q18.sf100/explain.txt | 303 .../approved-plans-v1_4/q18.sf100/simplified.txt | 50 +- .../approved-plans-v1_4/q19.sf100/explain.txt | 368 - .../approved-plans-v1_4/q19.sf100/simplified.txt | 116 +-- .../approved-plans-v1_4/q20.sf100/explain.txt | 174 ++--- .../approved-plans-v1_4/q20.sf100/simplified.txt | 52 +- .../approved-plans-v1_4/q24a.sf100/explain.txt | 832 +++-- .../approved-plans-v1_4/q24a.sf100/simplified.txt | 34 +- .../approved-plans-v1_4/q24b.sf100/explain.txt | 832 +++-- .../approved-plans-v1_4/q24b.sf100/simplified.txt | 34 +- .../approved-plans-v1_4/q25.sf100/explain.txt | 186 ++--- .../approved-plans-v1_4/q25.sf100/simplified.txt | 130 ++-- .../approved-plans-v1_4/q33.sf100/explain.txt | 395 +- .../approved-plans-v1_4/q33.sf100/simplified.txt | 58 +- .../approved-plans-v1_4/q52.sf100/explain.txt | 138 ++-- .../approved-plans-v1_4/q52.sf100/simplified.txt | 26 +- .../approved-plans-v1_4/q55.sf100/explain.txt | 134 ++-- .../approved-plans-v1_4/q55.sf100/simplified.txt | 26 +- .../approved-plans-v1_4/q72.sf100/explain.txt 
| 260 +++ .../approved-plans-v1_4/q72.sf100/simplified.txt | 150 ++-- .../approved-plans-v1_4/q81.sf100/explain.txt | 570 +++--- .../approved-plans-v1_4/q81.sf100/simplified.txt | 142 ++-- .../approved-plans-v1_4/q91.sf100/explain.txt | 304 .../approved-plans-v1_4/q91.sf100/simplified.txt | 62 +- .../approved-plans-v1_4/q98.sf100/explain.txt | 182 ++--- .../approved-plans-v1_4/q98.sf100/simplified.txt | 52 +- .../approved-plans-v2_7/q12.sf100/explain.txt | 174 ++--- .../approved-plans-v2_7/q12.sf100/simplified.txt | 52 +- .../approved-plans-v2_7/q18a.sf100/explain.txt | 737 +- .../approved-plans-v2_7/q18a.sf100/simplified.txt | 54 +- .../approved-plans-v2_7/q20.sf100/explain.txt | 174 ++--- .../approved-plans-v2_7/q20.sf100/simplified.txt | 52 +- .../approved-plans-v2_7/q72.sf100/explain.txt | 260 +++ .../approved-plans-v2_7/q72.sf100/simplified.txt | 150 ++-- .../approved-plans-v2_7/q98.sf100/explain.txt | 178 ++--- .../approved-plans-v2_7/q98.sf100/simplified.txt | 52 +- 45 files changed, 4024 insertions(+), 3921 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-34955][SQL] ADD JAR command cannot add jar files which contains whitespaces in the path
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new e5d972e  [SPARK-34955][SQL] ADD JAR command cannot add jar files which contains whitespaces in the path
e5d972e is described below

commit e5d972e84e973d9a2e62312dc471df30c35269bc
Author: Kousuke Saruta
AuthorDate: Wed Apr 7 11:43:03 2021 -0700

[SPARK-34955][SQL] ADD JAR command cannot add jar files which contains whitespaces in the path

### What changes were proposed in this pull request?

This PR fixes an issue where the `ADD JAR` command can't add jar files which contain whitespaces in the path, though `ADD FILE` and `ADD ARCHIVE` work with such files. If we have `/some/path/test file.jar` and execute the following command:

```
ADD JAR "/some/path/test file.jar";
```

The following exception is thrown.

```
21/04/05 10:40:38 ERROR SparkSQLDriver: Failed in [add jar "/some/path/test file.jar"]
java.lang.IllegalArgumentException: Illegal character in path at index 9: /some/path/test file.jar
    at java.net.URI.create(URI.java:852)
    at org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:129)
    at org.apache.spark.sql.execution.command.AddJarCommand.run(resources.scala:34)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
```

This is because `HiveSessionStateBuilder` and `SessionStateBuilder` don't check whether the form of the path is a URI or a plain path, and always regard the path as URI form. Whitespaces should be encoded to `%20`, so `/some/path/test file.jar` is rejected. We can resolve this part by checking whether the given path is URI form or not.
Unfortunately, if we fix this part, another problem occurs. When we execute the `ADD JAR` command, Hive's `ADD JAR` command is executed in `HiveClientImpl.addJar` and `AddResourceProcessor.run` is transitively invoked. In `AddResourceProcessor.run`, the command line is just split by `\s+` so the path is split into `/some/path/test` and `file.jar` and passed to `ss.add_resources`.

https://github.com/apache/hive/blob/f1e87137034e4ecbe39a859d4ef44319800016d7/ql/src/java/org/apache/hadoop/hive/ql/processors/AddResourceProcessor.java#L56-L75

So, the command still fails. Even if we convert the form of the path to a URI like `file:/some/path/test%20file.jar` and execute the following command:

```
ADD JAR "file:/some/path/test%20file";
```

The following exception is thrown.

```
21/04/05 10:40:53 ERROR SessionState: file:/some/path/test%20file.jar does not exist
java.lang.IllegalArgumentException: file:/some/path/test%20file.jar does not exist
    at org.apache.hadoop.hive.ql.session.SessionState.validateFiles(SessionState.java:1168)
    at org.apache.hadoop.hive.ql.session.SessionState$ResourceType.preHook(SessionState.java:1289)
    at org.apache.hadoop.hive.ql.session.SessionState$ResourceType$1.preHook(SessionState.java:1278)
    at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1378)
    at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1336)
    at org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:74)
```

The reason is that `Utilities.realFile`, invoked in `SessionState.validateFiles`, returns `null` because the result of `fs.exists(path)` is `false`.

https://github.com/apache/hive/blob/f1e87137034e4ecbe39a859d4ef44319800016d7/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1052-L1064

`fs.exists` checks the existence of the given path by comparing the string representation of Hadoop's `Path`. The string representation of `Path` is similar to a URI but it's actually different.
`Path` doesn't encode the given path. For example, the URI form of `/some/path/jar file.jar` is `file:/some/path/jar%20file.jar` but the `Path` form of it is `file:/some/path/jar file.jar`. So `fs.exists` returns `false`.

So the solution I came up with is removing Hive's `ADD JAR` from `HiveClientImpl.addJar`. I think Hive's `ADD JAR` was used to add jar files to the class loader for metadata and isolate that class loader from the one for execution.

https://github.com/apache/spark/pull/6758/files#diff-cdb07de713c84779a5308f65be47964af865e15f00eb9897ccf8a74908d581bbR94-R103

But, now that SPARK-10810 and SPARK-10902 (#8909) are resolved,
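The plain-path-vs-URI ambiguity described above can be illustrated with a small Python sketch. The helper name is ours, not Spark's; the real fix is in Scala, and this naive scheme check is deliberately simplistic (it would misfire on e.g. Windows drive letters like `C:/...`).

```python
from urllib.parse import quote, urlparse

def to_uri_form(path: str) -> str:
    # If the string already carries a scheme (e.g. "file:"), treat it as a
    # URI; otherwise treat it as a plain local path and percent-encode the
    # characters (such as spaces) that are illegal in a URI.
    if urlparse(path).scheme:
        return path
    return "file:" + quote(path)

print(to_uri_form("/some/path/test file.jar"))  # file:/some/path/test%20file.jar
```

This is the kind of normalization that lets `java.net.URI.create` succeed, while the second half of the commit message explains why Hadoop's `Path` comparison still defeats it on the Hive side.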
[spark] branch branch-3.1 updated (eefef50 -> f6b5c6f)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git. from eefef50 [SPARK-34965][BUILD] Remove .sbtopts that duplicately sets the default memory add f6b5c6f [SPARK-34970][SQL][SERCURITY][3.1] Redact map-type options in the output of explain() No new revisions were added by this update. Summary of changes: .../apache/spark/sql/catalyst/trees/TreeNode.scala | 17 ++- .../resources/sql-tests/results/describe.sql.out | 2 +- .../scala/org/apache/spark/sql/ExplainSuite.scala | 53 ++ 3 files changed, 69 insertions(+), 3 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (3dfd456 -> 8e15ac1)
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 3dfd456 [SPARK-34668][SQL] Support casting of day-time intervals to strings add 8e15ac1 [SPARK-34493][DOCS] Add "TEXT Files" page for Data Source documents No new revisions were added by this update. Summary of changes: docs/_data/menu-sql.yaml | 2 + ...ata-sources-csv.md => sql-data-sources-text.md} | 14 +++--- docs/sql-data-sources.md | 1 + .../examples/sql/JavaSQLDataSourceExample.java | 50 + examples/src/main/python/sql/datasource.py | 51 ++ .../spark/examples/sql/SQLDataSourceExample.scala | 48 6 files changed, 159 insertions(+), 7 deletions(-) copy docs/{sql-data-sources-csv.md => sql-data-sources-text.md} (53%) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-34668][SQL] Support casting of day-time intervals to strings
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 3dfd456  [SPARK-34668][SQL] Support casting of day-time intervals to strings
3dfd456 is described below

commit 3dfd456b2c4133f751a67e4132196d2d1470af29
Author: Max Gekk
AuthorDate: Wed Apr 7 13:28:55 2021 +

[SPARK-34668][SQL] Support casting of day-time intervals to strings

### What changes were proposed in this pull request?

1. Added a new method `toDayTimeIntervalString()` to `IntervalUtils` which converts a day-time interval, given as a number of microseconds, to a string in the form **"INTERVAL '[sign]days hours:minutes:secondsWithFraction' DAY TO SECOND"**.
2. Extended the `Cast` expression to support casting of `DayTimeIntervalType` to `StringType`.

### Why are the changes needed?

To conform to the ANSI SQL standard, which requires support for such casting.

### Does this PR introduce _any_ user-facing change?

Should not, because the new day-time interval type has not been released yet.

### How was this patch tested?

Added new tests for casting:

```
$ build/sbt "testOnly *CastSuite*"
```

Closes #32070 from MaxGekk/cast-dt-interval-to-string.
Authored-by: Max Gekk Signed-off-by: Wenchen Fan --- .../spark/sql/catalyst/expressions/Cast.scala | 6 .../spark/sql/catalyst/util/IntervalUtils.scala| 32 + .../spark/sql/catalyst/expressions/CastSuite.scala | 33 +++--- 3 files changed, 67 insertions(+), 4 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala index 1c37713..879b154 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala @@ -408,6 +408,8 @@ abstract class CastBase extends UnaryExpression with TimeZoneAwareExpression wit buildCast[Any](_, o => UTF8String.fromString(udt.deserialize(o).toString)) case YearMonthIntervalType => buildCast[Int](_, i => UTF8String.fromString(IntervalUtils.toYearMonthIntervalString(i))) +case DayTimeIntervalType => + buildCast[Long](_, i => UTF8String.fromString(IntervalUtils.toDayTimeIntervalString(i))) case _ => buildCast[Any](_, o => UTF8String.fromString(o.toString)) } @@ -1127,6 +1129,10 @@ abstract class CastBase extends UnaryExpression with TimeZoneAwareExpression wit val iu = IntervalUtils.getClass.getName.stripSuffix("$") (c, evPrim, _) => code"""$evPrim = UTF8String.fromString($iu.toYearMonthIntervalString($c));""" + case DayTimeIntervalType => +val iu = IntervalUtils.getClass.getName.stripSuffix("$") +(c, evPrim, _) => + code"""$evPrim = UTF8String.fromString($iu.toDayTimeIntervalString($c));""" case _ => (c, evPrim, evNull) => code"$evPrim = UTF8String.fromString(String.valueOf($c));" } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala index 8cd9d28..b96a7b9 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala +++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala @@ -851,4 +851,36 @@ object IntervalUtils { } s"INTERVAL '$sign${absMonths / MONTHS_PER_YEAR}-${absMonths % MONTHS_PER_YEAR}' YEAR TO MONTH" } + + /** + * Converts a day-time interval as a number of microseconds to its textual representation + * which conforms to the ANSI SQL standard. + * + * @param micros The number of microseconds, positive or negative + * @return Day-time interval string + */ + def toDayTimeIntervalString(micros: Long): String = { +var sign = "" +var rest = micros +if (micros < 0) { + if (micros == Long.MinValue) { +// Especial handling of minimum `Long` value because negate op overflows `Long`. +// seconds = 106751991 * (24 * 60 * 60) + 4 * 60 * 60 + 54 = 9223372036854 +// microseconds = -922337203685400L-775808 == Long.MinValue +return "INTERVAL '-106751991 04:00:54.775808' DAY TO SECOND" + } else { +sign = "-" +rest = -rest + } +} +val seconds = rest % MICROS_PER_MINUTE +rest /= MICROS_PER_MINUTE +val minutes = rest % MINUTES_PER_HOUR +rest /= MINUTES_PER_HOUR +val hours = rest % HOURS_PER_DAY +val days = rest / HOURS_PER_DAY +val leadSecZero = if (seconds < 10 * MICROS_PER_SECOND) "0" else "" +val secStr = java.math.BigDecimal.valueOf(seconds,
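The diff above shows the core arithmetic of `toDayTimeIntervalString`. As a rough sketch, the same logic ports to Python; since Python integers are arbitrary-precision, the `Long.MinValue` special case from the Scala code is unnecessary here. The fixed six-digit fraction formatting is our simplification and may differ from Spark's exact output for whole-second values.

```python
MICROS_PER_SECOND = 1_000_000
MICROS_PER_MINUTE = 60 * MICROS_PER_SECOND

def to_day_time_interval_string(micros: int) -> str:
    sign, rest = "", micros
    if micros < 0:
        # No overflow here, unlike negating Long.MinValue in Scala/Java.
        sign, rest = "-", -micros
    sec_micros = rest % MICROS_PER_MINUTE       # seconds within the minute, in micros
    rest //= MICROS_PER_MINUTE
    minutes, rest = rest % 60, rest // 60
    hours, days = rest % 24, rest // 24
    # Zero-pad seconds below 10 (the leadSecZero logic) and keep 6 fraction digits.
    secs = f"{sec_micros // MICROS_PER_SECOND:02d}.{sec_micros % MICROS_PER_SECOND:06d}"
    return f"INTERVAL '{sign}{days} {hours:02d}:{minutes:02d}:{secs}' DAY TO SECOND"

print(to_day_time_interval_string(-(2 ** 63)))
# → INTERVAL '-106751991 04:00:54.775808' DAY TO SECOND (the Long.MinValue case)
```

Running it on `-(2 ** 63)` reproduces the hard-coded `Long.MinValue` answer from the Scala comment, which is a handy sanity check on the day/hour/minute/second decomposition.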
[spark] branch master updated: [SPARK-34976][SQL] Rename GroupingSet to BaseGroupingSets
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 5a3f41a  [SPARK-34976][SQL] Rename GroupingSet to BaseGroupingSets
5a3f41a is described below

commit 5a3f41a017bf0fb07cf242e5c6ba25fc231863c6
Author: Angerszh
AuthorDate: Wed Apr 7 13:27:21 2021 +

[SPARK-34976][SQL] Rename GroupingSet to BaseGroupingSets

### What changes were proposed in this pull request?

The current trait name `GroupingSet` is ambiguous, since a `grouping set` at the parser level means one set of a group. Rename it to `BaseGroupingSets`, since cube/rollup is syntax sugar for grouping sets.

### Why are the changes needed?

Refactor class name

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Not needed

Closes #32073 from AngersZh/SPARK-34976.

Authored-by: Angerszh
Signed-off-by: Wenchen Fan

---
.../spark/sql/catalyst/analysis/Analyzer.scala | 10 - .../spark/sql/catalyst/expressions/grouping.scala | 25 -- .../spark/sql/catalyst/parser/AstBuilder.scala | 3 ++- .../analysis/ResolveGroupingAnalyticsSuite.scala | 4 ++-- 4 files changed, 23 insertions(+), 19 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala index bbb7662..f50c28a 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala @@ -599,7 +599,7 @@ class Analyzer(override val catalogManager: CatalogManager) val aggForResolving = h.child match { // For CUBE/ROLLUP expressions, to avoid resolving repeatedly, here we delete them from // groupingExpressions for condition resolving.
-case a @ Aggregate(Seq(gs: GroupingSet), _, _) => +case a @ Aggregate(Seq(gs: BaseGroupingSets), _, _) => a.copy(groupingExpressions = gs.groupByExprs) } // Try resolving the condition of the filter as though it is in the aggregate clause @@ -610,7 +610,7 @@ class Analyzer(override val catalogManager: CatalogManager) if (resolvedInfo.nonEmpty) { val (extraAggExprs, resolvedHavingCond) = resolvedInfo.get val newChild = h.child match { - case Aggregate(Seq(gs: GroupingSet), aggregateExpressions, child) => + case Aggregate(Seq(gs: BaseGroupingSets), aggregateExpressions, child) => constructAggregate( gs.selectedGroupByExprs, gs.groupByExprs, aggregateExpressions ++ extraAggExprs, child) @@ -636,14 +636,14 @@ class Analyzer(override val catalogManager: CatalogManager) // CUBE/ROLLUP/GROUPING SETS. This also replace grouping()/grouping_id() in resolved // Filter/Sort. def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperatorsDown { - case h @ UnresolvedHaving(_, agg @ Aggregate(Seq(gs: GroupingSet), aggregateExpressions, _)) -if agg.childrenResolved && (gs.children ++ aggregateExpressions).forall(_.resolved) => + case h @ UnresolvedHaving(_, agg @ Aggregate(Seq(gs: BaseGroupingSets), aggExprs, _)) +if agg.childrenResolved && (gs.children ++ aggExprs).forall(_.resolved) => tryResolveHavingCondition(h) case a if !a.childrenResolved => a // be sure all of the children are resolved. // Ensure group by expressions and aggregate expressions have been resolved. 
- case Aggregate(Seq(gs: GroupingSet), aggregateExpressions, child) + case Aggregate(Seq(gs: BaseGroupingSets), aggregateExpressions, child) if (gs.children ++ aggregateExpressions).forall(_.resolved) => constructAggregate(gs.selectedGroupByExprs, gs.groupByExprs, aggregateExpressions, child) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala index 66cd140..bf28efa 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala @@ -26,14 +26,14 @@ import org.apache.spark.sql.types._ /** * A placeholder expression for cube/rollup, which will be replaced by analyzer */ -trait GroupingSet extends Expression with CodegenFallback { +trait BaseGroupingSets extends Expression with CodegenFallback { def groupingSets: Seq[Seq[Expression]] def selectedGroupByExprs: Seq[Seq[Expression]] def groupByExprs: Seq[Expression] = { assert(children.forall(_.resolved), - "Cannot call
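To illustrate the "cube/rollup is syntax sugar for grouping sets" point from the commit message, here is a small Python sketch (function names are ours, not Spark's) that enumerates the grouping sets the two constructs expand to:

```python
from itertools import combinations

def rollup_sets(cols):
    # ROLLUP(a, b, c) is sugar for GROUPING SETS ((a, b, c), (a, b), (a), ())
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

def cube_sets(cols):
    # CUBE(a, b) is sugar for GROUPING SETS over every subset of the columns
    return [tuple(c) for r in range(len(cols), -1, -1)
            for c in combinations(cols, r)]

print(rollup_sets(["a", "b", "c"]))  # [('a', 'b', 'c'), ('a', 'b'), ('a',), ()]
print(cube_sets(["a", "b"]))         # [('a', 'b'), ('a',), ('b',), ()]
```

Each emitted tuple is "one set of a group" in the parser's sense, which is why a trait covering cube, rollup, and explicit grouping sets is better named `BaseGroupingSets` (plural) than `GroupingSet`.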
[spark] branch master updated: [SPARK-34972][PYTHON] Make pandas-on-Spark doctests work
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 2635c38  [SPARK-34972][PYTHON] Make pandas-on-Spark doctests work
2635c38 is described below

commit 2635c3894ff935bf1cd2d86648a28dcb4dc3dc73
Author: Takuya UESHIN
AuthorDate: Wed Apr 7 20:50:41 2021 +0900

[SPARK-34972][PYTHON] Make pandas-on-Spark doctests work

### What changes were proposed in this pull request?

Now that we have merged the Koalas main code into the PySpark code base (#32036), we should enable doctests on Spark's test infrastructure.

### Why are the changes needed?

Currently the pandas-on-Spark modules are not tested at all. We should enable doctests first, and we will port the other unit tests separately later.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Enabled all the doctests.

Closes #32069 from ueshin/issues/SPARK-34972/pyspark-pandas_doctests.
Authored-by: Takuya UESHIN Signed-off-by: HyukjinKwon --- .github/workflows/build_and_test.yml | 2 + dev/run-tests.py | 6 +-- dev/sparktestsupport/modules.py| 44 python/pyspark/pandas/__init__.py | 17 -- python/pyspark/pandas/accessors.py | 30 +++ python/pyspark/pandas/base.py | 28 ++ python/pyspark/pandas/categorical.py | 30 +++ python/pyspark/pandas/config.py| 28 ++ python/pyspark/pandas/datetimes.py | 30 +++ python/pyspark/pandas/exceptions.py| 30 +++ python/pyspark/pandas/extensions.py| 32 python/pyspark/pandas/frame.py | 53 +-- python/pyspark/pandas/generic.py | 38 ++ python/pyspark/pandas/groupby.py | 40 +-- python/pyspark/pandas/indexes/base.py | 30 +++ python/pyspark/pandas/indexes/category.py | 30 +++ python/pyspark/pandas/indexes/datetimes.py | 30 +++ python/pyspark/pandas/indexes/multi.py | 32 python/pyspark/pandas/indexes/numeric.py | 30 +++ python/pyspark/pandas/indexing.py | 30 +++ python/pyspark/pandas/internal.py | 30 +++ python/pyspark/pandas/ml.py| 24 + python/pyspark/pandas/mlflow.py| 39 +- python/pyspark/pandas/namespace.py | 60 +++--- python/pyspark/pandas/numpy_compat.py | 30 +++ python/pyspark/pandas/series.py| 30 ++- python/pyspark/pandas/spark/accessors.py | 48 + python/pyspark/pandas/spark/utils.py | 19 +++ python/pyspark/pandas/{sql.py => sql_processor.py} | 30 +++ python/pyspark/pandas/strings.py | 30 +++ python/pyspark/pandas/typedef/typehints.py | 19 +++ python/pyspark/pandas/usage_logging/__init__.py| 6 +-- python/pyspark/pandas/utils.py | 28 ++ python/pyspark/pandas/window.py| 28 ++ 34 files changed, 983 insertions(+), 28 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 9253d58..3abe206 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -161,6 +161,8 @@ jobs: pyspark-sql, pyspark-mllib, pyspark-resource - >- pyspark-core, pyspark-streaming, pyspark-ml + - >- +pyspark-pandas env: MODULES_TO_TEST: ${{ matrix.modules }} 
HADOOP_PROFILE: hadoop3.2 diff --git a/dev/run-tests.py b/dev/run-tests.py index 83f9f02..c5b412d 100755 --- a/dev/run-tests.py +++ b/dev/run-tests.py @@ -123,17 +123,17 @@ def determine_modules_to_test(changed_modules, deduplicated=True): >>> [x.name for x in determine_modules_to_test([modules.sql])] ... # doctest: +NORMALIZE_WHITESPACE ['sql', 'avro', 'hive', 'mllib', 'sql-kafka-0-10', 'examples', 'hive-thriftserver', - 'pyspark-sql', 'repl', 'sparkr', 'pyspark-mllib', 'pyspark-ml'] + 'pyspark-sql', 'repl', 'sparkr', 'pyspark-mllib', 'pyspark-pandas', 'pyspark-ml'] >>> sorted([x.name for x in determine_modules_to_test( ... [modules.sparkr, modules.sql], deduplicated=False)]) ... # doctest: +NORMALIZE_WHITESPACE ['avro', 'examples', 'hive', 'hive-thriftserver', 'mllib', 'pyspark-ml', - 'pyspark-mllib', 'pyspark-sql', 'repl', 'sparkr',
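As a minimal illustration of what "enabling doctests" means for a module like `pyspark.pandas`: each example embedded in a docstring becomes a test case that is collected and executed. The function below is hypothetical, and this loosely mirrors (via the stdlib `doctest` machinery) how per-module doctest runs work.

```python
import doctest

def safe_div(a, b):
    """Divide a by b.

    >>> safe_div(6, 3)
    2.0
    """
    return a / b

# Collect and run the doctests attached to safe_div.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(safe_div, "safe_div", globs={"safe_div": safe_div}):
    runner.run(test)
print(runner.failures, runner.tries)  # → 0 1
```

The PR's diff is dominated by exactly this kind of bookkeeping: registering each `pyspark.pandas` submodule in `dev/sparktestsupport/modules.py` so its docstring examples are collected and run in CI.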
[spark] branch master updated (3c7d6c3 -> f208d80)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 3c7d6c3 [SPARK-27658][SQL] Add FunctionCatalog API add f208d80 [SPARK-34970][SQL][SERCURITY] Redact map-type options in the output of explain() No new revisions were added by this update. Summary of changes: .../apache/spark/sql/catalyst/trees/TreeNode.scala | 17 ++- .../resources/sql-tests/results/describe.sql.out | 2 +- .../scala/org/apache/spark/sql/ExplainSuite.scala | 53 ++ 3 files changed, 69 insertions(+), 3 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (06c09a7 -> 3c7d6c3)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 06c09a7 [SPARK-34969][SPARK-34906][SQL] Followup for Refactor TreeNode's children handling methods into specialized traits add 3c7d6c3 [SPARK-27658][SQL] Add FunctionCatalog API No new revisions were added by this update. Summary of changes: .../sql/connector/catalog/FunctionCatalog.java | 49 +++ .../catalog/functions/AggregateFunction.java | 94 + .../connector/catalog/functions/BoundFunction.java | 99 ++ .../sql/connector/catalog/functions/Function.java} | 17 ++- .../catalog/functions/ScalarFunction.java | 49 +++ .../catalog/functions/UnboundFunction.java | 50 +++ .../catalyst/analysis/NoSuchItemException.scala| 18 ++- .../catalog/functions/AggregateFunctionSuite.scala | 148 + 8 files changed, 513 insertions(+), 11 deletions(-) create mode 100644 sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/FunctionCatalog.java create mode 100644 sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java create mode 100644 sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/BoundFunction.java copy sql/{core/src/main/java/org/apache/spark/sql/api/java/UDF12.java => catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/Function.java} (74%) create mode 100644 sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/ScalarFunction.java create mode 100644 sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/UnboundFunction.java create mode 100644 sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/functions/AggregateFunctionSuite.scala - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (0aa2c28 -> 06c09a7)
This is an automated email from the ASF dual-hosted git repository. hvanhovell pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 0aa2c28 [SPARK-34678][SQL] Add table function registry add 06c09a7 [SPARK-34969][SPARK-34906][SQL] Followup for Refactor TreeNode's children handling methods into specialized traits No new revisions were added by this update. Summary of changes: .../org/apache/spark/ml/stat/Summarizer.scala | 8 -- .../sql/catalyst/expressions/Expression.scala | 13 +- .../catalyst/expressions/PartitionTransforms.scala | 2 +- .../catalyst/expressions/aggregate/Average.scala | 2 +- .../catalyst/expressions/aggregate/CountIf.scala | 2 +- .../expressions/aggregate/CountMinSketchAgg.scala | 13 +++--- .../expressions/aggregate/Covariance.scala | 12 - .../sql/catalyst/expressions/aggregate/Sum.scala | 2 +- .../expressions/aggregate/bitwiseAggregates.scala | 2 +- .../catalyst/expressions/aggregate/collect.scala | 2 +- .../spark/sql/catalyst/expressions/grouping.scala | 2 +- .../expressions/higherOrderFunctions.scala | 29 -- .../sql/catalyst/expressions/mathExpressions.scala | 6 - .../catalyst/expressions/regexpExpressions.scala | 6 - .../catalyst/expressions/stringExpressions.scala | 7 -- .../catalyst/expressions/windowExpressions.scala | 2 +- .../sql/catalyst/plans/logical/v2Commands.scala| 2 +- .../apache/spark/sql/catalyst/trees/TreeNode.scala | 8 ++ .../catalyst/expressions/CodeGenerationSuite.scala | 3 +-- .../sql/execution/command/DataWritingCommand.scala | 6 ++--- .../spark/sql/execution/command/commands.scala | 6 +++-- .../datasources/v2/AddPartitionExec.scala | 2 +- .../v2/AlterNamespaceSetPropertiesExec.scala | 2 +- .../execution/datasources/v2/AlterTableExec.scala | 2 +- .../execution/datasources/v2/CacheTableExec.scala | 4 +-- .../datasources/v2/CreateNamespaceExec.scala | 2 +- .../execution/datasources/v2/CreateTableExec.scala | 2 +- .../datasources/v2/DeleteFromTableExec.scala | 2 +- 
.../datasources/v2/DescribeColumnExec.scala| 2 +- .../datasources/v2/DescribeNamespaceExec.scala | 2 +- .../datasources/v2/DescribeTableExec.scala | 2 +- .../datasources/v2/DropNamespaceExec.scala | 2 +- .../datasources/v2/DropPartitionExec.scala | 2 +- .../execution/datasources/v2/DropTableExec.scala | 2 +- .../datasources/v2/RefreshTableExec.scala | 2 +- .../datasources/v2/RenamePartitionExec.scala | 2 +- .../execution/datasources/v2/RenameTableExec.scala | 2 +- .../datasources/v2/ReplaceTableExec.scala | 4 +-- .../v2/SetCatalogAndNamespaceExec.scala| 2 +- .../datasources/v2/ShowCurrentNamespaceExec.scala | 2 +- .../datasources/v2/ShowTablePropertiesExec.scala | 2 +- .../datasources/v2/TruncatePartitionExec.scala | 2 +- .../datasources/v2/TruncateTableExec.scala | 2 +- .../datasources/v2/V1FallbackWriters.scala | 3 +-- .../execution/datasources/v2/V2CommandExec.scala | 5 ++-- .../apache/spark/sql/DataFrameFunctionsSuite.scala | 5 ++-- .../apache/spark/sql/GeneratorFunctionSuite.scala | 4 +-- .../spark/sql/TypedImperativeAggregateSuite.scala | 8 +++--- .../sql/hive/execution/TestingTypedCount.scala | 6 ++--- 49 files changed, 127 insertions(+), 87 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org