[spark] branch branch-3.0 updated: [SPARK-34922][SQL][3.0] Use a relative cost comparison function in the CBO

2021-04-07 Thread yamamuro
This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new b9ee41f  [SPARK-34922][SQL][3.0] Use a relative cost comparison 
function in the CBO
b9ee41f is described below

commit b9ee41fa9957631ca0f859ee928358c108fbd9a9
Author: Tanel Kiis 
AuthorDate: Thu Apr 8 11:03:59 2021 +0900

[SPARK-34922][SQL][3.0] Use a relative cost comparison function in the CBO

### What changes were proposed in this pull request?

Changed the cost comparison function of the CBO to use the ratios of row 
counts and sizes in bytes.

### Why are the changes needed?

In #30965 we changed the CBO cost comparison function so it would be 
"symmetric": `A.betterThan(B)` now implies that `!B.betterThan(A)`.
With that we caused performance regressions in some queries - TPCDS q19, 
for example.

The original cost comparison function used the ratios `relativeRows = 
A.rowCount / B.rowCount` and `relativeSize = A.size / B.size`. The changed 
function compared "absolute" cost values `costA = w*A.rowCount + (1-w)*A.size` 
and `costB = w*B.rowCount + (1-w)*B.size`.

Based on the input from wzhfy we decided to go back to the relative values, 
because otherwise one factor (size) may overwhelm the other (rowCount). But 
this time we avoid adding up the ratios.

Originally `A.betterThan(B) => w*relativeRows + (1-w)*relativeSize < 1` was 
used. Besides being "non-symmetric", this can also let one term overwhelm the 
other.
For `w=0.5`, if the size (bytes) of `A` is at least 2x larger than that of `B`, 
then no matter how many times more rows the `B` plan has, `B` will always be 
considered better - `0.5*2 + 0.5*0.01 > 1`.

When working with ratios, it is better to multiply them.
The proposed cost comparison function is: `A.betterThan(B) => 
relativeRows^w * relativeSize^(1-w) < 1`.
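
A minimal sketch of such a comparison (the case class `PlanCost`, the 
standalone `betterThan` helper, and the `weight` parameter are illustrative, 
not the exact Spark internals; in Spark the weight comes from 
`spark.sql.cbo.joinReorder.card.weight`):

```
// Hedged sketch, not the committed Spark code:
// A is better than B iff relativeRows^w * relativeSize^(1-w) < 1.
case class PlanCost(rowCount: BigInt, size: BigInt)

def betterThan(a: PlanCost, b: PlanCost, weight: Double): Boolean = {
  if (b.rowCount == 0 || b.size == 0) {
    // Avoid dividing by zero; treat A as not better in this degenerate case.
    false
  } else {
    val relativeRows = BigDecimal(a.rowCount) / BigDecimal(b.rowCount)
    val relativeSize = BigDecimal(a.size) / BigDecimal(b.size)
    math.pow(relativeRows.toDouble, weight) *
      math.pow(relativeSize.toDouble, 1 - weight) < 1
  }
}

// The example from above, with w = 0.5, relativeSize = 2, relativeRows = 0.01:
//   additive:       0.5 * 0.01 + 0.5 * 2  = 1.005 >= 1  -> A not better
//   multiplicative: 0.01^0.5   * 2^0.5   ~= 0.14  <  1  -> A better
println(betterThan(PlanCost(rowCount = 1, size = 200),
                   PlanCost(rowCount = 100, size = 100), weight = 0.5)) // true
```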

### Does this PR introduce _any_ user-facing change?

Comparison of the changed TPCDS v1.4 query execution times at sf=10:

query | absolute | multiplicative | vs absolute | additive | vs absolute
-- | -- | -- | -- | -- | --
q12 | 145 | 137 | -5.52% | 141 | -2.76%
q13 | 264 | 271 | 2.65% | 271 | 2.65%
q17 | 4521 | 4243 | -6.15% | 4348 | -3.83%
q18 | 758 | 466 | -38.52% | 480 | -36.68%
q19 | 38503 | 2167 | -94.37% | 2176 | -94.35%
q20 | 119 | 120 | 0.84% | 126 | 5.88%
q24a | 16429 | 16838 | 2.49% | 17103 | 4.10%
q24b | 16592 | 16999 | 2.45% | 17268 | 4.07%
q25 | 3558 | 3556 | -0.06% | 3675 | 3.29%
q33 | 362 | 361 | -0.28% | 380 | 4.97%
q52 | 1020 | 1032 | 1.18% | 1052 | 3.14%
q55 | 927 | 938 | 1.19% | 961 | 3.67%
q72 | 24169 | 13377 | -44.65% | 24306 | 0.57%
q81 | 1285 | 1185 | -7.78% | 1168 | -9.11%
q91 | 324 | 336 | 3.70% | 337 | 4.01%
q98 | 126 | 129 | 2.38% | 131 | 3.97%

All times are in ms; the change is compared to the current master branch 
(the "absolute" column).
The proposed cost function (multiplicative) significantly improves the 
performance on q18, q19 and q72. The original cost function (additive) has 
similar improvements on q18 and q19. All other changes are within the error 
bars and I would ignore them - perhaps q81 has also improved.

### How was this patch tested?

PlanStabilitySuite

Closes #32076 from tanelk/SPARK-34922_cbo_better_cost_function_3.0.

Lead-authored-by: Tanel Kiis 
Co-authored-by: tanel.k...@gmail.com 
Signed-off-by: Takeshi Yamamuro 
---
 .../catalyst/optimizer/CostBasedJoinReorder.scala  | 28 ++
 .../org/apache/spark/sql/internal/SQLConf.scala|  6 +++--
 .../sql/catalyst/optimizer/JoinReorderSuite.scala  |  3 ---
 .../optimizer/StarJoinCostBasedReorderSuite.scala  |  9 +++
 4 files changed, 32 insertions(+), 14 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
index 93c608dc..ed7d92e 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
@@ -343,12 +343,30 @@ object JoinReorderDP extends PredicateHelper with Logging 
{
   }
 }
 
+/**
+ * To identify the plan with smaller computational cost,
+ * we use the weighted geometric mean of ratio of rows and the ratio of 
sizes in bytes.
+ *
+ * There are other ways to combine these values as a cost comparison 
function.
+ * Some of these, that we have experimented with, but have gotten worse 
result,
+ * than with the current one:
+ * 1) Weighted arithmetic mean of these two ratios - adding up 

[spark] branch branch-3.1 updated (f6b5c6f -> 84d96e8)

2021-04-07 Thread yamamuro
This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a change to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git.


from f6b5c6f  [SPARK-34970][SQL][SERCURITY][3.1] Redact map-type options in 
the output of explain()
 add 84d96e8  [SPARK-34922][SQL][3.1] Use a relative cost comparison 
function in the CBO

No new revisions were added by this update.

Summary of changes:
 .../catalyst/optimizer/CostBasedJoinReorder.scala  |  28 +-
 .../org/apache/spark/sql/internal/SQLConf.scala|   6 +-
 .../optimizer/joinReorder/JoinReorderSuite.scala   |   3 -
 .../StarJoinCostBasedReorderSuite.scala|   9 +-
 .../approved-plans-modified/q73.sf100/explain.txt  |   8 +-
 .../approved-plans-v1_4/q12.sf100/explain.txt  | 174 ++---
 .../approved-plans-v1_4/q12.sf100/simplified.txt   |  52 +-
 .../approved-plans-v1_4/q13.sf100/explain.txt  | 138 ++--
 .../approved-plans-v1_4/q13.sf100/simplified.txt   |  34 +-
 .../approved-plans-v1_4/q18.sf100/explain.txt  | 303 
 .../approved-plans-v1_4/q18.sf100/simplified.txt   |  50 +-
 .../approved-plans-v1_4/q19.sf100/explain.txt  | 368 -
 .../approved-plans-v1_4/q19.sf100/simplified.txt   | 116 +--
 .../approved-plans-v1_4/q20.sf100/explain.txt  | 174 ++---
 .../approved-plans-v1_4/q20.sf100/simplified.txt   |  52 +-
 .../approved-plans-v1_4/q24a.sf100/explain.txt | 832 +++--
 .../approved-plans-v1_4/q24a.sf100/simplified.txt  |  34 +-
 .../approved-plans-v1_4/q24b.sf100/explain.txt | 832 +++--
 .../approved-plans-v1_4/q24b.sf100/simplified.txt  |  34 +-
 .../approved-plans-v1_4/q25.sf100/explain.txt  | 186 ++---
 .../approved-plans-v1_4/q25.sf100/simplified.txt   | 130 ++--
 .../approved-plans-v1_4/q33.sf100/explain.txt  | 395 +-
 .../approved-plans-v1_4/q33.sf100/simplified.txt   |  58 +-
 .../approved-plans-v1_4/q52.sf100/explain.txt  | 138 ++--
 .../approved-plans-v1_4/q52.sf100/simplified.txt   |  26 +-
 .../approved-plans-v1_4/q55.sf100/explain.txt  | 134 ++--
 .../approved-plans-v1_4/q55.sf100/simplified.txt   |  26 +-
 .../approved-plans-v1_4/q72.sf100/explain.txt  | 260 +++
 .../approved-plans-v1_4/q72.sf100/simplified.txt   | 150 ++--
 .../approved-plans-v1_4/q81.sf100/explain.txt  | 570 +++---
 .../approved-plans-v1_4/q81.sf100/simplified.txt   | 142 ++--
 .../approved-plans-v1_4/q91.sf100/explain.txt  | 304 
 .../approved-plans-v1_4/q91.sf100/simplified.txt   |  62 +-
 .../approved-plans-v1_4/q98.sf100/explain.txt  | 182 ++---
 .../approved-plans-v1_4/q98.sf100/simplified.txt   |  52 +-
 .../approved-plans-v2_7/q12.sf100/explain.txt  | 174 ++---
 .../approved-plans-v2_7/q12.sf100/simplified.txt   |  52 +-
 .../approved-plans-v2_7/q18a.sf100/explain.txt | 737 +-
 .../approved-plans-v2_7/q18a.sf100/simplified.txt  |  54 +-
 .../approved-plans-v2_7/q20.sf100/explain.txt  | 174 ++---
 .../approved-plans-v2_7/q20.sf100/simplified.txt   |  52 +-
 .../approved-plans-v2_7/q72.sf100/explain.txt  | 260 +++
 .../approved-plans-v2_7/q72.sf100/simplified.txt   | 150 ++--
 .../approved-plans-v2_7/q98.sf100/explain.txt  | 178 ++---
 .../approved-plans-v2_7/q98.sf100/simplified.txt   |  52 +-
 45 files changed, 4024 insertions(+), 3921 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-34955][SQL] ADD JAR command cannot add jar files which contains whitespaces in the path

2021-04-07 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e5d972e  [SPARK-34955][SQL] ADD JAR command cannot add jar files which 
contains whitespaces in the path
e5d972e is described below

commit e5d972e84e973d9a2e62312dc471df30c35269bc
Author: Kousuke Saruta 
AuthorDate: Wed Apr 7 11:43:03 2021 -0700

[SPARK-34955][SQL] ADD JAR command cannot add jar files which contains 
whitespaces in the path

### What changes were proposed in this pull request?

This PR fixes an issue where the `ADD JAR` command can't add jar files whose 
paths contain whitespace, even though `ADD FILE` and `ADD ARCHIVE` work with 
such files.

If we have `/some/path/test file.jar` and execute the following command:

```
ADD JAR "/some/path/test file.jar";
```

The following exception is thrown.

```
21/04/05 10:40:38 ERROR SparkSQLDriver: Failed in [add jar "/some/path/test 
file.jar"]
java.lang.IllegalArgumentException: Illegal character in path at index 9: 
/some/path/test file.jar
at java.net.URI.create(URI.java:852)
at 
org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:129)
at 
org.apache.spark.sql.execution.command.AddJarCommand.run(resources.scala:34)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
```

This is because `HiveSessionStateBuilder` and `SessionStateBuilder` don't 
check whether the path is in URI form or a plain path, and always treat it as 
URI form.
Whitespace should be encoded as `%20`, so `/some/path/test file.jar` is 
rejected.
We can resolve this part by checking whether the given path is in URI form or 
not.
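
As a rough illustration of that check with plain JDK APIs (the `isUriForm` 
helper below is hypothetical, not Spark's actual implementation):

```
import java.io.File
import java.net.URI
import scala.util.Try

val rawPath = "/some/path/test file.jar"

// Treating the plain path directly as a URI fails on the unencoded space:
// URI.create(rawPath) throws IllegalArgumentException.
println(Try(URI.create(rawPath)))  // Failure(java.lang.IllegalArgumentException: ...)

// Converting the plain path explicitly encodes the space instead.
println(new File(rawPath).toURI)   // file:/some/path/test%20file.jar

// One possible check: only treat the string as a URI if it parses and has a scheme.
def isUriForm(path: String): Boolean =
  Try(new URI(path)).toOption.exists(_.getScheme != null)

println(isUriForm(rawPath))                             // false -> plain path
println(isUriForm("file:/some/path/test%20file.jar"))   // true  -> URI form
```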

Unfortunately, if we fix this part, another problem occurs.
When we execute the `ADD JAR` command, Hive's `ADD JAR` command is executed in 
`HiveClientImpl.addJar` and `AddResourceProcessor.run` is transitively invoked.
In `AddResourceProcessor.run`, the command line is simply split by `\s+`, so 
the path is also split into `/some/path/test` and `file.jar` and passed to 
`ss.add_resources`.

https://github.com/apache/hive/blob/f1e87137034e4ecbe39a859d4ef44319800016d7/ql/src/java/org/apache/hadoop/hive/ql/processors/AddResourceProcessor.java#L56-L75
So, the command still fails.

Even if we convert the path to URI form, like 
`file:/some/path/test%20file.jar`, and execute the following command:

```
ADD JAR "file:/some/path/test%20file";
```

The following exception is thrown.

```
21/04/05 10:40:53 ERROR SessionState: file:/some/path/test%20file.jar does 
not exist
java.lang.IllegalArgumentException: file:/some/path/test%20file.jar does 
not exist
at 
org.apache.hadoop.hive.ql.session.SessionState.validateFiles(SessionState.java:1168)
at 
org.apache.hadoop.hive.ql.session.SessionState$ResourceType.preHook(SessionState.java:1289)
at 
org.apache.hadoop.hive.ql.session.SessionState$ResourceType$1.preHook(SessionState.java:1278)
at 
org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1378)
at 
org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1336)
at 
org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:74)
```

The reason is that `Utilities.realFile`, invoked in 
`SessionState.validateFiles`, returns `null` because `fs.exists(path)` returns 
`false`.

https://github.com/apache/hive/blob/f1e87137034e4ecbe39a859d4ef44319800016d7/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1052-L1064

`fs.exists` checks the existence of the given path by comparing the string 
representation of Hadoop's `Path`.
The string representation of `Path` is similar to a URI, but it's actually 
different: `Path` doesn't encode the given path.
For example, the URI form of `/some/path/jar file.jar` is 
`file:/some/path/jar%20file.jar`, but the `Path` form of it is 
`file:/some/path/jar file.jar`. So `fs.exists` returns false.
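
A tiny demonstration of that mismatch (assumes `hadoop-common` on the 
classpath; not Spark code, just an illustration):

```
import java.io.File
import org.apache.hadoop.fs.Path

val raw = "/some/path/jar file.jar"

// The URI form percent-encodes the space.
println(new File(raw).toURI)       // file:/some/path/jar%20file.jar

// Hadoop's Path keeps the space literally; it does not encode the path.
println(new Path("file:" + raw))   // file:/some/path/jar file.jar

// The two string representations differ, so an existence check done with the
// encoded form will not find the real file named "jar file.jar".
```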

So the solution I came up with is to remove Hive's `ADD JAR` from 
`HiveClientImpl.addJar`.
I think Hive's `ADD JAR` was used to add jar files to the class loader for 
metadata and isolate the class loader from the one for execution.

https://github.com/apache/spark/pull/6758/files#diff-cdb07de713c84779a5308f65be47964af865e15f00eb9897ccf8a74908d581bbR94-R103

But, now that SPARK-10810 and SPARK-10902 (#8909) are resolved, 

[spark] branch branch-3.1 updated (eefef50 -> f6b5c6f)

2021-04-07 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a change to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git.


from eefef50  [SPARK-34965][BUILD] Remove .sbtopts that duplicately sets 
the default memory
 add f6b5c6f  [SPARK-34970][SQL][SERCURITY][3.1] Redact map-type options in 
the output of explain()

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/catalyst/trees/TreeNode.scala | 17 ++-
 .../resources/sql-tests/results/describe.sql.out   |  2 +-
 .../scala/org/apache/spark/sql/ExplainSuite.scala  | 53 ++
 3 files changed, 69 insertions(+), 3 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (3dfd456 -> 8e15ac1)

2021-04-07 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 3dfd456  [SPARK-34668][SQL] Support casting of day-time intervals to 
strings
 add 8e15ac1  [SPARK-34493][DOCS] Add "TEXT Files" page for Data Source 
documents

No new revisions were added by this update.

Summary of changes:
 docs/_data/menu-sql.yaml   |  2 +
 ...ata-sources-csv.md => sql-data-sources-text.md} | 14 +++---
 docs/sql-data-sources.md   |  1 +
 .../examples/sql/JavaSQLDataSourceExample.java | 50 +
 examples/src/main/python/sql/datasource.py | 51 ++
 .../spark/examples/sql/SQLDataSourceExample.scala  | 48 
 6 files changed, 159 insertions(+), 7 deletions(-)
 copy docs/{sql-data-sources-csv.md => sql-data-sources-text.md} (53%)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-34668][SQL] Support casting of day-time intervals to strings

2021-04-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 3dfd456  [SPARK-34668][SQL] Support casting of day-time intervals to 
strings
3dfd456 is described below

commit 3dfd456b2c4133f751a67e4132196d2d1470af29
Author: Max Gekk 
AuthorDate: Wed Apr 7 13:28:55 2021 +

[SPARK-34668][SQL] Support casting of day-time intervals to strings

### What changes were proposed in this pull request?
1. Added a new method `toDayTimeIntervalString()` to `IntervalUtils`, which 
converts a day-time interval given as a number of microseconds to a string of 
the form **"INTERVAL '[sign]days hours:minutes:secondsWithFraction' DAY TO 
SECOND"** (see the sketch after this list).
2. Extended the `Cast` expression to support casting of 
`DayTimeIntervalType` to `StringType`.
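
As a rough, illustrative sketch of the target format (plain arithmetic only; 
the committed implementation is in the diff below):

```
// 1 day, 1 hour, 1 minute, 1.000001 seconds expressed in microseconds.
val MICROS_PER_SECOND = 1000000L
val micros =
  1L * 24 * 60 * 60 * MICROS_PER_SECOND + // 1 day
  1L * 60 * 60 * MICROS_PER_SECOND +      // 1 hour
  1L * 60 * MICROS_PER_SECOND +           // 1 minute
  1L * MICROS_PER_SECOND + 1L             // 1.000001 seconds

val totalSeconds = micros / MICROS_PER_SECOND
val days     = totalSeconds / (24 * 3600)
val hours    = totalSeconds % (24 * 3600) / 3600
val minutes  = totalSeconds % 3600 / 60
val seconds  = totalSeconds % 60
val fraction = micros % MICROS_PER_SECOND

println(f"INTERVAL '$days $hours%02d:$minutes%02d:$seconds%02d.$fraction%06d' DAY TO SECOND")
// INTERVAL '1 01:01:01.000001' DAY TO SECOND
```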

### Why are the changes needed?
To conform to the ANSI SQL standard, which requires support for such casting.

### Does this PR introduce _any_ user-facing change?
Should not, because the new day-time interval type has not been released yet.

### How was this patch tested?
Added new tests for casting:
```
$ build/sbt "testOnly *CastSuite*"
```

Closes #32070 from MaxGekk/cast-dt-interval-to-string.

Authored-by: Max Gekk 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/expressions/Cast.scala  |  6 
 .../spark/sql/catalyst/util/IntervalUtils.scala| 32 +
 .../spark/sql/catalyst/expressions/CastSuite.scala | 33 +++---
 3 files changed, 67 insertions(+), 4 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
index 1c37713..879b154 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
@@ -408,6 +408,8 @@ abstract class CastBase extends UnaryExpression with 
TimeZoneAwareExpression wit
   buildCast[Any](_, o => 
UTF8String.fromString(udt.deserialize(o).toString))
 case YearMonthIntervalType =>
   buildCast[Int](_, i => 
UTF8String.fromString(IntervalUtils.toYearMonthIntervalString(i)))
+case DayTimeIntervalType =>
+  buildCast[Long](_, i => 
UTF8String.fromString(IntervalUtils.toDayTimeIntervalString(i)))
 case _ => buildCast[Any](_, o => UTF8String.fromString(o.toString))
   }
 
@@ -1127,6 +1129,10 @@ abstract class CastBase extends UnaryExpression with 
TimeZoneAwareExpression wit
 val iu = IntervalUtils.getClass.getName.stripSuffix("$")
 (c, evPrim, _) =>
   code"""$evPrim = 
UTF8String.fromString($iu.toYearMonthIntervalString($c));"""
+  case DayTimeIntervalType =>
+val iu = IntervalUtils.getClass.getName.stripSuffix("$")
+(c, evPrim, _) =>
+  code"""$evPrim = 
UTF8String.fromString($iu.toDayTimeIntervalString($c));"""
   case _ =>
 (c, evPrim, evNull) => code"$evPrim = 
UTF8String.fromString(String.valueOf($c));"
 }
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
index 8cd9d28..b96a7b9 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
@@ -851,4 +851,36 @@ object IntervalUtils {
 }
 s"INTERVAL '$sign${absMonths / MONTHS_PER_YEAR}-${absMonths % 
MONTHS_PER_YEAR}' YEAR TO MONTH"
   }
+
+  /**
+   * Converts a day-time interval as a number of microseconds to its textual 
representation
+   * which conforms to the ANSI SQL standard.
+   *
+   * @param micros The number of microseconds, positive or negative
+   * @return Day-time interval string
+   */
+  def toDayTimeIntervalString(micros: Long): String = {
+var sign = ""
+var rest = micros
+if (micros < 0) {
+  if (micros == Long.MinValue) {
+// Especial handling of minimum `Long` value because negate op 
overflows `Long`.
+// seconds = 106751991 * (24 * 60 * 60) + 4 * 60 * 60 + 54 = 
9223372036854
+// microseconds = -922337203685400L-775808 == Long.MinValue
+return "INTERVAL '-106751991 04:00:54.775808' DAY TO SECOND"
+  } else {
+sign = "-"
+rest = -rest
+  }
+}
+val seconds = rest % MICROS_PER_MINUTE
+rest /= MICROS_PER_MINUTE
+val minutes = rest % MINUTES_PER_HOUR
+rest /= MINUTES_PER_HOUR
+val hours = rest % HOURS_PER_DAY
+val days = rest / HOURS_PER_DAY
+val leadSecZero = if (seconds < 10 * MICROS_PER_SECOND) "0" else ""
+val secStr = java.math.BigDecimal.valueOf(seconds, 

[spark] branch master updated: [SPARK-34976][SQL] Rename GroupingSet to BaseGroupingSets

2021-04-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5a3f41a  [SPARK-34976][SQL] Rename GroupingSet to BaseGroupingSets
5a3f41a is described below

commit 5a3f41a017bf0fb07cf242e5c6ba25fc231863c6
Author: Angerszh 
AuthorDate: Wed Apr 7 13:27:21 2021 +

[SPARK-34976][SQL] Rename GroupingSet to BaseGroupingSets

### What changes were proposed in this pull request?
The current trait `GroupingSet` is ambiguous, since a `grouping set` at the 
parser level means one set of grouping expressions.
Rename it to `BaseGroupingSets`, since cube/rollup are syntax sugar for 
grouping sets.

### Why are the changes needed?
Refactor class name

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed

Closes #32073 from AngersZh/SPARK-34976.

Authored-by: Angerszh 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/analysis/Analyzer.scala | 10 -
 .../spark/sql/catalyst/expressions/grouping.scala  | 25 --
 .../spark/sql/catalyst/parser/AstBuilder.scala |  3 ++-
 .../analysis/ResolveGroupingAnalyticsSuite.scala   |  4 ++--
 4 files changed, 23 insertions(+), 19 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index bbb7662..f50c28a 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -599,7 +599,7 @@ class Analyzer(override val catalogManager: CatalogManager)
   val aggForResolving = h.child match {
 // For CUBE/ROLLUP expressions, to avoid resolving repeatedly, here we 
delete them from
 // groupingExpressions for condition resolving.
-case a @ Aggregate(Seq(gs: GroupingSet), _, _) =>
+case a @ Aggregate(Seq(gs: BaseGroupingSets), _, _) =>
   a.copy(groupingExpressions = gs.groupByExprs)
   }
   // Try resolving the condition of the filter as though it is in the 
aggregate clause
@@ -610,7 +610,7 @@ class Analyzer(override val catalogManager: CatalogManager)
   if (resolvedInfo.nonEmpty) {
 val (extraAggExprs, resolvedHavingCond) = resolvedInfo.get
 val newChild = h.child match {
-  case Aggregate(Seq(gs: GroupingSet), aggregateExpressions, child) =>
+  case Aggregate(Seq(gs: BaseGroupingSets), aggregateExpressions, 
child) =>
 constructAggregate(
   gs.selectedGroupByExprs, gs.groupByExprs,
   aggregateExpressions ++ extraAggExprs, child)
@@ -636,14 +636,14 @@ class Analyzer(override val catalogManager: 
CatalogManager)
 // CUBE/ROLLUP/GROUPING SETS. This also replace grouping()/grouping_id() 
in resolved
 // Filter/Sort.
 def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperatorsDown {
-  case h @ UnresolvedHaving(_, agg @ Aggregate(Seq(gs: GroupingSet), 
aggregateExpressions, _))
-if agg.childrenResolved && (gs.children ++ 
aggregateExpressions).forall(_.resolved) =>
+  case h @ UnresolvedHaving(_, agg @ Aggregate(Seq(gs: BaseGroupingSets), 
aggExprs, _))
+if agg.childrenResolved && (gs.children ++ 
aggExprs).forall(_.resolved) =>
 tryResolveHavingCondition(h)
 
   case a if !a.childrenResolved => a // be sure all of the children are 
resolved.
 
   // Ensure group by expressions and aggregate expressions have been 
resolved.
-  case Aggregate(Seq(gs: GroupingSet), aggregateExpressions, child)
+  case Aggregate(Seq(gs: BaseGroupingSets), aggregateExpressions, child)
 if (gs.children ++ aggregateExpressions).forall(_.resolved) =>
 constructAggregate(gs.selectedGroupByExprs, gs.groupByExprs, 
aggregateExpressions, child)
 
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala
index 66cd140..bf28efa 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala
@@ -26,14 +26,14 @@ import org.apache.spark.sql.types._
 /**
  * A placeholder expression for cube/rollup, which will be replaced by analyzer
  */
-trait GroupingSet extends Expression with CodegenFallback {
+trait BaseGroupingSets extends Expression with CodegenFallback {
 
   def groupingSets: Seq[Seq[Expression]]
   def selectedGroupByExprs: Seq[Seq[Expression]]
 
   def groupByExprs: Seq[Expression] = {
 assert(children.forall(_.resolved),
-  "Cannot call 

[spark] branch master updated: [SPARK-34972][PYTHON] Make pandas-on-Spark doctests work

2021-04-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2635c38  [SPARK-34972][PYTHON] Make pandas-on-Spark doctests work
2635c38 is described below

commit 2635c3894ff935bf1cd2d86648a28dcb4dc3dc73
Author: Takuya UESHIN 
AuthorDate: Wed Apr 7 20:50:41 2021 +0900

[SPARK-34972][PYTHON] Make pandas-on-Spark doctests work

### What changes were proposed in this pull request?

Now that we have merged the Koalas main code into the PySpark code base 
(#32036), we should enable doctests on Spark's test infrastructure.

### Why are the changes needed?

Currently the pandas-on-Spark modules are not tested at all.
We should enable doctests first, and we will port other unit tests 
separately later.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Enabled the whole doctests.

Closes #32069 from ueshin/issues/SPARK-34972/pyspark-pandas_doctests.

Authored-by: Takuya UESHIN 
Signed-off-by: HyukjinKwon 
---
 .github/workflows/build_and_test.yml   |  2 +
 dev/run-tests.py   |  6 +--
 dev/sparktestsupport/modules.py| 44 
 python/pyspark/pandas/__init__.py  | 17 --
 python/pyspark/pandas/accessors.py | 30 +++
 python/pyspark/pandas/base.py  | 28 ++
 python/pyspark/pandas/categorical.py   | 30 +++
 python/pyspark/pandas/config.py| 28 ++
 python/pyspark/pandas/datetimes.py | 30 +++
 python/pyspark/pandas/exceptions.py| 30 +++
 python/pyspark/pandas/extensions.py| 32 
 python/pyspark/pandas/frame.py | 53 +--
 python/pyspark/pandas/generic.py   | 38 ++
 python/pyspark/pandas/groupby.py   | 40 +--
 python/pyspark/pandas/indexes/base.py  | 30 +++
 python/pyspark/pandas/indexes/category.py  | 30 +++
 python/pyspark/pandas/indexes/datetimes.py | 30 +++
 python/pyspark/pandas/indexes/multi.py | 32 
 python/pyspark/pandas/indexes/numeric.py   | 30 +++
 python/pyspark/pandas/indexing.py  | 30 +++
 python/pyspark/pandas/internal.py  | 30 +++
 python/pyspark/pandas/ml.py| 24 +
 python/pyspark/pandas/mlflow.py| 39 +-
 python/pyspark/pandas/namespace.py | 60 +++---
 python/pyspark/pandas/numpy_compat.py  | 30 +++
 python/pyspark/pandas/series.py| 30 ++-
 python/pyspark/pandas/spark/accessors.py   | 48 +
 python/pyspark/pandas/spark/utils.py   | 19 +++
 python/pyspark/pandas/{sql.py => sql_processor.py} | 30 +++
 python/pyspark/pandas/strings.py   | 30 +++
 python/pyspark/pandas/typedef/typehints.py | 19 +++
 python/pyspark/pandas/usage_logging/__init__.py|  6 +--
 python/pyspark/pandas/utils.py | 28 ++
 python/pyspark/pandas/window.py| 28 ++
 34 files changed, 983 insertions(+), 28 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 9253d58..3abe206 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -161,6 +161,8 @@ jobs:
 pyspark-sql, pyspark-mllib, pyspark-resource
   - >-
 pyspark-core, pyspark-streaming, pyspark-ml
+  - >-
+pyspark-pandas
 env:
   MODULES_TO_TEST: ${{ matrix.modules }}
   HADOOP_PROFILE: hadoop3.2
diff --git a/dev/run-tests.py b/dev/run-tests.py
index 83f9f02..c5b412d 100755
--- a/dev/run-tests.py
+++ b/dev/run-tests.py
@@ -123,17 +123,17 @@ def determine_modules_to_test(changed_modules, 
deduplicated=True):
 >>> [x.name for x in determine_modules_to_test([modules.sql])]
 ... # doctest: +NORMALIZE_WHITESPACE
 ['sql', 'avro', 'hive', 'mllib', 'sql-kafka-0-10', 'examples', 
'hive-thriftserver',
- 'pyspark-sql', 'repl', 'sparkr', 'pyspark-mllib', 'pyspark-ml']
+ 'pyspark-sql', 'repl', 'sparkr', 'pyspark-mllib', 'pyspark-pandas', 
'pyspark-ml']
 >>> sorted([x.name for x in determine_modules_to_test(
 ... [modules.sparkr, modules.sql], deduplicated=False)])
 ... # doctest: +NORMALIZE_WHITESPACE
 ['avro', 'examples', 'hive', 'hive-thriftserver', 'mllib', 'pyspark-ml',
- 'pyspark-mllib', 'pyspark-sql', 'repl', 'sparkr', 

[spark] branch master updated (3c7d6c3 -> f208d80)

2021-04-07 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 3c7d6c3  [SPARK-27658][SQL] Add FunctionCatalog API
 add f208d80  [SPARK-34970][SQL][SERCURITY] Redact map-type options in the 
output of explain()

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/catalyst/trees/TreeNode.scala | 17 ++-
 .../resources/sql-tests/results/describe.sql.out   |  2 +-
 .../scala/org/apache/spark/sql/ExplainSuite.scala  | 53 ++
 3 files changed, 69 insertions(+), 3 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (06c09a7 -> 3c7d6c3)

2021-04-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 06c09a7  [SPARK-34969][SPARK-34906][SQL] Followup for Refactor 
TreeNode's children handling methods into specialized traits
 add 3c7d6c3  [SPARK-27658][SQL] Add FunctionCatalog API

No new revisions were added by this update.

Summary of changes:
 .../sql/connector/catalog/FunctionCatalog.java |  49 +++
 .../catalog/functions/AggregateFunction.java   |  94 +
 .../connector/catalog/functions/BoundFunction.java |  99 ++
 .../sql/connector/catalog/functions/Function.java} |  17 ++-
 .../catalog/functions/ScalarFunction.java  |  49 +++
 .../catalog/functions/UnboundFunction.java |  50 +++
 .../catalyst/analysis/NoSuchItemException.scala|  18 ++-
 .../catalog/functions/AggregateFunctionSuite.scala | 148 +
 8 files changed, 513 insertions(+), 11 deletions(-)
 create mode 100644 
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/FunctionCatalog.java
 create mode 100644 
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
 create mode 100644 
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/BoundFunction.java
 copy sql/{core/src/main/java/org/apache/spark/sql/api/java/UDF12.java => 
catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/Function.java}
 (74%)
 create mode 100644 
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/ScalarFunction.java
 create mode 100644 
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/UnboundFunction.java
 create mode 100644 
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/functions/AggregateFunctionSuite.scala

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (0aa2c28 -> 06c09a7)

2021-04-07 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 0aa2c28  [SPARK-34678][SQL] Add table function registry
 add 06c09a7  [SPARK-34969][SPARK-34906][SQL] Followup for Refactor 
TreeNode's children handling methods into specialized traits

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/ml/stat/Summarizer.scala  |  8 --
 .../sql/catalyst/expressions/Expression.scala  | 13 +-
 .../catalyst/expressions/PartitionTransforms.scala |  2 +-
 .../catalyst/expressions/aggregate/Average.scala   |  2 +-
 .../catalyst/expressions/aggregate/CountIf.scala   |  2 +-
 .../expressions/aggregate/CountMinSketchAgg.scala  | 13 +++---
 .../expressions/aggregate/Covariance.scala | 12 -
 .../sql/catalyst/expressions/aggregate/Sum.scala   |  2 +-
 .../expressions/aggregate/bitwiseAggregates.scala  |  2 +-
 .../catalyst/expressions/aggregate/collect.scala   |  2 +-
 .../spark/sql/catalyst/expressions/grouping.scala  |  2 +-
 .../expressions/higherOrderFunctions.scala | 29 --
 .../sql/catalyst/expressions/mathExpressions.scala |  6 -
 .../catalyst/expressions/regexpExpressions.scala   |  6 -
 .../catalyst/expressions/stringExpressions.scala   |  7 --
 .../catalyst/expressions/windowExpressions.scala   |  2 +-
 .../sql/catalyst/plans/logical/v2Commands.scala|  2 +-
 .../apache/spark/sql/catalyst/trees/TreeNode.scala |  8 ++
 .../catalyst/expressions/CodeGenerationSuite.scala |  3 +--
 .../sql/execution/command/DataWritingCommand.scala |  6 ++---
 .../spark/sql/execution/command/commands.scala |  6 +++--
 .../datasources/v2/AddPartitionExec.scala  |  2 +-
 .../v2/AlterNamespaceSetPropertiesExec.scala   |  2 +-
 .../execution/datasources/v2/AlterTableExec.scala  |  2 +-
 .../execution/datasources/v2/CacheTableExec.scala  |  4 +--
 .../datasources/v2/CreateNamespaceExec.scala   |  2 +-
 .../execution/datasources/v2/CreateTableExec.scala |  2 +-
 .../datasources/v2/DeleteFromTableExec.scala   |  2 +-
 .../datasources/v2/DescribeColumnExec.scala|  2 +-
 .../datasources/v2/DescribeNamespaceExec.scala |  2 +-
 .../datasources/v2/DescribeTableExec.scala |  2 +-
 .../datasources/v2/DropNamespaceExec.scala |  2 +-
 .../datasources/v2/DropPartitionExec.scala |  2 +-
 .../execution/datasources/v2/DropTableExec.scala   |  2 +-
 .../datasources/v2/RefreshTableExec.scala  |  2 +-
 .../datasources/v2/RenamePartitionExec.scala   |  2 +-
 .../execution/datasources/v2/RenameTableExec.scala |  2 +-
 .../datasources/v2/ReplaceTableExec.scala  |  4 +--
 .../v2/SetCatalogAndNamespaceExec.scala|  2 +-
 .../datasources/v2/ShowCurrentNamespaceExec.scala  |  2 +-
 .../datasources/v2/ShowTablePropertiesExec.scala   |  2 +-
 .../datasources/v2/TruncatePartitionExec.scala |  2 +-
 .../datasources/v2/TruncateTableExec.scala |  2 +-
 .../datasources/v2/V1FallbackWriters.scala |  3 +--
 .../execution/datasources/v2/V2CommandExec.scala   |  5 ++--
 .../apache/spark/sql/DataFrameFunctionsSuite.scala |  5 ++--
 .../apache/spark/sql/GeneratorFunctionSuite.scala  |  4 +--
 .../spark/sql/TypedImperativeAggregateSuite.scala  |  8 +++---
 .../sql/hive/execution/TestingTypedCount.scala |  6 ++---
 49 files changed, 127 insertions(+), 87 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org