[spark] branch branch-3.3 updated: [SPARK-39896][SQL] UnwrapCastInBinaryComparison should work when the literal of In/InSet downcast failed
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new c04aa36b771 [SPARK-39896][SQL] UnwrapCastInBinaryComparison should work when the literal of In/InSet downcast failed

c04aa36b771 is described below

commit c04aa36b7713a7ebaf368fc2ad4065478e264d85
Author: Fu Chen
AuthorDate: Wed Aug 31 13:32:17 2022 +0800

    [SPARK-39896][SQL] UnwrapCastInBinaryComparison should work when the literal of In/InSet downcast failed

    ### Why are the changes needed?

    This PR aims to fix the following case:

    ```scala
    sql("create table t1(a decimal(3, 0)) using parquet")
    sql("insert into t1 values(100), (10), (1)")
    sql("select * from t1 where a in(10, 1.00)").show
    ```

    ```
    java.lang.RuntimeException: After applying rule
    org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch
    Operator Optimization before Inferring Filters, the structural integrity of the plan is broken.
        at org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1325)
    ```

    1. The rule `UnwrapCastInBinaryComparison` transforms the `In` expression into a disjunction of `EqualTo`:

       ```
       CAST(a as decimal(12,2)) IN (10.00, 1.00)
       =>
       OR(
         CAST(a as decimal(12,2)) = 10.00,
         CAST(a as decimal(12,2)) = 1.00
       )
       ```

    2. It then uses `UnwrapCastInBinaryComparison.unwrapCast()` to optimize each `EqualTo`:

       ```
       // Expression1
       CAST(a as decimal(12,2)) = 10.00   =>   CAST(a as decimal(12,2)) = 10.00
       // Expression2
       CAST(a as decimal(12,2)) = 1.00    =>   a = 1
       ```

    3. Finally, it returns a new unwrapped cast expression `In`:

       ```
       a IN (10.00, 1.00)
       ```

    Before this PR: the method `UnwrapCastInBinaryComparison.unwrapCast()` returns the original expression when the downcast to the decimal type fails (`Expression1`), but returns the unwrapped expression when the downcast succeeds (`Expression2`). The two expressions therefore have different data types, which breaks the structural integrity of the plan:

    ```
    a IN (10.00, 1.00)
         |      |
         |      decimal(3, 0)
         decimal(12, 2)
    ```

    After this PR: the expression whose literal fails to downcast is rewritten to `falseIfNotNull(fromExp)`:

    ```
    (isnull(a) AND null) OR a IN (1.00)
    ```

    ### Does this PR introduce _any_ user-facing change?

    No, only a bug fix.

    ### How was this patch tested?

    Unit test.

    Closes #37439 from cfmcgrady/SPARK-39896.

    Authored-by: Fu Chen
    Signed-off-by: Wenchen Fan
    (cherry picked from commit 6e62b93f3d1ef7e2d6be0a3bb729ab9b2d55a36d)
    Signed-off-by: Wenchen Fan
---
 .../optimizer/UnwrapCastInBinaryComparison.scala | 131 ++---
 .../UnwrapCastInBinaryComparisonSuite.scala      |  68 ---
 2 files changed, 113 insertions(+), 86 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala
index 94e27379b74..f4a92760d22 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala
@@ -17,7 +17,6 @@
 package org.apache.spark.sql.catalyst.optimizer
 
-import scala.collection.immutable.HashSet
 import scala.collection.mutable.ArrayBuffer
 
 import org.apache.spark.sql.catalyst.expressions._
@@ -145,80 +144,28 @@ object UnwrapCastInBinaryComparison extends Rule[LogicalPlan] {
     case in @ In(Cast(fromExp, toType: NumericType, _, _), list @ Seq(firstLit, _*))
       if canImplicitlyCast(fromExp, toType, firstLit.dataType) && in.inSetConvertible =>
-      // There are 3 kinds of literals in the list:
-      // 1. null literals
-      // 2. The literals that can cast to fromExp.dataType
-      // 3. The literals that cannot cast to fromExp.dataType
-      // null literals is special as we can cast null literals to any data type.
-      val (nullList, canCastList, cannotCastList) =
-        (ArrayBuffer[Literal](), ArrayBuffer[Literal](), ArrayBuffer[Expression]())
-      list.foreach {
-        case lit @ Literal(null, _) => nullList += lit
-        case lit @ NonNullLiteral(_, _) =>
-          unwrapCast(EqualTo(in.value, lit)) match {
-            case EqualTo(_, unwrapLit: Literal) => canCastList += unwrapLit
-            case e @ And(IsNull(_), Literal(null, BooleanType)) => cannotCastList += e
-            case _ => throw new
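The three-way split of the `In` list performed in the hunk above can be modeled in plain Python. This is an illustrative sketch, not Spark code: `downcast` is a hypothetical stand-in for the literal unwrapping done by `UnwrapCastInBinaryComparison.unwrapCast()` (a simplified model of narrowing to `decimal(3, 0)`), and it does not reproduce Spark's exact cast semantics.

```python
from decimal import Decimal

def downcast(lit, precision=3, scale=0):
    """Return the literal narrowed to decimal(precision, scale), or None on failure.

    Simplified model: the downcast succeeds only if no fractional digits
    are lost and the value fits in the target precision.
    """
    exp = Decimal(1).scaleb(-scale)       # smallest representable step, e.g. 1 for scale 0
    if lit != lit.quantize(exp):
        return None                       # fractional digits would be truncated
    if abs(lit) >= Decimal(10) ** (precision - scale):
        return None                       # too many integral digits for the precision
    return lit.quantize(exp)              # e.g. 1.00 -> 1

def partition_in_list(literals):
    """Split an IN-list the way the rule does: nulls, castable, non-castable."""
    null_list, can_cast, cannot_cast = [], [], []
    for lit in literals:
        if lit is None:
            null_list.append(lit)             # null casts to any data type
        elif (unwrapped := downcast(lit)) is not None:
            can_cast.append(unwrapped)        # e.g. CAST(a) = 1.00  =>  a = 1
        else:
            cannot_cast.append(lit)           # fix rewrites these to falseIfNotNull(a)
    return null_list, can_cast, cannot_cast

nulls, ok, bad = partition_in_list([Decimal("1.50"), Decimal("1.00"), None])
```

Before the fix, the entries that land in `cannot_cast` kept their original (wider) type while the rest were narrowed, producing the mixed-type `In` list the commit message describes.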
[spark] branch master updated: [SPARK-39896][SQL] UnwrapCastInBinaryComparison should work when the literal of In/InSet downcast failed
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 6e62b93f3d1 [SPARK-39896][SQL] UnwrapCastInBinaryComparison should work when the literal of In/InSet downcast failed

6e62b93f3d1 is described below

commit 6e62b93f3d1ef7e2d6be0a3bb729ab9b2d55a36d
Author: Fu Chen
AuthorDate: Wed Aug 31 13:32:17 2022 +0800

    [SPARK-39896][SQL] UnwrapCastInBinaryComparison should work when the literal of In/InSet downcast failed

    ### Why are the changes needed?

    This PR aims to fix the following case:

    ```scala
    sql("create table t1(a decimal(3, 0)) using parquet")
    sql("insert into t1 values(100), (10), (1)")
    sql("select * from t1 where a in(10, 1.00)").show
    ```

    ```
    java.lang.RuntimeException: After applying rule
    org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch
    Operator Optimization before Inferring Filters, the structural integrity of the plan is broken.
        at org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1325)
    ```

    1. The rule `UnwrapCastInBinaryComparison` transforms the `In` expression into a disjunction of `EqualTo`:

       ```
       CAST(a as decimal(12,2)) IN (10.00, 1.00)
       =>
       OR(
         CAST(a as decimal(12,2)) = 10.00,
         CAST(a as decimal(12,2)) = 1.00
       )
       ```

    2. It then uses `UnwrapCastInBinaryComparison.unwrapCast()` to optimize each `EqualTo`:

       ```
       // Expression1
       CAST(a as decimal(12,2)) = 10.00   =>   CAST(a as decimal(12,2)) = 10.00
       // Expression2
       CAST(a as decimal(12,2)) = 1.00    =>   a = 1
       ```

    3. Finally, it returns a new unwrapped cast expression `In`:

       ```
       a IN (10.00, 1.00)
       ```

    Before this PR: the method `UnwrapCastInBinaryComparison.unwrapCast()` returns the original expression when the downcast to the decimal type fails (`Expression1`), but returns the unwrapped expression when the downcast succeeds (`Expression2`). The two expressions therefore have different data types, which breaks the structural integrity of the plan:

    ```
    a IN (10.00, 1.00)
         |      |
         |      decimal(3, 0)
         decimal(12, 2)
    ```

    After this PR: the expression whose literal fails to downcast is rewritten to `falseIfNotNull(fromExp)`:

    ```
    (isnull(a) AND null) OR a IN (1.00)
    ```

    ### Does this PR introduce _any_ user-facing change?

    No, only a bug fix.

    ### How was this patch tested?

    Unit test.

    Closes #37439 from cfmcgrady/SPARK-39896.

    Authored-by: Fu Chen
    Signed-off-by: Wenchen Fan
---
 .../optimizer/UnwrapCastInBinaryComparison.scala | 131 ++---
 .../UnwrapCastInBinaryComparisonSuite.scala      |  68 ---
 2 files changed, 113 insertions(+), 86 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala
index 94e27379b74..f4a92760d22 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala
@@ -17,7 +17,6 @@
 package org.apache.spark.sql.catalyst.optimizer
 
-import scala.collection.immutable.HashSet
 import scala.collection.mutable.ArrayBuffer
 
 import org.apache.spark.sql.catalyst.expressions._
@@ -145,80 +144,28 @@ object UnwrapCastInBinaryComparison extends Rule[LogicalPlan] {
     case in @ In(Cast(fromExp, toType: NumericType, _, _), list @ Seq(firstLit, _*))
       if canImplicitlyCast(fromExp, toType, firstLit.dataType) && in.inSetConvertible =>
-      // There are 3 kinds of literals in the list:
-      // 1. null literals
-      // 2. The literals that can cast to fromExp.dataType
-      // 3. The literals that cannot cast to fromExp.dataType
-      // null literals is special as we can cast null literals to any data type.
-      val (nullList, canCastList, cannotCastList) =
-        (ArrayBuffer[Literal](), ArrayBuffer[Literal](), ArrayBuffer[Expression]())
-      list.foreach {
-        case lit @ Literal(null, _) => nullList += lit
-        case lit @ NonNullLiteral(_, _) =>
-          unwrapCast(EqualTo(in.value, lit)) match {
-            case EqualTo(_, unwrapLit: Literal) => canCastList += unwrapLit
-            case e @ And(IsNull(_), Literal(null, BooleanType)) => cannotCastList += e
-            case _ => throw new IllegalStateException("Illegal unwrap cast result found.")
-          }
-        case _ => throw new IllegalStateException("Illegal value
[spark] branch master updated: [SPARK-40271][PYTHON] Support list type for `pyspark.sql.functions.lit`
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 65d89f8e897 [SPARK-40271][PYTHON] Support list type for `pyspark.sql.functions.lit`

65d89f8e897 is described below

commit 65d89f8e897449f7f026144a76328ff545fecde2
Author: itholic
AuthorDate: Wed Aug 31 11:37:20 2022 +0800

    [SPARK-40271][PYTHON] Support list type for `pyspark.sql.functions.lit`

    ### What changes were proposed in this pull request?

    This PR proposes to support the `list` type for `pyspark.sql.functions.lit`.

    ### Why are the changes needed?

    To improve the API usability.

    ### Does this PR introduce _any_ user-facing change?

    Yes, the `list` type is now available for `pyspark.sql.functions.lit` as below:

    - Before

    ```python
    >>> spark.range(3).withColumn("c", lit([1,2,3])).show()
    Traceback (most recent call last):
    ...
    : org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] The feature is not supported: Literal for '[1, 2, 3]' of class java.util.ArrayList.
        at org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:302)
        at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:100)
        at org.apache.spark.sql.functions$.lit(functions.scala:125)
        at org.apache.spark.sql.functions.lit(functions.scala)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
        at java.base/java.lang.reflect.Method.invoke(Method.java:577)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.base/java.lang.Thread.run(Thread.java:833)
    ```

    - After

    ```python
    >>> spark.range(3).withColumn("c", lit([1,2,3])).show()
    +---+---------+
    | id|        c|
    +---+---------+
    |  0|[1, 2, 3]|
    |  1|[1, 2, 3]|
    |  2|[1, 2, 3]|
    +---+---------+
    ```

    ### How was this patch tested?

    Added doctest & unit test.

    Closes #37722 from itholic/SPARK-40271.

    Authored-by: itholic
    Signed-off-by: Ruifeng Zheng
---
 python/pyspark/sql/functions.py            | 23 +-
 python/pyspark/sql/tests/test_functions.py | 26 ++
 2 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 03c16db602f..e7a7a1b37f3 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -131,10 +131,13 @@ def lit(col: Any) -> Column:
 
     Parameters
     ----------
-    col : :class:`~pyspark.sql.Column` or Python primitive type.
+    col : :class:`~pyspark.sql.Column`, str, int, float, bool or list.
         the value to make it as a PySpark literal. If a column is passed,
         it returns the column as is.
 
+        .. versionchanged:: 3.4.0
+            Since 3.4.0, it supports the list type.
+
     Returns
    -------
     :class:`~pyspark.sql.Column`
@@ -149,8 +152,24 @@ def lit(col: Any) -> Column:
     +------+---+
     |     5|  0|
     +------+---+
+
+    Create a literal from a list.
+
+    >>> spark.range(1).select(lit([1, 2, 3])).show()
+    +--------------+
+    |array(1, 2, 3)|
+    +--------------+
+    |     [1, 2, 3]|
+    +--------------+
     """
-    return col if isinstance(col, Column) else _invoke_function("lit", col)
+    if isinstance(col, Column):
+        return col
+    elif isinstance(col, list):
+        if any(isinstance(c, Column) for c in col):
+            raise ValueError("lit does not allow for list of Columns")
+        return array(*[lit(item) for item in col])
+    else:
+        return _invoke_function("lit", col)
 
 
 def col(col: str) -> Column:
diff --git a/python/pyspark/sql/tests/test_functions.py b/python/pyspark/sql/tests/test_functions.py
index 102ebef8317..1d02a540558 100644
--- a/python/pyspark/sql/tests/test_functions.py
+++ b/python/pyspark/sql/tests/test_functions.py
@@ -962,6 +962,32 @@ class FunctionsTests(ReusedSQLTestCase):
         actual = self.spark.range(1).select(lit(td)).first()[0]
         self.assertEqual(actual, td)
 
+    def test_lit_list(self):
+        # SPARK-40271: added list type supporting
+        test_list = [1, 2, 3]
+        expected =
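The new dispatch in `lit` can be sketched without a running Spark cluster. In this self-contained model, `Column`, `array`, and `_invoke_function` are hypothetical stand-ins for the PySpark objects of the same names; only the branching logic mirrors the diff above.

```python
class Column:
    """Stand-in for pyspark.sql.Column: wraps a textual expression description."""
    def __init__(self, expr):
        self.expr = expr

def _invoke_function(name, value):
    # Stand-in for the py4j call into org.apache.spark.sql.functions
    return Column(f"{name}({value!r})")

def array(*cols):
    # Stand-in for pyspark.sql.functions.array
    return Column("array(" + ", ".join(c.expr for c in cols) + ")")

def lit(col):
    """Mirrors the new dispatch: Column pass-through, list -> array(lit(...))."""
    if isinstance(col, Column):
        return col                                      # a Column is returned as is
    elif isinstance(col, list):
        if any(isinstance(c, Column) for c in col):
            raise ValueError("lit does not allow for list of Columns")
        return array(*[lit(item) for item in col])      # recurse: handles nested lists
    else:
        return _invoke_function("lit", col)

print(lit([1, 2, 3]).expr)  # -> array(lit(1), lit(2), lit(3))
```

The recursion is the interesting design choice: because `lit` calls itself on each element, nested lists fall out of the same three branches as flat ones, while a `Column` hidden inside a list is rejected explicitly rather than silently mis-serialized.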
[spark] branch master updated (296fe49ec85 -> c2d4c4ed4c5)
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 296fe49ec85 [SPARK-40260][SQL] Use error classes in the compilation errors of GROUP BY a position
     add c2d4c4ed4c5 [SPARK-40256][BUILD][K8S] Switch base image from openjdk to eclipse-temurin

No new revisions were added by this update.

Summary of changes:
 bin/docker-image-tool.sh                                    | 4 ++--
 .../kubernetes/docker/src/main/dockerfiles/spark/Dockerfile | 5 ++---
 .../integration-tests/scripts/setup-integration-test-env.sh | 2 +-
 3 files changed, 5 insertions(+), 6 deletions(-)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-40260][SQL] Use error classes in the compilation errors of GROUP BY a position
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 296fe49ec85 [SPARK-40260][SQL] Use error classes in the compilation errors of GROUP BY a position

296fe49ec85 is described below

commit 296fe49ec855ac8c15c080e7bab6d519fe504bd3
Author: Max Gekk
AuthorDate: Tue Aug 30 20:43:19 2022 +0300

    [SPARK-40260][SQL] Use error classes in the compilation errors of GROUP BY a position

    ### What changes were proposed in this pull request?

    In the PR, I propose the following new error classes:
    - GROUP_BY_POS_OUT_OF_RANGE
    - GROUP_BY_POS_REFERS_AGG_EXPR

    and migrate 2 compilation exceptions related to GROUP BY a position onto them.

    ### Why are the changes needed?

    The migration onto error classes makes the errors searchable in docs, and allows editing error text messages w/o modifying the source code.

    ### Does this PR introduce _any_ user-facing change?

    Yes, in some sense, because it modifies user-facing error messages.

    ### How was this patch tested?

    By running the affected test suites:
    ```
    $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
    $ build/sbt "core/testOnly *SparkThrowableSuite"
    ```

    Closes #37712 from MaxGekk/group-ref-agg-error.

    Lead-authored-by: Max Gekk
    Co-authored-by: Maxim Gekk
    Signed-off-by: Max Gekk
---
 core/src/main/resources/error/error-classes.json   | 12
 .../org/apache/spark/sql/AnalysisException.scala   |  2 +-
 .../spark/sql/errors/QueryCompilationErrors.scala  | 11 +--
 .../sql-tests/results/group-by-ordinal.sql.out     | 81 +++---
 .../results/postgreSQL/select_implicit.sql.out     |  9 ++-
 .../udf/postgreSQL/udf-select_implicit.sql.out     |  9 ++-
 6 files changed, 107 insertions(+), 17 deletions(-)

diff --git a/core/src/main/resources/error/error-classes.json b/core/src/main/resources/error/error-classes.json
index 816df79e508..df0f887a63c 100644
--- a/core/src/main/resources/error/error-classes.json
+++ b/core/src/main/resources/error/error-classes.json
@@ -136,6 +136,18 @@
     "message" : [
       "Grouping sets size cannot be greater than <maxSize>"
     ]
   },
+  "GROUP_BY_POS_OUT_OF_RANGE" : {
+    "message" : [
+      "GROUP BY position <index> is not in select list (valid range is [1, <size>])."
+    ],
+    "sqlState" : "42000"
+  },
+  "GROUP_BY_POS_REFERS_AGG_EXPR" : {
+    "message" : [
+      "GROUP BY <index> refers to an expression <aggExpr> that contains an aggregate function. Aggregate functions are not allowed in GROUP BY."
+    ],
+    "sqlState" : "42000"
+  },
   "INCOMPARABLE_PIVOT_COLUMN" : {
     "message" : [
       "Invalid pivot column <columnName>. Pivot columns must be comparable."
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/AnalysisException.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/AnalysisException.scala
index 9ab0b223e11..48e1f91990b 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/AnalysisException.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/AnalysisException.scala
@@ -100,7 +100,7 @@ class AnalysisException protected[sql] (
       line = origin.line,
       startPosition = origin.startPosition,
       errorClass = Some(errorClass),
-      errorSubClass = Some(errorSubClass),
+      errorSubClass = Option(errorSubClass),
       messageParameters = messageParameters)
 
   def copy(
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
index 20c3c81b250..7458e201be2 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
@@ -366,14 +366,15 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase {
   def groupByPositionRefersToAggregateFunctionError(
       index: Int,
       expr: Expression): Throwable = {
-    new AnalysisException(s"GROUP BY $index refers to an expression that is or contains " +
-      "an aggregate function. Aggregate functions are not allowed in GROUP BY, " +
-      s"but got ${expr.sql}")
+    new AnalysisException(
+      errorClass = "GROUP_BY_POS_REFERS_AGG_EXPR",
+      messageParameters = Array(index.toString, expr.sql))
   }
 
   def groupByPositionRangeError(index: Int, size: Int): Throwable = {
-    new AnalysisException(s"GROUP BY position $index is not in select list " +
-      s"(valid range is [1, $size])")
+    new AnalysisException(
+      errorClass = "GROUP_BY_POS_OUT_OF_RANGE",
+      messageParameters = Array(index.toString, size.toString))
   }
 
   def generatorNotExpectedError(name: FunctionIdentifier, classCanonicalName: String): Throwable = {
diff --git
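The mechanism this commit migrates to, a JSON template whose named `<placeholder>` tags are filled from a positional parameter array, can be sketched as follows. This is an illustrative model: the class name and message text mirror the diff above, but `render_error` is a hypothetical helper, not Spark's actual `SparkThrowableHelper`.

```python
import re

# Mirrors the shape of core/src/main/resources/error/error-classes.json
ERROR_CLASSES = {
    "GROUP_BY_POS_OUT_OF_RANGE": {
        "message": ["GROUP BY position <index> is not in select list "
                    "(valid range is [1, <size>])."],
        "sqlState": "42000",
    },
}

def render_error(error_class, parameters):
    """Fill the <placeholder> tags of a message template from positional parameters."""
    template = "".join(ERROR_CLASSES[error_class]["message"])
    params = iter(parameters)
    # Each <tag> is replaced by the next parameter, in order, matching how
    # messageParameters = Array(index.toString, size.toString) is consumed.
    body = re.sub(r"<[a-zA-Z]+>", lambda _: next(params), template)
    return f"[{error_class}] {body}"

msg = render_error("GROUP_BY_POS_OUT_OF_RANGE", ["5", "3"])
# -> "[GROUP_BY_POS_OUT_OF_RANGE] GROUP BY position 5 is not in select list (valid range is [1, 3])."
```

Keeping the text in a data file rather than in Scala string interpolation is what makes the messages searchable and editable without touching source code, as the commit message notes.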
[spark] branch master updated (64dd81d97da -> cee30cb4994)
This is an automated email from the ASF dual-hosted git repository.

huaxingao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 64dd81d97da [SPARK-40056][BUILD] Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9
     add cee30cb4994 [SPARK-40113][SQL] Reactor ParquetScanBuilder DataSourceV2 interface implementations

No new revisions were added by this update.

Summary of changes:
 .../v2/parquet/ParquetScanBuilder.scala | 26 --
 1 file changed, 9 insertions(+), 17 deletions(-)
[spark] branch master updated (c5e93eb6631 -> 64dd81d97da)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from c5e93eb6631 [SPARK-40207][SQL] Specify the column name when the data type is not supported by datasource
     add 64dd81d97da [SPARK-40056][BUILD] Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9

No new revisions were added by this update.

Summary of changes:
 dev/.scalafmt.conf | 11 +--
 dev/scalafmt       |  2 +-
 pom.xml            |  2 +-
 3 files changed, 11 insertions(+), 4 deletions(-)
[spark] branch master updated: [SPARK-40207][SQL] Specify the column name when the data type is not supported by datasource
This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new c5e93eb6631 [SPARK-40207][SQL] Specify the column name when the data type is not supported by datasource

c5e93eb6631 is described below

commit c5e93eb6631b2620908cef0fa3b100f6e3569d68
Author: yikf
AuthorDate: Tue Aug 30 19:05:23 2022 +0800

    [SPARK-40207][SQL] Specify the column name when the data type is not supported by datasource

    ### What changes were proposed in this pull request?

    Currently, if a data type is not supported by the data source, the exception message thrown does not contain the column name, which makes the problem harder to locate. This is a minor fix that specifies the column name when the data type is not supported by the datasource.

    ### Why are the changes needed?

    More explicit error messages.

    ### Does this PR introduce _any_ user-facing change?

    No

    ### How was this patch tested?

    GA

    Closes #37574 from Yikf/void-type.

    Authored-by: yikf
    Signed-off-by: Yuming Wang
---
 .../org/apache/spark/sql/avro/AvroSuite.scala      |  6 +-
 .../spark/sql/errors/QueryCompilationErrors.scala  |  5 +-
 .../spark/sql/FileBasedDataSourceSuite.scala       | 68 +-
 .../spark/sql/hive/execution/HiveDDLSuite.scala    |  4 +-
 .../spark/sql/hive/orc/HiveOrcSourceSuite.scala    | 16 +++--
 5 files changed, 60 insertions(+), 39 deletions(-)

diff --git a/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala b/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
index bfdeb11fd8a..5e161948932 100644
--- a/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
+++ b/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
@@ -1181,14 +1181,16 @@ abstract class AvroSuite
         sql("select interval 1 days").write.format("avro").mode("overwrite").save(tempDir)
       }.getMessage
       assert(msg.contains("Cannot save interval data type into external storage.") ||
-        msg.contains("AVRO data source does not support interval data type."))
+        msg.contains("Column `INTERVAL '1' DAY` has a data type of interval day, " +
+          "which is not supported by Avro."))
 
       msg = intercept[AnalysisException] {
         spark.udf.register("testType", () => new IntervalData())
         sql("select testType()").write.format("avro").mode("overwrite").save(tempDir)
       }.getMessage
       assert(msg.toLowerCase(Locale.ROOT)
-        .contains(s"avro data source does not support interval data type."))
+        .contains("column `testtype()` has a data type of interval, " +
+          "which is not supported by avro."))
     }
   }
 }
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
index 834e0e6b214..20c3c81b250 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
@@ -1190,8 +1190,9 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase {
   }
 
   def dataTypeUnsupportedByDataSourceError(format: String, field: StructField): Throwable = {
-    new AnalysisException(
-      s"$format data source does not support ${field.dataType.catalogString} data type.")
+    new AnalysisException(s"Column `${field.name}` has a data type of " +
+      s"${field.dataType.catalogString}, which is not supported by $format."
+    )
   }
 
   def failToResolveDataSourceForTableError(table: CatalogTable, key: String): Throwable = {
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
index 15d367dba88..98cb54ccbbc 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
@@ -257,38 +257,44 @@ class FileBasedDataSourceSuite extends QueryTest
   // Text file format only supports string type
   test("SPARK-24691 error handling for unsupported types - text") {
     withTempDir { dir =>
+      def validateErrorMessage(msg: String, column: String, dt: String, format: String): Unit = {
+        val excepted = s"Column `$column` has a data type of $dt, " +
+          s"which is not supported by $format."
+        assert(msg.contains(excepted))
+      }
+
       // write path
       val textDir = new File(dir, "text").getCanonicalPath
       var msg = intercept[AnalysisException] {
         Seq(1).toDF.write.text(textDir)
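The before/after difference in the error text can be shown side by side in a small sketch. The function names here are illustrative, not Spark's API; only the two message shapes come from the diff above.

```python
def old_message(format_name, catalog_string):
    # Pre-SPARK-40207: no column name, so the offending column is hard to locate
    return f"{format_name} data source does not support {catalog_string} data type."

def new_message(format_name, column, catalog_string):
    # Post-SPARK-40207: the column name is included in backticks
    return (f"Column `{column}` has a data type of {catalog_string}, "
            f"which is not supported by {format_name}.")

print(old_message("Text", "int"))
# -> Text data source does not support int data type.
print(new_message("Text", "value", "int"))
# -> Column `value` has a data type of int, which is not supported by Text.
```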
[spark] branch master updated: [SPARK-40156][SQL][TESTS][FOLLOW-UP] `url_decode()` should return an error class
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 7a5ce219d0d [SPARK-40156][SQL][TESTS][FOLLOW-UP] `url_decode()` should return an error class

7a5ce219d0d is described below

commit 7a5ce219d0d755235931bb1e36e0195884f58588
Author: zhiming she <505306...@qq.com>
AuthorDate: Tue Aug 30 11:14:30 2022 +0300

    [SPARK-40156][SQL][TESTS][FOLLOW-UP] `url_decode()` should return an error class

    ### What changes were proposed in this pull request?

    `url_decode()` should return an error class when the input parameter is invalid, e.g.:

    ```
    spark.sql("SELECT url_decode('http%3A%2F%2spark.apache.org')").show
    ```

    output:

    ```
    org.apache.spark.SparkIllegalArgumentException: [CANNOT_DECODE_URL] Cannot decode url : http%3A%2F%2spark.apache.org.
    ```

    ### Why are the changes needed?

    To improve the user experience w/ Spark SQL.

    ### Does this PR introduce _any_ user-facing change?

    no

    ### How was this patch tested?

    ```
    $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z url-functions.sql"
    ```

    Closes #37709 from ming95/SPARK-40156-test.

    Authored-by: zhiming she <505306...@qq.com>
    Signed-off-by: Max Gekk
---
 .../main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
index d8d6139e919..8cb31f45c25 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
@@ -325,8 +325,7 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase {
       s"If necessary set ${SQLConf.ANSI_ENABLED.key} to false to bypass this error.", e)
   }
 
-  def illegalUrlError(url: UTF8String):
-    Throwable with SparkThrowable = {
+  def illegalUrlError(url: UTF8String): Throwable = {
     new SparkIllegalArgumentException(errorClass = "CANNOT_DECODE_URL",
       messageParameters = Array(url.toString)
     )
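Why does `'http%3A%2F%2spark.apache.org'` fail to decode? The third escape is `%2s`, and `s` is not a hex digit, so a strict percent-decoder (like Java's `URLDecoder`, which backs `url_decode()`) must reject it. A minimal pure-Python sketch of that behavior, written from the rules of percent-encoding rather than Spark's implementation:

```python
def strict_url_decode(s):
    """Decode percent-escapes, raising on malformed ones.

    A '%' must be followed by exactly two hex digits; anything else is an error,
    which is what produces the CANNOT_DECODE_URL message in the example above.
    """
    out, i = [], 0
    while i < len(s):
        c = s[i]
        if c == "%":
            hex_part = s[i + 1:i + 3]
            if len(hex_part) != 2 or not all(h in "0123456789abcdefABCDEF" for h in hex_part):
                raise ValueError(f"Cannot decode url : {s}")
            out.append(chr(int(hex_part, 16)))
            i += 3
        elif c == "+":
            out.append(" ")  # x-www-form-urlencoded treats '+' as a space
            i += 1
        else:
            out.append(c)
            i += 1
    return "".join(out)

print(strict_url_decode("http%3A%2F%2Fspark.apache.org"))  # http://spark.apache.org

try:
    strict_url_decode("http%3A%2F%2spark.apache.org")  # '%2s' is malformed
except ValueError as e:
    print(e)  # Cannot decode url : http%3A%2F%2spark.apache.org
```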
[spark] branch master updated (6deb8aa386a -> 44213742315)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 6deb8aa386a [SPARK-39616][FOLLOWUP] Delete the license of fommil-netlib
     add 44213742315 [SPARK-40266][DOCS] Datatype Integer instead of Long in quick-start docs

No new revisions were added by this update.

Summary of changes:
 docs/quick-start.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch master updated: [SPARK-39616][FOLLOWUP] Delete the license of fommil-netlib
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 6deb8aa386a [SPARK-39616][FOLLOWUP] Delete the license of fommil-netlib

6deb8aa386a is described below

commit 6deb8aa386a9c3e16abd58a189976709675c2341
Author: Ruifeng Zheng
AuthorDate: Tue Aug 30 16:35:06 2022 +0900

    [SPARK-39616][FOLLOWUP] Delete the license of fommil-netlib

    ### What changes were proposed in this pull request?

    Delete the license of fommil-netlib.

    ### Why are the changes needed?

    After upgrading breeze to 2.0 in https://github.com/apache/spark/pull/37002, fommil-netlib no longer exists in the Spark build.

    ### Does this PR introduce _any_ user-facing change?

    No

    ### How was this patch tested?

    Existing test suites.

    Closes #37717 from zhengruifeng/del_netlib_license.

    Authored-by: Ruifeng Zheng
    Signed-off-by: Hyukjin Kwon
---
 licenses-binary/LICENSE-netlib.txt | 49 --------------------------------------
 1 file changed, 49 deletions(-)

diff --git a/licenses-binary/LICENSE-netlib.txt b/licenses-binary/LICENSE-netlib.txt
deleted file mode 100644
index 75783ed6bc3..000
--- a/licenses-binary/LICENSE-netlib.txt
+++ /dev/null
@@ -1,49 +0,0 @@
-Copyright (c) 2013 Samuel Halliday
-Copyright (c) 1992-2011 The University of Tennessee and The University
-                        of Tennessee Research Foundation. All rights reserved.
-Copyright (c) 2000-2011 The University of California Berkeley. All rights reserved.
-Copyright (c) 2006-2011 The University of Colorado Denver. All rights reserved.
-
-$COPYRIGHT$
-
-Additional copyrights may follow
-
-$HEADER$
-
-Redistribution and use in source and binary forms, with or without
-modification, are permitted provided that the following conditions are
-met:
-
-- Redistributions of source code must retain the above copyright
-  notice, this list of conditions and the following disclaimer.
-
-- Redistributions in binary form must reproduce the above copyright
-  notice, this list of conditions and the following disclaimer listed
-  in this license in the documentation and/or other materials
-  provided with the distribution.
-
-- Neither the name of the copyright holders nor the names of its
-  contributors may be used to endorse or promote products derived from
-  this software without specific prior written permission.
-
-The copyright holders provide no reassurances that the source code
-provided does not infringe any patent, copyright, or any other
-intellectual property rights of third parties. The copyright holders
-disclaim any liability to any recipient for claims brought against
-recipient by any third party for infringement of that parties
-intellectual property rights.
-
-THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
-"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
-LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
-A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
-OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
-SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
-LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
-DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
-THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
-OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
\ No newline at end of file
[spark] branch branch-3.2 updated: [SPARK-40270][PS] Make 'compute.max_rows' as None working in DataFrame.style
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 58375a86e6f [SPARK-40270][PS] Make 'compute.max_rows' as None working in DataFrame.style

58375a86e6f is described below

commit 58375a86e6ff49c5bcee49939fbd98eb848ae59f
Author: Hyukjin Kwon
AuthorDate: Tue Aug 30 16:25:26 2022 +0900

    [SPARK-40270][PS] Make 'compute.max_rows' as None working in DataFrame.style

    ### What changes were proposed in this pull request?

    This PR makes the `compute.max_rows` option work when set to `None` in `DataFrame.style`, as expected instead of throwing an exception, by collecting everything into a pandas DataFrame.

    ### Why are the changes needed?

    To make the configuration work as expected.

    ### Does this PR introduce _any_ user-facing change?

    Yes.

    ```python
    import pyspark.pandas as ps

    ps.set_option("compute.max_rows", None)
    ps.get_option("compute.max_rows")
    ps.range(1).style
    ```

    **Before:**

    ```
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/.../spark/python/pyspark/pandas/frame.py", line 3656, in style
        pdf = self.head(max_results + 1)._to_internal_pandas()
    TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
    ```

    **After:**

    ```
    ```

    ### How was this patch tested?

    Manually tested, and a unit test was added.

    Closes #37718 from HyukjinKwon/SPARK-40270.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit 0f0e8cc26b6c80cc179368e3009d4d6c88103a64)
    Signed-off-by: Hyukjin Kwon
---
 python/pyspark/pandas/frame.py                | 16 +++++++++-------
 python/pyspark/pandas/tests/test_dataframe.py | 23 +++++++++++++++++++++++
 2 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py
index efc677b33ce..fd112357bdd 100644
--- a/python/pyspark/pandas/frame.py
+++ b/python/pyspark/pandas/frame.py
@@ -3503,19 +3503,21 @@ defaultdict(, {'col..., 'col...})]
         Property returning a Styler object containing methods for
         building a styled HTML representation for the DataFrame.
 
-        .. note:: currently it collects top 1000 rows and return its
-            pandas `pandas.io.formats.style.Styler` instance.
-
         Examples
         >>> ps.range(1001).style  # doctest: +SKIP
         """
         max_results = get_option("compute.max_rows")
-        pdf = self.head(max_results + 1)._to_internal_pandas()
-        if len(pdf) > max_results:
-            warnings.warn("'style' property will only use top %s rows." % max_results, UserWarning)
-        return pdf.head(max_results).style
+        if max_results is not None:
+            pdf = self.head(max_results + 1)._to_internal_pandas()
+            if len(pdf) > max_results:
+                warnings.warn(
+                    "'style' property will only use top %s rows." % max_results, UserWarning
+                )
+            return pdf.head(max_results).style
+        else:
+            return self._to_internal_pandas().style
 
     def set_index(
         self,
diff --git a/python/pyspark/pandas/tests/test_dataframe.py b/python/pyspark/pandas/tests/test_dataframe.py
index 27c670026d0..b4187d59ae7 100644
--- a/python/pyspark/pandas/tests/test_dataframe.py
+++ b/python/pyspark/pandas/tests/test_dataframe.py
@@ -5774,6 +5774,29 @@ class DataFrameTest(PandasOnSparkTestCase, SQLTestUtils):
         for value_psdf, value_pdf in zip(psdf, pdf):
             self.assert_eq(value_psdf, value_pdf)
 
+    def test_style(self):
+        # Currently, the `style` function returns a pandas object `Styler` as it is,
+        # processing only the number of rows declared in `compute.max_rows`.
+        # So it's a bit vague to test, but we are doing minimal tests instead of not testing at all.
+        pdf = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])
+        psdf = ps.from_pandas(pdf)
+
+        def style_negative(v, props=""):
+            return props if v < 0 else None
+
+        def check_style():
+            # If the value is negative, the text color will be displayed as red.
+            pdf_style = pdf.style.applymap(style_negative, props="color:red;")
+            psdf_style = psdf.style.applymap(style_negative, props="color:red;")
+
+            # Test whether the same shape as pandas table is created including the color.
+            self.assert_eq(pdf_style.to_latex(), psdf_style.to_latex())
+
+        check_style()
+
+        with ps.option_context("compute.max_rows", None):
+            check_style()
+
 if __name__ == "__main__":
     from pyspark.pandas.tests.test_dataframe import *  # noqa: F401

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.3 updated: [SPARK-40270][PS] Make 'compute.max_rows' as None working in DataFrame.style
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new e46d2e2d476 [SPARK-40270][PS] Make 'compute.max_rows' as None working in DataFrame.style

e46d2e2d476 is described below

commit e46d2e2d476e85024f1c53fdaf446fdd2e293d28
Author: Hyukjin Kwon
AuthorDate: Tue Aug 30 16:25:26 2022 +0900

    [SPARK-40270][PS] Make 'compute.max_rows' as None working in DataFrame.style

    ### What changes were proposed in this pull request?

    This PR makes the `compute.max_rows` option work when set to `None` in `DataFrame.style`, as expected instead of throwing an exception, by collecting everything into a pandas DataFrame.

    ### Why are the changes needed?

    To make the configuration work as expected.

    ### Does this PR introduce _any_ user-facing change?

    Yes.

    ```python
    import pyspark.pandas as ps

    ps.set_option("compute.max_rows", None)
    ps.get_option("compute.max_rows")
    ps.range(1).style
    ```

    **Before:**

    ```
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/.../spark/python/pyspark/pandas/frame.py", line 3656, in style
        pdf = self.head(max_results + 1)._to_internal_pandas()
    TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
    ```

    **After:**

    ```
    ```

    ### How was this patch tested?

    Manually tested, and a unit test was added.

    Closes #37718 from HyukjinKwon/SPARK-40270.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit 0f0e8cc26b6c80cc179368e3009d4d6c88103a64)
    Signed-off-by: Hyukjin Kwon
---
 python/pyspark/pandas/frame.py                | 16 +++++++++-------
 python/pyspark/pandas/tests/test_dataframe.py | 23 +++++++++++++++++++++++
 2 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py
index 6e8f69ad6e7..e9c5cbb9c1e 100644
--- a/python/pyspark/pandas/frame.py
+++ b/python/pyspark/pandas/frame.py
@@ -3459,19 +3459,21 @@ defaultdict(, {'col..., 'col...})]
         Property returning a Styler object containing methods for
         building a styled HTML representation for the DataFrame.
 
-        .. note:: currently it collects top 1000 rows and return its
-            pandas `pandas.io.formats.style.Styler` instance.
-
         Examples
         >>> ps.range(1001).style  # doctest: +SKIP
         """
         max_results = get_option("compute.max_rows")
-        pdf = self.head(max_results + 1)._to_internal_pandas()
-        if len(pdf) > max_results:
-            warnings.warn("'style' property will only use top %s rows." % max_results, UserWarning)
-        return pdf.head(max_results).style
+        if max_results is not None:
+            pdf = self.head(max_results + 1)._to_internal_pandas()
+            if len(pdf) > max_results:
+                warnings.warn(
+                    "'style' property will only use top %s rows." % max_results, UserWarning
+                )
+            return pdf.head(max_results).style
+        else:
+            return self._to_internal_pandas().style
 
     def set_index(
         self,
diff --git a/python/pyspark/pandas/tests/test_dataframe.py b/python/pyspark/pandas/tests/test_dataframe.py
index 1cc03bf06f8..0a7eda77564 100644
--- a/python/pyspark/pandas/tests/test_dataframe.py
+++ b/python/pyspark/pandas/tests/test_dataframe.py
@@ -6375,6 +6375,29 @@ class DataFrameTest(ComparisonTestBase, SQLTestUtils):
         psdf = ps.from_pandas(pdf)
         self.assert_eq(pdf.cov(), psdf.cov())
 
+    def test_style(self):
+        # Currently, the `style` function returns a pandas object `Styler` as it is,
+        # processing only the number of rows declared in `compute.max_rows`.
+        # So it's a bit vague to test, but we are doing minimal tests instead of not testing at all.
+        pdf = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])
+        psdf = ps.from_pandas(pdf)
+
+        def style_negative(v, props=""):
+            return props if v < 0 else None
+
+        def check_style():
+            # If the value is negative, the text color will be displayed as red.
+            pdf_style = pdf.style.applymap(style_negative, props="color:red;")
+            psdf_style = psdf.style.applymap(style_negative, props="color:red;")
+
+            # Test whether the same shape as pandas table is created including the color.
+            self.assert_eq(pdf_style.to_latex(), psdf_style.to_latex())
+
+        check_style()
+
+        with ps.option_context("compute.max_rows", None):
+            check_style()
+
 if __name__ == "__main__":
     from pyspark.pandas.tests.test_dataframe import *  # noqa: F401
[spark] branch master updated: [SPARK-40270][PS] Make 'compute.max_rows' as None working in DataFrame.style
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 0f0e8cc26b6 [SPARK-40270][PS] Make 'compute.max_rows' as None working in DataFrame.style

0f0e8cc26b6 is described below

commit 0f0e8cc26b6c80cc179368e3009d4d6c88103a64
Author: Hyukjin Kwon
AuthorDate: Tue Aug 30 16:25:26 2022 +0900

    [SPARK-40270][PS] Make 'compute.max_rows' as None working in DataFrame.style

    ### What changes were proposed in this pull request?

    This PR makes the `compute.max_rows` option work when set to `None` in `DataFrame.style`, as expected instead of throwing an exception, by collecting everything into a pandas DataFrame.

    ### Why are the changes needed?

    To make the configuration work as expected.

    ### Does this PR introduce _any_ user-facing change?

    Yes.

    ```python
    import pyspark.pandas as ps

    ps.set_option("compute.max_rows", None)
    ps.get_option("compute.max_rows")
    ps.range(1).style
    ```

    **Before:**

    ```
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/.../spark/python/pyspark/pandas/frame.py", line 3656, in style
        pdf = self.head(max_results + 1)._to_internal_pandas()
    TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
    ```

    **After:**

    ```
    ```

    ### How was this patch tested?

    Manually tested, and a unit test was added.

    Closes #37718 from HyukjinKwon/SPARK-40270.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Hyukjin Kwon
---
 python/pyspark/pandas/frame.py                | 16 +++++++++-------
 python/pyspark/pandas/tests/test_dataframe.py | 16 +++++++++++-----
 2 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py
index ba5df94c86c..8fc425e88a3 100644
--- a/python/pyspark/pandas/frame.py
+++ b/python/pyspark/pandas/frame.py
@@ -3754,19 +3754,21 @@ defaultdict(, {'col..., 'col...})]
         Property returning a Styler object containing methods for
         building a styled HTML representation for the DataFrame.
 
-        .. note:: currently it collects top 1000 rows and return its
-            pandas `pandas.io.formats.style.Styler` instance.
-
         Examples
         >>> ps.range(1001).style  # doctest: +SKIP
         """
         max_results = get_option("compute.max_rows")
-        pdf = self.head(max_results + 1)._to_internal_pandas()
-        if len(pdf) > max_results:
-            warnings.warn("'style' property will only use top %s rows." % max_results, UserWarning)
-        return pdf.head(max_results).style
+        if max_results is not None:
+            pdf = self.head(max_results + 1)._to_internal_pandas()
+            if len(pdf) > max_results:
+                warnings.warn(
+                    "'style' property will only use top %s rows." % max_results, UserWarning
+                )
+            return pdf.head(max_results).style
+        else:
+            return self._to_internal_pandas().style
 
     def set_index(
         self,
diff --git a/python/pyspark/pandas/tests/test_dataframe.py b/python/pyspark/pandas/tests/test_dataframe.py
index 2ab908fed00..34480152f8c 100644
--- a/python/pyspark/pandas/tests/test_dataframe.py
+++ b/python/pyspark/pandas/tests/test_dataframe.py
@@ -6904,12 +6904,18 @@ class DataFrameTest(ComparisonTestBase, SQLTestUtils):
         def style_negative(v, props=""):
             return props if v < 0 else None
 
-        # If the value is negative, the text color will be displayed as red.
-        pdf_style = pdf.style.applymap(style_negative, props="color:red;")
-        psdf_style = psdf.style.applymap(style_negative, props="color:red;")
+        def check_style():
+            # If the value is negative, the text color will be displayed as red.
+            pdf_style = pdf.style.applymap(style_negative, props="color:red;")
+            psdf_style = psdf.style.applymap(style_negative, props="color:red;")
 
-        # Test whether the same shape as pandas table is created including the color.
-        self.assert_eq(pdf_style.to_latex(), psdf_style.to_latex())
+            # Test whether the same shape as pandas table is created including the color.
+            self.assert_eq(pdf_style.to_latex(), psdf_style.to_latex())
+
+        check_style()
+
+        with ps.option_context("compute.max_rows", None):
+            check_style()
 
 if __name__ == "__main__":
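Stripped of the pandas-on-Spark plumbing, the patched `style` logic reduces to the row-selection sketch below. This is an illustration only: `rows_for_style` and the plain-list input are stand-ins for the real property, which calls `self.head(max_results + 1)._to_internal_pandas()` and returns `pdf.style`.

```python
import warnings


def rows_for_style(rows, max_results):
    """Pick the rows the `style` property would render.

    `max_results` models the 'compute.max_rows' option; None means unlimited.
    """
    if max_results is not None:
        # Fetch one extra row so we can tell whether truncation happened.
        head = rows[: max_results + 1]
        if len(head) > max_results:
            warnings.warn(
                "'style' property will only use top %s rows." % max_results,
                UserWarning,
            )
            return head[:max_results]
        return head
    # The fix: with None, use everything instead of computing None + 1,
    # which is what raised the TypeError before this commit.
    return rows


all_rows = list(range(5))
assert rows_for_style(all_rows, None) == all_rows  # previously a TypeError
```

The key point is simply that the `max_results + 1` arithmetic is now guarded by the `is not None` check, so the unlimited case takes a separate branch.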
[spark] branch master updated: [SPARK-39915][SQL] Dataset.repartition(N) may not create N partitions Non-AQE part
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new ff7ab34a695 [SPARK-39915][SQL] Dataset.repartition(N) may not create N partitions Non-AQE part

ff7ab34a695 is described below

commit ff7ab34a6957965d74504ed1d36de21d1ad7319c
Author: ulysses-you
AuthorDate: Tue Aug 30 14:30:41 2022 +0800

    [SPARK-39915][SQL] Dataset.repartition(N) may not create N partitions Non-AQE part

    ### What changes were proposed in this pull request?

    Skip optimizing the root user-specified repartition in `PropagateEmptyRelation`.

    ### Why are the changes needed?

    Spark should preserve the final repartition, which is user-specified and determines the final output partitioning. For example:

    ```scala
    spark.sql("select * from values(1) where 1 < rand()").repartition(1)

    // before:
    == Optimized Logical Plan ==
    LocalTableScan <empty>, [col1#0]

    // after:
    == Optimized Logical Plan ==
    Repartition 1, true
    +- LocalRelation <empty>, [col1#0]
    ```

    ### Does this PR introduce _any_ user-facing change?

    Yes, the optimized plan for an empty relation may change.

    ### How was this patch tested?

    Added a test.

    Closes #37706 from ulysses-you/empty.

    Authored-by: ulysses-you
    Signed-off-by: Wenchen Fan
---
 .../apache/spark/sql/catalyst/dsl/package.scala    |  3 +++
 .../optimizer/PropagateEmptyRelation.scala         | 42 +++++++++++++++++++++++++++++++++++++++---
 .../optimizer/PropagateEmptyRelationSuite.scala    | 38 ++++++++++++++++++++++++++++++++++++++
 .../adaptive/AQEPropagateEmptyRelation.scala       |  2 +-
 .../org/apache/spark/sql/DataFrameSuite.scala      |  7 +++++++
 5 files changed, 88 insertions(+), 4 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
index 118d3e85b71..86d85abc6f3 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
@@ -501,6 +501,9 @@ package object dsl {
     def repartition(num: Integer): LogicalPlan =
       Repartition(num, shuffle = true, logicalPlan)
 
+    def repartition(): LogicalPlan =
+      RepartitionByExpression(Seq.empty, logicalPlan, None)
+
     def distribute(exprs: Expression*)(n: Int): LogicalPlan =
       RepartitionByExpression(exprs, logicalPlan, numPartitions = n)
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelation.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelation.scala
index 18c344f10f6..9e864d036ef 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelation.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelation.scala
@@ -23,6 +23,7 @@ import org.apache.spark.sql.catalyst.expressions.Literal.FalseLiteral
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.trees.TreeNodeTag
 import org.apache.spark.sql.catalyst.trees.TreePattern.{LOCAL_RELATION, TRUE_OR_FALSE_LITERAL}
 
 /**
@@ -44,6 +45,9 @@ import org.apache.spark.sql.catalyst.trees.TreePattern.{LOCAL_RELATION, TRUE_OR_FALSE_LITERAL}
  * - Generate(Explode) with all empty children. Others like Hive UDTF may return results.
  */
 abstract class PropagateEmptyRelationBase extends Rule[LogicalPlan] with CastSupport {
+  // This tag is used to mark a repartition as a root repartition which is user-specified
+  private[sql] val ROOT_REPARTITION = TreeNodeTag[Unit]("ROOT_REPARTITION")
+
   protected def isEmpty(plan: LogicalPlan): Boolean = plan match {
     case p: LocalRelation => p.data.isEmpty
     case _ => false
@@ -137,8 +141,13 @@ abstract class PropagateEmptyRelationBase extends Rule[LogicalPlan] with CastSupport {
       case _: GlobalLimit if !p.isStreaming => empty(p)
       case _: LocalLimit if !p.isStreaming => empty(p)
       case _: Offset => empty(p)
-      case _: Repartition => empty(p)
-      case _: RepartitionByExpression => empty(p)
+      case _: RepartitionOperation =>
+        if (p.getTagValue(ROOT_REPARTITION).isEmpty) {
+          empty(p)
+        } else {
+          p.unsetTagValue(ROOT_REPARTITION)
+          p
+        }
       case _: RebalancePartitions => empty(p)
       // An aggregate with non-empty group expression will return one output row per group when the
       // input to the aggregate is not empty. If the input to the aggregate is empty then all groups
@@ -162,13 +171,40 @@ abstract class PropagateEmptyRelationBase extends Rule[LogicalPlan] with CastSupport
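The control flow of the changed `RepartitionOperation` case can be modeled with a toy plan tree. The sketch below is an illustration of the tagging idea only, not Catalyst code: `Plan`, `tag_root_repartition`, and `propagate_empty` are simplified stand-ins for `LogicalPlan`, the optimizer's tagging pass, and the `PropagateEmptyRelation` rule.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    name: str
    children: list = field(default_factory=list)
    tags: set = field(default_factory=set)


EMPTY = "LocalRelation <empty>"


def tag_root_repartition(plan: Plan) -> None:
    # Walk down from the root, tagging user-specified repartitions so the
    # empty-relation rule below leaves them in place.
    node = plan
    while node.name == "Repartition":
        node.tags.add("ROOT_REPARTITION")
        node = node.children[0]


def propagate_empty(plan: Plan) -> Plan:
    plan.children = [propagate_empty(c) for c in plan.children]
    if plan.children and all(c.name == EMPTY for c in plan.children):
        if plan.name == "Repartition" and "ROOT_REPARTITION" in plan.tags:
            return plan  # preserve the user-specified output partitioning
        return Plan(EMPTY)  # otherwise collapse to an empty relation
    return plan


# A filter under a user-specified repartition collapses to an empty
# relation, but the root repartition itself survives.
root = Plan("Repartition", [Plan("Filter", [Plan(EMPTY)])])
tag_root_repartition(root)
optimized = propagate_empty(root)
```

The design point mirrors the real patch: rather than special-casing plan shapes inside the rule's match, the root repartition is marked ahead of time with a tree-node tag, and the rule only has to check (and clear) that tag when it reaches a repartition over an empty child.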
[spark] branch master updated (ca17135b288 -> e0cb2eb104e)
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

    from ca17135b288 [SPARK-34777][UI] Fix StagePage input size/records not show when records greater than zero
     add e0cb2eb104e [SPARK-40135][PS] Support `data` mixed with `index` in DataFrame creation

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/frame.py                | 177 +++++++++++++++++------
 python/pyspark/pandas/tests/test_dataframe.py | 239 +++++++++++++++++++++++++-
 2 files changed, 377 insertions(+), 39 deletions(-)