[spark] branch master updated: [SPARK-35845][SQL] OuterReference resolution should reject ambiguous column names

2021-06-22 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 20edfdd  [SPARK-35845][SQL] OuterReference resolution should reject 
ambiguous column names
20edfdd is described below

commit 20edfdd39a83c52813f91e4028f816d06a6be99e
Author: Wenchen Fan 
AuthorDate: Wed Jun 23 14:32:34 2021 +0800

[SPARK-35845][SQL] OuterReference resolution should reject ambiguous column 
names

### What changes were proposed in this pull request?

The current OuterReference resolution is a bit weird: when the outer plan 
has more than one child, it resolves OuterReference from the output of each 
child, one by one, left to right.

This is incorrect in the case of a join, as the column name can be ambiguous 
if both the left and right sides output that column.

This PR fixes this bug by resolving OuterReference with 
`outerPlan.resolveChildren` instead of something like 
`outerPlan.children.foreach(_.resolve(...))`.

### Why are the changes needed?

bug fix

### Does this PR introduce _any_ user-facing change?

The problem only occurs in joins, and join conditions don't support 
correlated subqueries yet, so this PR only improves the error message. Before 
this PR, people see
```
java.lang.UnsupportedOperationException
Cannot generate code for expression: outer(t1a#291)
```
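
For illustration, here is a minimal sketch of the kind of query that hits this 
path (the table and column names below are hypothetical, not taken from the 
PR's tests): a correlated reference inside a join whose two children both 
expose a column with that name.

```
// Hypothetical spark-shell snippet: t1 and t2 both expose a column named `c`,
// so the bare reference `c` inside the correlated subquery is ambiguous when
// resolved against the outer join. Before this PR the analyzer bound it to the
// first child, left to right, and the failure only surfaced at code generation;
// with `outerPlan.resolveChildren` the query now fails analysis with a clearer error.
spark.sql("""
  SELECT *
  FROM   t1
  JOIN   t2
  ON     t1.c = t2.c
     AND EXISTS (SELECT 1 FROM t3 WHERE t3.x = c)
""")
```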

### How was this patch tested?

a new test

Closes #33004 from cloud-fan/outer-ref.

Authored-by: Wenchen Fan 
Signed-off-by: Gengliang Wang 
---
 .../spark/sql/catalyst/analysis/Analyzer.scala | 35 +++---
 .../catalyst/optimizer/DecorrelateInnerQuery.scala | 10 ++-
 .../spark/sql/catalyst/optimizer/subquery.scala| 26 
 .../optimizer/DecorrelateInnerQuerySuite.scala |  6 ++--
 .../negative-cases/invalid-correlation.sql |  9 ++
 .../negative-cases/invalid-correlation.sql.out | 24 ++-
 6 files changed, 68 insertions(+), 42 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index 555be01..ba680ba 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -2285,8 +2285,8 @@ class Analyzer(override val catalogManager: 
CatalogManager)
 }
 
 /**
- * Resolve the correlated expressions in a subquery by using the an outer 
plans' references. All
- * resolved outer references are wrapped in an [[OuterReference]]
+ * Resolve the correlated expressions in a subquery, as if the expressions 
live in the outer
+ * plan. All resolved outer references are wrapped in an [[OuterReference]]
  */
 private def resolveOuterReferences(plan: LogicalPlan, outer: LogicalPlan): 
LogicalPlan = {
   
plan.resolveOperatorsDownWithPruning(_.containsPattern(UNRESOLVED_ATTRIBUTE)) {
@@ -2295,7 +2295,7 @@ class Analyzer(override val catalogManager: 
CatalogManager)
 case u @ UnresolvedAttribute(nameParts) =>
   withPosition(u) {
 try {
-  outer.resolve(nameParts, resolver) match {
+  outer.resolveChildren(nameParts, resolver) match {
 case Some(outerAttr) => wrapOuterReference(outerAttr)
 case None => u
   }
@@ -2317,7 +2317,7 @@ class Analyzer(override val catalogManager: 
CatalogManager)
  */
 private def resolveSubQuery(
 e: SubqueryExpression,
-plans: Seq[LogicalPlan])(
+outer: LogicalPlan)(
 f: (LogicalPlan, Seq[Expression]) => SubqueryExpression): 
SubqueryExpression = {
   // Step 1: Resolve the outer expressions.
   var previous: LogicalPlan = null
@@ -2328,10 +2328,8 @@ class Analyzer(override val catalogManager: 
CatalogManager)
 current = executeSameContext(current)
 
 // Use the outer references to resolve the subquery plan if it isn't 
resolved yet.
-val i = plans.iterator
-val afterResolve = current
-while (!current.resolved && current.fastEquals(afterResolve) && 
i.hasNext) {
-  current = resolveOuterReferences(current, i.next())
+if (!current.resolved) {
+  current = resolveOuterReferences(current, outer)
 }
   } while (!current.resolved && !current.fastEquals(previous))
 
@@ -2354,20 +2352,20 @@ class Analyzer(override val catalogManager: 
CatalogManager)
  * (2) Any aggregate expression(s) that reference outer attributes are 
pushed down to
  * outer plan to get evaluated.
  */
-private def resolveSubQueries(plan: LogicalPlan, pl

[spark] branch master updated: [SPARK-35772][SQL][TESTS] Check all year-month interval types in `HiveInspectors` tests

2021-06-22 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new df55945  [SPARK-35772][SQL][TESTS] Check all year-month interval types 
in `HiveInspectors` tests
df55945 is described below

commit df55945804918f4d147dcef7a9b5f18bff4cabc9
Author: Angerszh 
AuthorDate: Wed Jun 23 08:54:07 2021 +0300

[SPARK-35772][SQL][TESTS] Check all year-month interval types in 
`HiveInspectors` tests

### What changes were proposed in this pull request?
Check all year-month interval types in HiveInspectors tests.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT.

Closes #32970 from AngersZh/SPARK-35772.

Authored-by: Angerszh 
Signed-off-by: Max Gekk 
---
 .../execution/HiveScriptTransformationSuite.scala  | 46 --
 1 file changed, 35 insertions(+), 11 deletions(-)

diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala
index 8cea781..d84a766 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala
@@ -32,6 +32,7 @@ import org.apache.spark.sql.execution._
 import org.apache.spark.sql.functions._
 import org.apache.spark.sql.hive.test.TestHiveSingleton
 import org.apache.spark.sql.types._
+import org.apache.spark.sql.types.YearMonthIntervalType._
 import org.apache.spark.unsafe.types.CalendarInterval
 
 class HiveScriptTransformationSuite extends BaseScriptTransformationSuite with 
TestHiveSingleton {
@@ -521,22 +522,20 @@ class HiveScriptTransformationSuite extends 
BaseScriptTransformationSuite with T
 
   }
 
-  test("SPARK-34879: HiveInspectors supports DayTimeIntervalType and 
YearMonthIntervalType") {
+  test("SPARK-34879: HiveInspectors supports DayTimeIntervalType") {
 assume(TestUtils.testCommandAvailable("/bin/bash"))
 withTempView("v") {
   val df = Seq(
 (Duration.ofDays(1),
   Duration.ofSeconds(100).plusNanos(123456),
-  Duration.of(Long.MaxValue, ChronoUnit.MICROS),
-  Period.ofMonths(10)),
+  Duration.of(Long.MaxValue, ChronoUnit.MICROS)),
 (Duration.ofDays(1),
   Duration.ofSeconds(100).plusNanos(1123456789),
-  Duration.ofSeconds(Long.MaxValue / 
DateTimeConstants.MICROS_PER_SECOND),
-  Period.ofMonths(10))
-  ).toDF("a", "b", "c", "d")
+  Duration.ofSeconds(Long.MaxValue / 
DateTimeConstants.MICROS_PER_SECOND))
+  ).toDF("a", "b", "c")
   df.createTempView("v")
 
-  // Hive serde supports DayTimeIntervalType/YearMonthIntervalType as 
input and output data type
+  // Hive serde supports DayTimeIntervalType as input and output data type
   checkAnswer(
 df,
 (child: SparkPlan) => createScriptTransformationExec(
@@ -545,12 +544,37 @@ class HiveScriptTransformationSuite extends 
BaseScriptTransformationSuite with T
 // TODO(SPARK-35733): Check all day-time interval types in 
HiveInspectors tests
 AttributeReference("a", DayTimeIntervalType())(),
 AttributeReference("b", DayTimeIntervalType())(),
-AttributeReference("c", DayTimeIntervalType())(),
-// TODO(SPARK-35772): Check all year-month interval types in 
HiveInspectors tests
-AttributeReference("d", YearMonthIntervalType())()),
+AttributeReference("c", DayTimeIntervalType())()),
+  child = child,
+  ioschema = hiveIOSchema),
+df.select($"a", $"b", $"c").collect())
+}
+  }
+
+  test("SPARK-35722: HiveInspectors supports all type of 
YearMonthIntervalType") {
+assume(TestUtils.testCommandAvailable("/bin/bash"))
+withTempView("v") {
+  val schema = StructType(Seq(
+StructField("a", YearMonthIntervalType(YEAR)),
+StructField("b", YearMonthIntervalType(YEAR, MONTH)),
+StructField("c", YearMonthIntervalType(MONTH))
+  ))
+  val df = spark.createDataFrame(sparkContext.parallelize(Seq(
+Row(Period.ofMonths(13), Period.ofMonths(13), Period.ofMonths(13))
+  )), schema)
+
+  // Hive serde supports YearMonthIntervalType as input and output data 
type
+  checkAnswer(
+df,
+(child: SparkPlan) => createScriptTransformationExec(
+  script = "cat",
+  output = Seq(
+AttributeReference("a", YearMonthIntervalType(YEAR))(),
+AttributeReference("b", YearMonthIntervalType(YEAR, MONTH))(),
+AttributeReference("c"

[spark] branch master updated (960a7e5 -> 4416b4b)

2021-06-22 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 960a7e5  [SPARK-35856][SQL][TESTS] Move new interval type test cases 
from CastSuite to CastBaseSuite
 add 4416b4b  [SPARK-35734][SQL][FOLLOWUP] 
IntervalUtils.toDayTimeIntervalString should consider the case a day-time type 
is casted as another day-time type

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/util/IntervalUtils.scala| 148 +++--
 .../sql/catalyst/util/IntervalUtilsSuite.scala |  58 +---
 2 files changed, 119 insertions(+), 87 deletions(-)




[spark] branch master updated (a87ee5d -> 960a7e5)

2021-06-22 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from a87ee5d  [SPARK-35695][SQL][FOLLOWUP] Use AQE helper to simplify the 
code in CollectMetricsExec
 add 960a7e5  [SPARK-35856][SQL][TESTS] Move new interval type test cases 
from CastSuite to CastBaseSuite

No new revisions were added by this update.

Summary of changes:
 .../catalyst/expressions/AnsiCastSuiteBase.scala   |   8 ++
 .../spark/sql/catalyst/expressions/CastSuite.scala | 124 +--
 .../sql/catalyst/expressions/CastSuiteBase.scala   | 133 -
 .../sql/catalyst/expressions/TryCastSuite.scala|   2 +-
 4 files changed, 142 insertions(+), 125 deletions(-)




[spark] branch branch-3.0 updated (92bf3f3 -> ea28301)

2021-06-22 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 92bf3f3  [SPARK-35796][TESTS] Fix SparkSubmitSuite failure on MacOS 
10.15+
 add ea28301  [SPARK-35695][SQL][FOLLOWUP] Use AQE helper to simplify the 
code in CollectMetricsExec

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/execution/CollectMetricsExec.scala | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)




[spark] branch branch-3.1 updated: [SPARK-35695][SQL][FOLLOWUP] Use AQE helper to simplify the code in CollectMetricsExec

2021-06-22 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new d1b51d7  [SPARK-35695][SQL][FOLLOWUP] Use AQE helper to simplify the 
code in CollectMetricsExec
d1b51d7 is described below

commit d1b51d7a0b126ec4657986b0b5bea069dfbf0887
Author: Wenchen Fan 
AuthorDate: Wed Jun 23 09:54:12 2021 +0900

[SPARK-35695][SQL][FOLLOWUP] Use AQE helper to simplify the code in 
CollectMetricsExec

### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/32862, to 
simplify the code with the AQE helper.

### Why are the changes needed?

code cleanup

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #33026 from cloud-fan/follow.

Authored-by: Wenchen Fan 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit a87ee5d8b9dc00e327edc9911c21225c09042acd)
Signed-off-by: Hyukjin Kwon 
---
 .../org/apache/spark/sql/execution/CollectMetricsExec.scala | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/CollectMetricsExec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/CollectMetricsExec.scala
index 7a59504..c9e1bc3 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/CollectMetricsExec.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/CollectMetricsExec.scala
@@ -22,7 +22,7 @@ import org.apache.spark.sql.Row
 import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow}
 import org.apache.spark.sql.catalyst.expressions.{Attribute, NamedExpression, 
SortOrder}
 import org.apache.spark.sql.catalyst.plans.physical.Partitioning
-import org.apache.spark.sql.execution.adaptive.{AdaptiveSparkPlanExec, 
QueryStageExec}
+import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper
 import org.apache.spark.sql.execution.columnar.InMemoryTableScanExec
 import org.apache.spark.sql.types.StructType
 
@@ -86,19 +86,16 @@ case class CollectMetricsExec(
   }
 }
 
-object CollectMetricsExec {
+object CollectMetricsExec extends AdaptiveSparkPlanHelper {
   /**
* Recursively collect all collected metrics from a query tree.
*/
   def collect(plan: SparkPlan): Map[String, Row] = {
-val metrics = plan.collectWithSubqueries {
-  case collector: CollectMetricsExec => Map(collector.name -> 
collector.collectedMetrics)
+val metrics = collectWithSubqueries(plan) {
+  case collector: CollectMetricsExec =>
+Map(collector.name -> collector.collectedMetrics)
   case tableScan: InMemoryTableScanExec =>
 CollectMetricsExec.collect(tableScan.relation.cachedPlan)
-  case adaptivePlan: AdaptiveSparkPlanExec =>
-CollectMetricsExec.collect(adaptivePlan.executedPlan)
-  case queryStageExec: QueryStageExec =>
-CollectMetricsExec.collect(queryStageExec.plan)
 }
 metrics.reduceOption(_ ++ _).getOrElse(Map.empty)
   }




[spark] branch master updated (68b54b7 -> a87ee5d)

2021-06-22 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 68b54b7  [SPARK-35473][PYTHON] Fix disallow_untyped_defs mypy checks 
for pyspark.pandas.groupby
 add a87ee5d  [SPARK-35695][SQL][FOLLOWUP] Use AQE helper to simplify the 
code in CollectMetricsExec

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/execution/CollectMetricsExec.scala | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)




[spark] branch master updated (7a21e9c -> 68b54b7)

2021-06-22 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 7a21e9c  [SPARK-35858][SQL] SparkPlan.makeCopy should not set the 
active session
 add 68b54b7  [SPARK-35473][PYTHON] Fix disallow_untyped_defs mypy checks 
for pyspark.pandas.groupby

No new revisions were added by this update.

Summary of changes:
 python/mypy.ini  |   3 -
 python/pyspark/pandas/base.py|   6 +-
 python/pyspark/pandas/frame.py   |   8 ++
 python/pyspark/pandas/generic.py |  23 ++--
 python/pyspark/pandas/groupby.py | 270 ++-
 python/pyspark/pandas/series.py  |  23 +++-
 6 files changed, 198 insertions(+), 135 deletions(-)




[spark] branch master updated (a2c1a55 -> 7a21e9c)

2021-06-22 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from a2c1a55  [SPARK-35700][SQL][FOLLOWUP] Read schema from ORC files 
should strip CHAR/VARCHAR types
 add 7a21e9c  [SPARK-35858][SQL] SparkPlan.makeCopy should not set the 
active session

No new revisions were added by this update.

Summary of changes:
 .../src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala| 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)




[spark] branch branch-3.1 updated: [SPARK-35700][SQL][FOLLOWUP] Read schema from ORC files should strip CHAR/VARCHAR types

2021-06-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new fd7df35  [SPARK-35700][SQL][FOLLOWUP] Read schema from ORC files 
should strip CHAR/VARCHAR types
fd7df35 is described below

commit fd7df35f7df4cfe55fecfa44fed70fe967791d9b
Author: Wenchen Fan 
AuthorDate: Tue Jun 22 13:50:49 2021 -0700

[SPARK-35700][SQL][FOLLOWUP] Read schema from ORC files should strip 
CHAR/VARCHAR types

This is a followup of https://github.com/apache/spark/pull/33001, to 
provide a more direct fix.

The regression in 3.1 was caused by the fact that we changed the parser and 
allowed it to return CHAR/VARCHAR types. We should have replaced 
CHAR/VARCHAR with STRING before the data type flows into the query engine; 
however, `OrcUtils` was missed.

When reading ORC files, the task side reads the real schema from the 
ORC file metadata and then applies filter pushdown. For some reason, the 
implementation turns the ORC schema into a Spark schema before filter pushdown, 
and this step does not strip CHAR/VARCHAR. Note that for Parquet we use the 
Parquet schema directly in filter pushdown, so it does not have this problem.

This PR proposes to replace CHAR/VARCHAR with STRING when turning the ORC 
schema into a Spark schema.
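
As a rough illustration of the stripping step (a sketch only; the schema below 
is made up, while `CharVarcharUtils.replaceCharVarcharWithStringInSchema` is the 
helper this PR wires into `OrcUtils`):

```
import org.apache.spark.sql.catalyst.util.CharVarcharUtils
import org.apache.spark.sql.types._

// Hypothetical schema as it might be parsed back from ORC file metadata.
val orcSchema = StructType(Seq(
  StructField("name", VarcharType(10)),
  StructField("code", CharType(2)),
  StructField("id", IntegerType)))

// CHAR/VARCHAR fields are replaced with plain STRING before the schema is
// handed to filter pushdown; all other fields are left untouched.
val stripped = CharVarcharUtils.replaceCharVarcharWithStringInSchema(orcSchema)
// stripped("name").dataType == StringType
// stripped("code").dataType == StringType
```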

A more direct bug fix.

no

existing tests

Closes #33030 from cloud-fan/help.

Authored-by: Wenchen Fan 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit a2c1a55b1fed5d552f6bc355ba3c542dfeee5a91)
Signed-off-by: Dongjoon Hyun 
---
 .../sql/execution/datasources/orc/OrcFilters.scala  |  2 +-
 .../spark/sql/execution/datasources/orc/OrcUtils.scala  | 17 +++--
 .../org/apache/spark/sql/HiveCharVarcharTestSuite.scala |  1 -
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala
index f54cff2..9511fc3 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala
@@ -143,7 +143,7 @@ private[sql] object OrcFilters extends OrcFiltersBase {
 case BooleanType => PredicateLeaf.Type.BOOLEAN
 case ByteType | ShortType | IntegerType | LongType => 
PredicateLeaf.Type.LONG
 case FloatType | DoubleType => PredicateLeaf.Type.FLOAT
-case StringType | _: CharType | _: VarcharType => PredicateLeaf.Type.STRING
+case StringType => PredicateLeaf.Type.STRING
 case DateType => PredicateLeaf.Type.DATE
 case TimestampType => PredicateLeaf.Type.TIMESTAMP
 case _: DecimalType => PredicateLeaf.Type.DECIMAL
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
index 623f4f7..2190f7e 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
@@ -32,7 +32,7 @@ import org.apache.spark.internal.Logging
 import org.apache.spark.sql.{SPARK_VERSION_METADATA_KEY, SparkSession}
 import org.apache.spark.sql.catalyst.analysis.caseSensitiveResolution
 import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
-import org.apache.spark.sql.catalyst.util.quoteIdentifier
+import org.apache.spark.sql.catalyst.util.{quoteIdentifier, CharVarcharUtils}
 import org.apache.spark.sql.execution.datasources.SchemaMergeUtils
 import org.apache.spark.sql.types._
 import org.apache.spark.util.{ThreadUtils, Utils}
@@ -81,6 +81,13 @@ object OrcUtils extends Logging {
 }
   }
 
+  private def toCatalystSchema(schema: TypeDescription): StructType = {
+// The Spark query engine has not completely supported CHAR/VARCHAR type 
yet, and here we
+// replace the orc CHAR/VARCHAR with STRING type.
+CharVarcharUtils.replaceCharVarcharWithStringInSchema(
+  
CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType])
+  }
+
   def readSchema(sparkSession: SparkSession, files: Seq[FileStatus], options: 
Map[String, String])
   : Option[StructType] = {
 val ignoreCorruptFiles = sparkSession.sessionState.conf.ignoreCorruptFiles
@@ -88,7 +95,7 @@ object OrcUtils extends Logging {
 files.toIterator.map(file => readSchema(file.getPath, conf, 
ignoreCorruptFiles)).collectFirst {
   case Some(schema) =>
 logDebug(s"Reading schema from file $files, got Hive schema string: 
$schema")
-
CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]
+toCatalystSchema(schema)
 }
   }
 
@@ -

[spark] branch master updated (c418803 -> a2c1a55)

2021-06-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from c418803  [SPARK-35847][PYTHON] Manage InternalField in 
DataTypeOps.isnull
 add a2c1a55  [SPARK-35700][SQL][FOLLOWUP] Read schema from ORC files 
should strip CHAR/VARCHAR types

No new revisions were added by this update.

Summary of changes:
 .../sql/execution/datasources/orc/OrcFilters.scala  |  2 +-
 .../spark/sql/execution/datasources/orc/OrcUtils.scala  | 17 +++--
 .../org/apache/spark/sql/HiveCharVarcharTestSuite.scala |  1 -
 3 files changed, 12 insertions(+), 8 deletions(-)




[spark] branch master updated: [SPARK-35847][PYTHON] Manage InternalField in DataTypeOps.isnull

2021-06-22 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c418803  [SPARK-35847][PYTHON] Manage InternalField in 
DataTypeOps.isnull
c418803 is described below

commit c418803df7723d3bebce7792774d2b761a83be40
Author: Takuya UESHIN 
AuthorDate: Tue Jun 22 12:54:01 2021 -0700

[SPARK-35847][PYTHON] Manage InternalField in DataTypeOps.isnull

### What changes were proposed in this pull request?

Properly set `InternalField` for `DataTypeOps.isnull`.

### Why are the changes needed?

The result of `DataTypeOps.isnull` must always be a non-nullable boolean.
We should manage `InternalField` for this case.
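
On the Spark SQL side the underlying expression is already non-nullable; a 
quick Scala sketch (assuming a running `SparkSession` named `spark`) shows the 
behaviour the managed `InternalField` should mirror:

```
import org.apache.spark.sql.functions.col

// `isNull` always produces BooleanType with nullable = false, regardless of
// the input column; the PR sets the pandas-on-Spark InternalField
// (dtype/spark_type/nullable) to match instead of leaving it unspecified.
val df = spark.range(3).select(col("id").isNull.alias("is_null"))
df.schema("is_null").dataType   // BooleanType
df.schema("is_null").nullable   // false
```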

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added some more tests.

Closes #33005 from ueshin/issues/SPARK-35847/isnull_field.

Authored-by: Takuya UESHIN 
Signed-off-by: Takuya UESHIN 
---
 python/pyspark/pandas/data_type_ops/base.py|  7 +++-
 python/pyspark/pandas/data_type_ops/num_ops.py | 12 +-
 .../pandas/tests/data_type_ops/test_num_ops.py | 46 +-
 .../pandas/tests/data_type_ops/test_string_ops.py  |  3 ++
 4 files changed, 55 insertions(+), 13 deletions(-)

diff --git a/python/pyspark/pandas/data_type_ops/base.py 
b/python/pyspark/pandas/data_type_ops/base.py
index 0a411ec..3eef0db 100644
--- a/python/pyspark/pandas/data_type_ops/base.py
+++ b/python/pyspark/pandas/data_type_ops/base.py
@@ -333,7 +333,12 @@ class DataTypeOps(object, metaclass=ABCMeta):
 return col.replace({np.nan: None})
 
 def isnull(self, index_ops: T_IndexOps) -> T_IndexOps:
-return index_ops._with_new_scol(index_ops.spark.column.isNull())
+return index_ops._with_new_scol(
+index_ops.spark.column.isNull(),
+field=index_ops._internal.data_fields[0].copy(
+dtype=np.dtype("bool"), spark_type=BooleanType(), 
nullable=False
+),
+)
 
 def astype(self, index_ops: T_IndexOps, dtype: Union[str, type, Dtype]) -> 
T_IndexOps:
 raise TypeError("astype can not be applied to %s." % self.pretty_name)
diff --git a/python/pyspark/pandas/data_type_ops/num_ops.py 
b/python/pyspark/pandas/data_type_ops/num_ops.py
index 851ee2a..7f7a17f 100644
--- a/python/pyspark/pandas/data_type_ops/num_ops.py
+++ b/python/pyspark/pandas/data_type_ops/num_ops.py
@@ -367,7 +367,10 @@ class FractionalOps(NumericOps):
 
 def isnull(self, index_ops: T_IndexOps) -> T_IndexOps:
 return index_ops._with_new_scol(
-index_ops.spark.column.isNull() | F.isnan(index_ops.spark.column)
+index_ops.spark.column.isNull() | F.isnan(index_ops.spark.column),
+field=index_ops._internal.data_fields[0].copy(
+dtype=np.dtype("bool"), spark_type=BooleanType(), 
nullable=False
+),
 )
 
 def astype(self, index_ops: T_IndexOps, dtype: Union[str, type, Dtype]) -> 
T_IndexOps:
@@ -404,7 +407,12 @@ class DecimalOps(FractionalOps):
 return "decimal"
 
 def isnull(self, index_ops: T_IndexOps) -> T_IndexOps:
-return index_ops._with_new_scol(index_ops.spark.column.isNull())
+return index_ops._with_new_scol(
+index_ops.spark.column.isNull(),
+field=index_ops._internal.data_fields[0].copy(
+dtype=np.dtype("bool"), spark_type=BooleanType(), 
nullable=False
+),
+)
 
 def astype(self, index_ops: T_IndexOps, dtype: Union[str, type, Dtype]) -> 
T_IndexOps:
 dtype, spark_type = pandas_on_spark_type(dtype)
diff --git a/python/pyspark/pandas/tests/data_type_ops/test_num_ops.py 
b/python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
index e4e1eecd..b8b579b 100644
--- a/python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
+++ b/python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
@@ -320,29 +320,55 @@ class NumOpsTest(PandasOnSparkTestCase, TestCasesUtils):
 
 @unittest.skipIf(not extension_dtypes_available, "pandas extension dtypes are 
not available")
 class IntegralExtensionOpsTest(PandasOnSparkTestCase, TestCasesUtils):
-def test_from_to_pandas(self):
-data = [1, 2, 3, None]
+@property
+def intergral_extension_psers(self):
 dtypes = ["Int8", "Int16", "Int32", "Int64"]
-for dtype in dtypes:
-pser = pd.Series(data, dtype=dtype)
-psser = ps.Series(data, dtype=dtype)
+return [pd.Series([1, 2, 3, None], dtype=dtype) for dtype in dtypes]
+
+@property
+def intergral_extension_pssers(self):
+return [ps.from_pandas(pser) for pser in 
self.intergral_extension_psers]
+
+@property
+def intergral_extension_pser_psser_pairs(self):
+return zip(self.intergral_extension_p

[spark] branch master updated: [SPARK-35800][SS] Improving GroupState testability by introducing TestGroupState

2021-06-22 Thread tdas
This is an automated email from the ASF dual-hosted git repository.

tdas pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new dfd7b02  [SPARK-35800][SS] Improving GroupState testability by 
introducing TestGroupState
dfd7b02 is described below

commit dfd7b026dc7c3c38bef9afab82852aff902a25d2
Author: Li Zhang 
AuthorDate: Tue Jun 22 15:04:01 2021 -0400

[SPARK-35800][SS] Improving GroupState testability by introducing 
TestGroupState

### What changes were proposed in this pull request?
Proposed changes in this pull request:

1. Introducing the `TestGroupState` interface, which inherits from 
`GroupState`, so that testing-related getters can be exposed in a controlled 
manner
2. Changing `GroupStateImpl` to inherit from the `TestGroupState` interface 
instead of directly from `GroupState`
3. Implementing the `TestGroupState` object with a `create()` method to forward 
inputs to the private `GroupStateImpl` constructor
4. User input validations have been added into `GroupStateImpl`'s 
`createForStreaming()` method to prevent users from creating invalid GroupState 
objects.
5. Replacing existing `GroupStateImpl` usages in sql-package internal unit 
tests with the newly added `TestGroupState` to show users the best practice for 
`TestGroupState` usage.

With the changes in this PR, the class hierarchy is changed from 
`GroupStateImpl` -> `GroupState` to `GroupStateImpl` -> `TestGroupState` -> 
`GroupState` (-> means inherits from)

### Why are the changes needed?
The internal `GroupStateImpl` implementation for the `GroupState` interface 
has no public constructors accessible outside of the sql pkg. However, the 
user-provided state transition function for `[map|flatMap]GroupsWithState` 
requires a `GroupState` object as the prevState input.

Currently, users are calling the Structured Streaming engine in their unit 
tests in order to instantiate such `GroupState` instances, which makes UTs 
cumbersome.

The proposed `TestGroupState` interface is to give users controlled access 
to the `GroupStateImpl` internal implementation to largely improve testability 
of Structured Streaming state transition functions.

**Usage Example**
```
import org.apache.spark.sql.streaming.TestGroupState

test("Structured Streaming state update function") {
  var prevState = TestGroupState.create[UserStatus](
optionalState = Optional.empty[UserStatus],
timeoutConf = EventTimeTimeout,
batchProcessingTimeMs = 1L,
eventTimeWatermarkMs = Optional.of(1L),
hasTimedOut = false)

  val userId: String = ...
  val actions: Iterator[UserAction] = ...

  assert(!prevState.hasUpdated)

  updateState(userId, actions, prevState)

  assert(prevState.hasUpdated)
}
```

### Does this PR introduce _any_ user-facing change?
Yes, the `TestGroupState` interface and its corresponding `create()` 
factory function in its companion object are introduced in this pull request 
for users to use in unit tests.

### How was this patch tested?
- New unit tests are added
- Existing GroupState unit tests are updated

Closes #32938 from lizhangdatabricks/improve-group-state-testability.

Authored-by: Li Zhang 
Signed-off-by: Tathagata Das 
---
 .../streaming/FlatMapGroupsWithStateExec.scala |   8 +-
 .../sql/execution/streaming/GroupStateImpl.scala   |  36 ++-
 .../state/FlatMapGroupsWithStateExecHelper.scala   |   5 +-
 .../spark/sql/streaming/TestGroupState.scala   | 173 ++
 .../org/apache/spark/sql/JavaDatasetSuite.java |  92 
 .../streaming/FlatMapGroupsWithStateSuite.scala| 255 +++--
 6 files changed, 475 insertions(+), 94 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala
index e626fc1..981586e 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala
@@ -60,8 +60,8 @@ case class FlatMapGroupsWithStateExec(
 child: SparkPlan
   ) extends UnaryExecNode with ObjectProducerExec with StateStoreWriter with 
WatermarkSupport {
 
-  import GroupStateImpl._
   import FlatMapGroupsWithStateExecHelper._
+  import GroupStateImpl._
 
   private val isTimeoutEnabled = timeoutConf != NoTimeout
   private val watermarkPresent = child.output.exists {
@@ -229,13 +229,13 @@ case class FlatMapGroupsWithStateExec(
 
   // When the iterator is consumed, then write changes to state
   def onIteratorCompletion: Unit = {
-if (groupState.hasRemove

[spark] branch master updated (2704658 -> 1c26433)

2021-06-22 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 2704658  [SPARK-35645][PYTHON][DOCS] Merge contents and remove 
obsolete pages in Getting Started section
 add 1c26433  [SPARK-35849][PYTHON] Make `astype` method data-type-based 
for DecimalOps

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/data_type_ops/num_ops.py | 27 +++---
 1 file changed, 16 insertions(+), 11 deletions(-)




[spark] branch master updated: [SPARK-35645][PYTHON][DOCS] Merge contents and remove obsolete pages in Getting Started section

2021-06-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2704658  [SPARK-35645][PYTHON][DOCS] Merge contents and remove 
obsolete pages in Getting Started section
2704658 is described below

commit 27046582e48cd4eed9955fd3c26b29423770976c
Author: Hyukjin Kwon 
AuthorDate: Tue Jun 22 09:36:27 2021 -0700

[SPARK-35645][PYTHON][DOCS] Merge contents and remove obsolete pages in 
Getting Started section

### What changes were proposed in this pull request?

This PR revises the installation page to describe `pip install 
pyspark[pandas_on_spark]` and removes the pandas-on-Spark installation and 
videos/blogposts pages.

### Why are the changes needed?

The pandas-on-Spark installation instructions are merged into the PySpark 
installation page. As for videos/blogposts, the project is now named pandas API 
on Spark, so the old Koalas blogposts and videos are obsolete.

### Does this PR introduce _any_ user-facing change?

To end users, no, because the docs are not released yet.

### How was this patch tested?

I manually built the docs and checked the output

Closes #33018 from HyukjinKwon/SPARK-35645.

Authored-by: Hyukjin Kwon 
Signed-off-by: Dongjoon Hyun 
---
 python/docs/source/getting_started/index.rst   |  10 +-
 python/docs/source/getting_started/install.rst |   3 +
 python/docs/source/getting_started/ps_install.rst  | 145 -
 .../source/getting_started/ps_videos_blogs.rst | 130 --
 4 files changed, 5 insertions(+), 283 deletions(-)

diff --git a/python/docs/source/getting_started/index.rst 
b/python/docs/source/getting_started/index.rst
index ed23f5a..595d682 100644
--- a/python/docs/source/getting_started/index.rst
+++ b/python/docs/source/getting_started/index.rst
@@ -25,18 +25,12 @@ There are more guides shared with other languages such as
 `Quick Start `_ in 
Programming Guides
 at `the Spark documentation 
`_.
 
+.. TODO(SPARK-35588): Merge PySpark quickstart and 10 minutes to pandas API on 
Spark.
+
 .. toctree::
:maxdepth: 2
 
install
quickstart
-
-For pandas API on Spark:
-
-.. toctree::
-   :maxdepth: 2
-
-   ps_install
ps_10mins
-   ps_videos_blogs
 
diff --git a/python/docs/source/getting_started/install.rst 
b/python/docs/source/getting_started/install.rst
index 3d51893..df54310 100644
--- a/python/docs/source/getting_started/install.rst
+++ b/python/docs/source/getting_started/install.rst
@@ -46,7 +46,10 @@ If you want to install extra dependencies for a specific 
component, you can inst
 
 .. code-block:: bash
 
+# Spark SQL
 pip install pyspark[sql]
+# pandas API on Spark
+pip install pyspark[pandas_on_spark]
 
 For PySpark with/without a specific Hadoop version, you can install it by 
using ``PYSPARK_HADOOP_VERSION`` environment variables as below:
 
diff --git a/python/docs/source/getting_started/ps_install.rst 
b/python/docs/source/getting_started/ps_install.rst
deleted file mode 100644
index 974895a..000
--- a/python/docs/source/getting_started/ps_install.rst
+++ /dev/null
@@ -1,145 +0,0 @@
-
-Installation
-
-
-Pandas API on Spark requires PySpark so please make sure your PySpark is 
available.
-
-To install pandas API on Spark, you can use:
-
-- `Conda `__
-- `PyPI `__
-- `Installation from source 
<../development/ps_contributing.rst#environment-setup>`__
-
-To install PySpark, you can use:
-
-- `Installation with the official release channel 
`__
-- `Conda `__
-- `PyPI `__
-- `Installation from source `__
-
-
-Python version support
---
-
-Officially Python 3.5 to 3.8.
-
-.. note::
-   Python 3.5 support is deprecated and will be dropped in the future release.
-   At that point, existing Python 3.5 workflows that use pandas API on Spark 
will continue to work without
-   modification, but Python 3.5 users will no longer get access to the latest 
pandas-on-Spark features
-   and bugfixes. We recommend that you upgrade to Python 3.6 or newer.
-
-Installing pandas API on Spark

-
-Installing with Conda
-~~
-
-First you will need `Conda `__ to be installed.
-After that, we should create a new conda environment. A conda environment is 
similar with a
-virtualenv that allows you to specify a specific version of Python and set of 
libraries.
-Run the following commands from a terminal window::
-
-cond

[spark] branch master updated (bc61b62 -> ce53b71)

2021-06-22 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from bc61b62  [SPARK-35727][SQL] Return INTERVAL DAY from dates subtraction
 add ce53b71  [SPARK-35854][SQL] Improve the error message of 
to_timestamp_ntz with invalid format pattern

No new revisions were added by this update.

Summary of changes:
 .../catalyst/util/DateTimeFormatterHelper.scala|  7 +-
 .../sql/catalyst/util/TimestampFormatter.scala | 29 --
 .../spark/sql/errors/QueryExecutionErrors.scala| 10 ++--
 .../sql-tests/results/ansi/datetime.sql.out| 12 -
 .../sql-tests/results/datetime-legacy.sql.out  | 12 -
 .../resources/sql-tests/results/datetime.sql.out   | 12 -
 6 files changed, 53 insertions(+), 29 deletions(-)




[spark] branch master updated (6c05459 -> bc61b62)

2021-06-22 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 6c05459  [SPARK-35838][BUILD][TESTS] Ensure all modules can be maven 
test independently in Scala 2.13
 add bc61b62  [SPARK-35727][SQL] Return INTERVAL DAY from dates subtraction

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/catalyst/expressions/datetimeExpressions.scala  | 4 ++--
 sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out  | 4 ++--
 sql/core/src/test/resources/sql-tests/results/ansi/interval.sql.out  | 2 ++
 .../src/test/resources/sql-tests/results/datetime-legacy.sql.out | 4 ++--
 sql/core/src/test/resources/sql-tests/results/datetime.sql.out   | 4 ++--
 sql/core/src/test/resources/sql-tests/results/interval.sql.out   | 1 +
 .../sql-tests/results/typeCoercion/native/promoteStrings.sql.out | 2 +-
 .../src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala  | 5 -
 8 files changed, 16 insertions(+), 10 deletions(-)




[spark] branch master updated (5a510cf -> 6c05459)

2021-06-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 5a510cf  [SPARK-35726][SPARK-35769][SQL][FOLLOWUP] Call periodToMonths 
and durationToMicros in HiveResult should add endField
 add 6c05459  [SPARK-35838][BUILD][TESTS] Ensure all modules can be maven 
test independently in Scala 2.13

No new revisions were added by this update.

Summary of changes:
 external/avro/pom.xml   | 11 +++
 external/kafka-0-10-sql/pom.xml | 12 +++-
 sql/hive-thriftserver/pom.xml   | 11 +++
 3 files changed, 33 insertions(+), 1 deletion(-)




[spark] branch branch-3.1 updated (cd68b92 -> 2c20b49)

2021-06-22 Thread kabhwan
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a change to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git.


from cd68b92  [SPARK-35700][SQL] Read char/varchar orc table with created 
and written by external systems
 add 2c20b49  [SPARK-35799][SS] Fix the allUpdatesTimeMs metric measuring 
in FlatMapGroupsWithStateExec

No new revisions were added by this update.

Summary of changes:
 .../streaming/FlatMapGroupsWithStateExec.scala | 35 --
 1 file changed, 19 insertions(+), 16 deletions(-)




[spark] branch master updated (43cd6ca -> 5a510cf)

2021-06-22 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 43cd6ca  [SPARK-35378][SQL][FOLLOWUP] isLocal should consider 
CommandResult
 add 5a510cf  [SPARK-35726][SPARK-35769][SQL][FOLLOWUP] Call periodToMonths 
and durationToMicros in HiveResult should add endField

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/sql/execution/HiveResult.scala| 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)




[spark] branch master updated (d4d11cf -> 43cd6ca)

2021-06-22 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from d4d11cf  [SPARK-35799][SS] Fix the allUpdatesTimeMs metric measuring 
in FlatMapGroupsWithStateExec
 add 43cd6ca  [SPARK-35378][SQL][FOLLOWUP] isLocal should consider 
CommandResult

No new revisions were added by this update.

Summary of changes:
 sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala| 3 ++-
 sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala | 7 +++
 2 files changed, 9 insertions(+), 1 deletion(-)
