This is an automated email from the ASF dual-hosted git repository.
wenchen pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.3 by this push:
new 058dcbf3fb0 [SPARK-43240][SQL][3.3] Fix the wrong result issue when
calling df.describe() method
058dcbf3fb0 is described below
commit 058dcbf3fb0b17a4295f6e0b516f5c955cfa2d59
Author: Jia Ke <[email protected]>
AuthorDate: Wed Apr 26 17:24:46 2023 +0800
[SPARK-43240][SQL][3.3] Fix the wrong result issue when calling
df.describe() method
### What changes were proposed in this pull request?
The df.describe() method caches the underlying RDD. If the cached RDD is an RDD[UnsafeRow], the rows may be backed by memory that is released or reused after each row is consumed, so the collected result can be wrong. We need to copy each row before caching, as the [TakeOrderedAndProjectExec](https://github.com/apache/spark/blob/d68d46c9e2cec04541e2457f4778117b570d8cdb/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala#L204) operator does.
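To illustrate the hazard outside of Spark, here is a minimal, hypothetical Scala sketch (the `MutableRow` and `ReuseDemo` names are invented for illustration) of an iterator that reuses a single mutable buffer per element, analogous to how an UnsafeRow can point into reused memory. Collecting the raw references yields wrong results; copying each element first, as this patch's `.map(_.copy())` does, preserves them:

```scala
// A tiny stand-in for a row backed by mutable, reusable storage.
final class MutableRow(var value: Int) {
  def copy(): MutableRow = new MutableRow(value)
}

object ReuseDemo {
  // Yields the SAME MutableRow instance for every element, mutated in place,
  // mimicking an operator that recycles its output row buffer.
  def rows(values: Seq[Int]): Iterator[MutableRow] = {
    val shared = new MutableRow(0)
    values.iterator.map { v => shared.value = v; shared }
  }

  def main(args: Array[String]): Unit = {
    // Without copy(): every collected reference is the shared buffer, so all
    // entries show the last value written (3,3,3).
    val broken = rows(Seq(1, 2, 3)).toArray.map(_.value)
    // With copy() before materializing, each element is preserved (1,2,3),
    // which is what .map(_.copy()) before collect() achieves in the fix.
    val fixed = rows(Seq(1, 2, 3)).map(_.copy()).toArray.map(_.value)
    println(broken.mkString(","))
    println(fixed.mkString(","))
  }
}
```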
### Why are the changes needed?
bug fix
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Closes #40914 from JkSelf/describe.
Authored-by: Jia Ke <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
---
.../main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
index 9155c1cb6e7..ff6c08cea00 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
@@ -288,7 +288,7 @@ object StatFunctions extends Logging {
}
     // If there is no selected columns, we don't need to run this aggregate, so make it a lazy val.
-    lazy val aggResult = ds.select(aggExprs: _*).queryExecution.toRdd.collect().head
+    lazy val aggResult = ds.select(aggExprs: _*).queryExecution.toRdd.map(_.copy()).collect().head
// We will have one row for each selected statistic in the result.
val result = Array.fill[InternalRow](selectedStatistics.length) {
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]