[PR] [SPARK-45386][SQL]: Fix correctness issue with persist using StorageLevel.NONE on Dataset for branch 3.5 [spark]

via GitHub Wed, 04 Oct 2023 03:33:26 -0700


eejbyfeldt opened a new pull request, #43213:
URL: https://github.com/apache/spark/pull/43213

### What changes were proposed in this pull request?
Support for `InMemoryTableScanExec` in AQE was added in #39624, but that
patch introduced a bug when a Dataset is persisted using `StorageLevel.NONE`.
Before that patch, a query like:
```
import org.apache.spark.storage.StorageLevel
import spark.implicits._ // Int encoder for createDataset; already in scope in spark-shell
spark.createDataset(Seq(1, 2)).persist(StorageLevel.NONE).count()
```
would correctly return 2. After that patch, it incorrectly returns 0.
This is because AQE incorrectly concludes that the input is empty, based on
the runtime statistics that are collected here:

https://github.com/apache/spark/blob/eac5a8c7e6da94bb27e926fc9a681aed6582f7d3/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala#L294
The problem is that the action that should make sure the statistics are
collected, here:

https://github.com/apache/spark/blob/eac5a8c7e6da94bb27e926fc9a681aed6582f7d3/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L285-L291
never uses the iterator. With `StorageLevel.NONE`, persisting does not use
the iterator either, so the correct statistics are never gathered.
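
To illustrate the mechanism, here is a minimal, self-contained sketch in plain
Scala. It is not Spark's code: `CountingIterator` and `rowCount` are
hypothetical names standing in for statistics that are only updated as rows
are pulled through an iterator. If nothing consumes the iterator, the count
stays at 0 even though the input has rows.

```scala
// Hypothetical stand-in for statistics gathered while rows are iterated.
// This is an illustration only, not the InMemoryRelation implementation.
final class CountingIterator[A](underlying: Iterator[A]) extends Iterator[A] {
  private var seen = 0L

  // Number of rows observed so far; only grows when next() is called.
  def rowCount: Long = seen

  override def hasNext: Boolean = underlying.hasNext

  override def next(): A = {
    val row = underlying.next()
    seen += 1
    row
  }
}

object LazyStatsSketch {
  def main(args: Array[String]): Unit = {
    // Nothing pulls from this iterator, so the "statistics" claim 0 rows.
    val untouched = new CountingIterator(Iterator(1, 2))
    println(s"untouched rowCount = ${untouched.rowCount}") // 0

    // Draining the iterator is what makes the count correct.
    val drained = new CountingIterator(Iterator(1, 2))
    drained.foreach(_ => ())
    println(s"drained rowCount = ${drained.rowCount}") // 2
  }
}
```

A consumer that trusts `rowCount` without draining the iterator would treat
the input as empty, which is analogous to the wrong result described above.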

The fix proposed in this patch just makes calling persist with
`StorageLevel.NONE` a no-op. Changing the action so that it always "emptied"
the iterator would also work, but that seems like unnecessary work in a lot
of normal circumstances.
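
The shape of the no-op approach can be sketched as follows. This is a
simplified illustration under stated assumptions, not the actual diff:
`TinyCacheManager`, `Level`, `NoneLevel`, `MemoryOnly`, and the string plan
ids are all hypothetical stand-ins. The point is only that a storage level
which stores nothing never registers the plan as cached, so later queries
never go through the cached-scan path.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-ins; these are not Spark's classes.
sealed trait Level
case object NoneLevel extends Level
case object MemoryOnly extends Level

final class TinyCacheManager {
  private val cached = mutable.Map.empty[String, Level]

  // Register a plan for caching, unless the level would store nothing.
  def cacheQuery(planId: String, level: Level): Unit = level match {
    case NoneLevel => () // no-op: nothing would be stored, so do not register
    case other     => cached.update(planId, other)
  }

  def isCached(planId: String): Boolean = cached.contains(planId)
}
```

With this shape, `cacheQuery("q", NoneLevel)` leaves `isCached("q")` false,
while any other level registers the plan as before.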

### Why are the changes needed?
The current code has a correctness issue.

### Does this PR introduce _any_ user-facing change?
Yes, fixes the correctness issue.

### How was this patch tested?
New and existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
