[PR] [SPARK-47252][DOCS] Clarify that pivot may trigger an eager computation [spark]

via GitHub Sat, 02 Mar 2024 14:48:22 -0800


nchammas opened a new pull request, #45363:
URL: https://github.com/apache/spark/pull/45363

### What changes were proposed in this pull request?

Clarify in the docs that, if explicit pivot values are not provided, Spark
eagerly computes the distinct values of the pivot column.

### Why are the changes needed?

The current wording on `master` is misleading. To say that one version of
pivot is more or less "efficient" than the other glosses over the fact that one
is lazy and the other is not. Spark users are trained from early on that
transformations are generally lazy; exceptions to this rule should be more
clearly highlighted.

I experienced this personally when I called pivot on a DataFrame without
providing explicit values, and Spark took around 20 minutes to compute the
distinct pivot values. Looking at the docs, I felt that "less efficient" didn't
accurately represent this behavior.
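
To illustrate why the missing values force an eager step, here is a toy model in plain Python. It is not Spark's implementation: `CountingSource` and `plan_pivot` are hypothetical names made up for this sketch, and the sample rows are invented. It only shows that the output columns of a pivot depend on the distinct pivot values, so without an explicit list the data has to be scanned before anything else can be planned.

```python
class CountingSource:
    """Hypothetical data source that records how often it is scanned."""

    def __init__(self, rows):
        self._rows = rows
        self.scans = 0

    def scan(self):
        self.scans += 1
        return list(self._rows)


def plan_pivot(source, pivot_key, values=None):
    """Return the output column names a pivot would produce.

    With explicit values the columns are known up front and the data is
    not touched. Without them, the source is scanned immediately to find
    the distinct values of the pivot column.
    """
    if values is None:
        # Eager step: a full pass over the data just to learn the columns.
        # Sorting is only here to make the toy output deterministic.
        values = sorted({row[pivot_key] for row in source.scan()})
    return list(values)


# Invented sample data for the sketch.
rows = [
    {"year": 2012, "course": "dotNET", "earnings": 10000},
    {"year": 2012, "course": "Java", "earnings": 20000},
    {"year": 2013, "course": "dotNET", "earnings": 48000},
]

explicit = CountingSource(rows)
cols_explicit = plan_pivot(explicit, "course", values=["dotNET", "Java"])

inferred = CountingSource(rows)
cols_inferred = plan_pivot(inferred, "course")

print(explicit.scans, cols_explicit)
print(inferred.scans, cols_inferred)
```

In this sketch the explicit call never scans the source, while the inferred call scans it once at planning time. On a large DataFrame that extra pass is the kind of cost described above.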

### Does this PR introduce _any_ user-facing change?

Yes, updated user docs.

### How was this patch tested?

I built and reviewed the docs locally.

### Was this patch authored or co-authored using generative AI tooling?

No.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
