nchammas opened a new pull request, #45363: URL: https://github.com/apache/spark/pull/45363
### What changes were proposed in this pull request?

Clarify that, if explicit pivot values are not provided, Spark will eagerly compute them.

### Why are the changes needed?

The current wording on `master` is misleading. To say that one version of pivot is more or less "efficient" than the other glosses over the fact that one is lazy and the other is not. Spark users are trained from early on that transformations are generally lazy; exceptions to this rule should be more clearly highlighted.

I experienced this personally when I called pivot on a DataFrame without providing explicit values, and Spark took around 20 minutes to compute the distinct pivot values. Looking at the docs, I felt that "less efficient" didn't accurately represent this behavior.

### Does this PR introduce _any_ user-facing change?

Yes, updated user docs.

### How was this patch tested?

I built and reviewed the docs locally.

<img width="300" src="https://github.com/apache/spark/assets/1039369/0079e6ea-6a5a-4a00-a1ad-45a08c07f716" />
<img width="400" src="https://github.com/apache/spark/assets/1039369/8a4873a0-e80f-408c-aa38-eac5fa51611c" />

### Was this patch authored or co-authored using generative AI tooling?

No.
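
### Illustration

For readers unfamiliar with the two forms of `pivot` discussed above, here is a minimal PySpark sketch of the difference. The column names (`year`, `quarter`, `amount`), the helper function names, and the sample rows are hypothetical and chosen only for illustration; they do not come from this PR or from the Spark docs.

```python
def pivot_with_explicit_values(df, values):
    """Pivot with the values supplied up front.

    Spark already knows the output columns, so this remains a lazy
    transformation until an action is called on the result.
    """
    return df.groupBy("year").pivot("quarter", values).sum("amount")


def pivot_with_inferred_values(df):
    """Pivot without supplying values.

    As described in this PR, Spark eagerly computes the distinct values
    of the pivot column at this point, which can be slow on large data.
    """
    return df.groupBy("year").pivot("quarter").sum("amount")


def main():
    # Imported here so the helpers above can be read without a Spark install.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]").appName("pivot-sketch").getOrCreate()
    )
    # Hypothetical sample data.
    df = spark.createDataFrame(
        [(2023, "Q1", 10), (2023, "Q2", 20), (2024, "Q1", 30)],
        ["year", "quarter", "amount"],
    )

    lazy = pivot_with_explicit_values(df, ["Q1", "Q2"])  # nothing is computed yet
    eager = pivot_with_inferred_values(df)  # distinct quarters are computed here

    lazy.show()
    eager.show()
    spark.stop()
```

Calling `main()` runs the demo against a local Spark session. On a small DataFrame like this one the eager step is barely noticeable; the 20-minute delay described above would only appear when computing the distinct pivot values is itself expensive.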
