nchammas opened a new pull request, #45363:
URL: https://github.com/apache/spark/pull/45363

   ### What changes were proposed in this pull request?
   
   Clarify that, if explicit pivot values are not provided, Spark will eagerly 
compute them.
   
   ### Why are the changes needed?
   
   The current wording on `master` is misleading. To say that one version of 
pivot is more or less "efficient" than the other glosses over the fact that one 
is lazy and the other is not. Spark users are trained from early on that 
transformations are generally lazy; exceptions to this rule should be more 
clearly highlighted.
   
   I experienced this personally when I called pivot on a DataFrame without 
providing explicit values, and Spark took around 20 minutes to compute the 
distinct pivot values. Looking at the docs, I felt that "less efficient" didn't 
accurately represent this behavior.
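
To make the lazy versus eager distinction concrete, here is a minimal PySpark sketch. The helper `pivot_sum`, the function `demo`, and the sample column names and data are hypothetical and do not come from the Spark docs or this patch. Only `groupBy`, `pivot`, and `sum` are actual DataFrame APIs.

```python
def pivot_sum(df, group_col, pivot_col, value_col, values=None):
    """Group, pivot, and sum. Hypothetical helper, not a Spark API.

    With values=None, Spark has to run a job immediately to find the
    distinct values of pivot_col. With an explicit list, the call only
    builds a plan and stays lazy like other transformations.
    """
    grouped = df.groupBy(group_col)
    if values is None:
        # Eager: distinct pivot values are computed at this call.
        pivoted = grouped.pivot(pivot_col)
    else:
        # Lazy: no job runs until an action is called on the result.
        pivoted = grouped.pivot(pivot_col, list(values))
    return pivoted.sum(value_col)


def demo():
    # Imported here so the helper above can be used without starting Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("pivot-demo").getOrCreate()
    # Made-up sample data for illustration only.
    df = spark.createDataFrame(
        [("2023", "java", 10), ("2023", "python", 20), ("2024", "java", 30)],
        ["year", "course", "earnings"],
    )
    # Triggers a job right away to collect the distinct values of "course".
    eager = pivot_sum(df, "year", "course", "earnings")
    # Builds the plan only; work happens at show().
    lazy = pivot_sum(df, "year", "course", "earnings", values=["java", "python"])
    lazy.show()
    eager.show()
    spark.stop()
```

Calling `demo()` on a machine with PySpark installed runs both variants. On a small DataFrame the difference is invisible, but on a large one the first call is where the wait described above would occur.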
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, updated user docs.
   
   ### How was this patch tested?
   
   I built and reviewed the docs locally.
   
   <img width="300" 
src="https://github.com/apache/spark/assets/1039369/0079e6ea-6a5a-4a00-a1ad-45a08c07f716"
 />
   <img width="400" 
src="https://github.com/apache/spark/assets/1039369/8a4873a0-e80f-408c-aa38-eac5fa51611c"
 />
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

