rdblue commented on issue #24991: [SPARK-28188] Materialize Dataframe API
URL: https://github.com/apache/spark/pull/24991#issuecomment-508290694
 
 
   @rxin, this runs the query up to the point where `materialize` is called. 
The underlying RDD can then pick up from the last shuffle the next time it is 
used. In most cases this works better than caching under dynamic allocation, 
because executors are not kept sitting idle, yet the work can still be resumed 
and shared across queries. We could rename the method if that would be clearer.
   
   @srowen, I've seen this suggested on the dev list a few times and I think it 
is a good idea to add it. There is no guarantee that `count` does the same 
thing -- it could be optimized so the full result is never produced -- and it 
is a little tricky to get this to work with the Dataset API. This version 
creates a new DataFrame from the underlying RDD so that the work is reused 
from the last shuffle, instead of allowing the planner to re-optimize with 
later changes (usually projections) and discard the intermediate result. We 
have found this really useful for better control over the planner, as well as 
for caching data using the shuffle system.
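   To make the mechanism concrete, here is a rough approximation using only 
existing public APIs; the PR presumably does this more efficiently with 
internal rows, and the function name below is only illustrative:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Illustrative only: freeze a DataFrame's plan at its last shuffle by
// wrapping the underlying RDD in a new DataFrame.
def materializeToLastShuffle(spark: SparkSession, df: DataFrame): DataFrame = {
  val rdd = df.rdd                 // pins the physical plan as an RDD lineage
  rdd.foreachPartition(_ => ())    // runs the job now, writing shuffle output

  // Building a new DataFrame from the RDD hides the original logical plan,
  // so later projections are planned on top of it rather than pushed back
  // into the original query.
  spark.createDataFrame(rdd, df.schema)
}
```

   Later actions on the returned DataFrame reuse the shuffle output from the 
stages that have already run, which is where the caching-like behavior comes 
from without pinning executors.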
