felixcheung commented on issue #24991: [SPARK-28188] Materialize Dataframe API
URL: https://github.com/apache/spark/pull/24991#issuecomment-511280363
 
 
   > @rxin, this runs the query up to the point where `materialize` is called. 
The underlying RDD can then pick up from the last shuffle the next time it is 
used. This works better than caching in most cases when using dynamic 
allocation because executors are not sitting idle, but work can be resumed and 
shared across queries. We could rename the method if that would be more clear.
   > 
   > @srowen, I've seen this suggested on the dev list a few times and I think 
it is a good idea to add it. There is no guarantee that `count` does the same 
thing -- it could be optimized away -- and it is a little tricky to get this to 
work 
with the dataset API. This version creates a new DataFrame from the underlying 
RDD so that the work is reused from the last shuffle, instead of allowing the 
planner to re-optimize with later changes (usually projections) and discard the 
intermediate result. We have found this really useful for better control over 
the planner, as well as to cache data using the shuffle system.
   
   I have to agree with this - I've seen `count()` or `cache()` misused too 
many times, and too many times people have had to go back and clean up by 
removing all the calls to `count()`. So much so that I'm planning to write an 
optimizer rule to remove them. I'm only partly kidding.
   
   Maybe this isn't the API for it, and that's ok - let's improve it then and 
make a good suggestion to the community/contributor, etc.
   
   I'm not sure `df.write.format("noop").save` is a good suggestion for a 
general Spark user.
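   For readers following along, here is a rough sketch of the difference under 
discussion. This assumes a running `SparkSession` named `spark`; `materialize` 
is the API proposed in this PR (its name and exact semantics may still change), 
and the `noop` sink is the workaround mentioned above:

```scala
import org.apache.spark.sql.functions.col

// A query whose shuffle output we want to reuse across later queries.
val df = spark.range(0, 1000000L)
  .withColumn("bucket", col("id") % 100)
  .groupBy("bucket")
  .count()

// Workaround today: run the whole plan purely for its side effects.
// The `noop` sink discards the rows; later queries can pick up from
// the last shuffle, but nothing stops the planner from re-optimizing.
df.write.format("noop").mode("overwrite").save()

// What this PR proposes (sketch): run the query up to the last shuffle
// and get back a DataFrame pinned to that shuffle output, so later
// queries reuse it instead of recomputing.
val materialized = df.materialize()
materialized.filter(col("count") > 10).show()
```

   Compared with `cache()`, nothing is pinned in executor memory here, so 
dynamic allocation can release idle executors while the shuffle files remain 
available.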
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services