Helper methods for PySpark discussion

Holden Karau Fri, 26 Oct 2018 09:16:03 -0700

Coming out of https://github.com/apache/spark/pull/21654 it was agreed the
helper methods in question made sense but there was some desire for a plan
as to which helper methods we should use.


I'd like to propose a lightweight solution to start with for helper
methods that match either Pandas or general Python collection helper
methods:
1) If the helper method doesn't collect the DataFrame back to the driver
or force evaluation then we should add it without discussion
2) If the method forces evaluation and that matches the most obvious way
it would be implemented then we should add it with a note in the docstring
3) If the method does collect the DataFrame back to the driver and that is
the most obvious way it would be implemented (e.g. calling list to get back
a list would have to collect the DataFrame) then we should add it with a
warning in the docstring
4) If the method collects the DataFrame but a reasonable Python developer
wouldn't expect that behaviour, not implementing the helper method would be
better
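
To make cases 2 and 3 concrete, here is a rough sketch of how the docstring
note and warning could be attached. Everything in it is hypothetical and
only for illustration: the decorators forces_evaluation and
collects_to_driver, the ToyFrame class, and its is_empty and to_list
methods are made-up names, and ToyFrame is a plain Python stand-in, not the
PySpark DataFrame.

```python
def _append_to_doc(directive, text):
    """Build a decorator that appends a Sphinx-style directive to a docstring."""
    def decorator(func):
        base = (func.__doc__ or "").rstrip()
        func.__doc__ = "{}\n\n.. {}:: {}".format(base, directive, text)
        return func
    return decorator


# Hypothetical markers for cases 2 and 3 of the proposal.
forces_evaluation = _append_to_doc(
    "note", "This method forces evaluation of the DataFrame.")
collects_to_driver = _append_to_doc(
    "warning", "This method collects the whole DataFrame to the driver.")


class ToyFrame:
    """Plain Python stand-in for a DataFrame (assumption, not PySpark)."""

    def __init__(self, rows):
        self._rows = list(rows)

    @forces_evaluation
    def is_empty(self):
        """Return True if the frame has no rows."""
        return len(self._rows) == 0

    @collects_to_driver
    def to_list(self):
        """Return all rows as a Python list."""
        return list(self._rows)


# The generated docstring carries the warning for readers of help().
print(ToyFrame.to_list.__doc__)
```

The only point of the sketch is that the note or warning lives in the
docstring, so it shows up in help() and in generated docs. Whether to use a
decorator or just write the text by hand is a separate choice.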

What do folks think?
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
