That all sounds reasonable, but for case 4, and maybe also case 3, I would rather see the method implemented so that it raises an error whose message explains what is going on and suggests the explicit operation that does the closest equivalent thing. We could also raise a warning (using the warnings module) for operations that might be unintuitively expensive.

On Fri, Oct 26, 2018 at 12:15 Holden Karau <hol...@pigscanfly.ca> wrote:
> Coming out of https://github.com/apache/spark/pull/21654 it was agreed
> the helper methods in question made sense, but there was some desire for a
> plan as to which helper methods we should use.
>
> I'd like to propose a lightweight solution to start with for helper
> methods that match either Pandas or general Python collection helper
> methods:
> 1) If the helper method doesn't collect the DataFrame back or force
> evaluation to the driver, then we should add it without discussion
> 2) If the method forces evaluation and this matches the most obvious way it
> would be implemented, then we should add it with a note in the docstring
> 3) If the method does collect the DataFrame back to the driver and that is
> the most obvious way it would be implemented (e.g. calling list to get back a
> list would have to collect the DataFrame), then we should add it with a
> warning in the docstring
> 4) If the method collects the DataFrame but a reasonable Python developer
> wouldn't expect that behaviour, not implementing the helper method would be
> better
>
> What do folks think?
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> --

--
Cheers, Leif
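
P.S. A rough sketch of the error-plus-warning idea from the top of this reply, in case it helps the discussion. It is only an illustration: SketchFrame, its to_list() helper, and the message wording are all hypothetical stand-ins, not PySpark's DataFrame or anything from the pull request. It uses a plain in-memory list so it runs without Spark.

```python
import warnings


class SketchFrame:
    """Hypothetical stand-in for a distributed DataFrame (not PySpark's class)."""

    def __init__(self, rows):
        self._rows = list(rows)

    def collect(self):
        # The explicit operation: the caller knowingly pulls every row back.
        return list(self._rows)

    def __iter__(self):
        # Case 4 style: refuse the implicit form, explain why, and point at
        # the explicit operation that does the closest equivalent thing.
        raise TypeError(
            "Iterating over a SketchFrame would collect every row to the "
            "driver. Call collect() explicitly if that is what you want."
        )

    def to_list(self):
        # Case 3 style: allow the helper, but warn because the cost may be
        # unintuitive. stacklevel=2 attributes the warning to the caller.
        warnings.warn(
            "to_list() collects the whole SketchFrame to the driver.",
            UserWarning,
            stacklevel=2,
        )
        return self.collect()
```

With something like this, list(frame) fails fast with a message that names collect(), while frame.to_list() still works but shows up in the user's warnings output. Whether TypeError and UserWarning are the right classes is an open question.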