On 12/29/21 16:18, Pablo Alcain wrote:
> Hey Maciej! Thanks for your answer and the comments :) 
> 
> On Wed, Dec 29, 2021 at 3:06 PM Maciej <mszymkiew...@gmail.com
> <mailto:mszymkiew...@gmail.com>> wrote:
> 
>     This seems like a lot of trouble for a not-so-common use case that has
>     viable alternatives. Once you assume that the class is intended for
>     inheritance (which, arguably, we neither do nor imply at the moment), you're
>     even more restricted than we are right now, given the project
>     policy and the need to keep things synchronized across all languages.
> 
> By "this" you mean the modification of the DataFrame, the implementation
> of a new pyspark class (DynamicDataFrame in this case) or the approach
> in general?

I mean promoting DataFrame as extensible in general. It risks getting us
stuck with a specific API, even more than we are right now, with little
reward at the end.

Additionally:

- As far as I am aware, nothing suggests that this is a widely requested
feature (the corresponding SO questions didn't get much traffic over the
years and I don't think we have any preceding JIRA tickets).
- It can be addressed outside the project (within user codebase or as a
standalone package) with minimal or no overhead.

That being said ‒ if we're going to rewrite the Python DataFrame methods to
return the instance type, I strongly believe that the existing methods
should be marked as final.
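
Roughly something like this (just a sketch; `typing.final` is only enforced
by static checkers such as mypy, not at runtime, and `limit` is only an
example method):

from typing import final

class DataFrame:
    @final  # a type checker would flag subclasses that override this
    def limit(self, num: int) -> "DataFrame":
        ...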

>  
> 
> 
>     On the Scala side, I would rather expect to see type classes than direct
>     inheritance, so this might be a dead feature from the start.
> 
>     As for Python (sorry if I missed something in the preceding discussion),
>     a quite natural approach would be to wrap the DataFrame instance in your
>     business class and delegate calls to the wrapped object. A very naive
>     implementation could look like this:
> 
>     from functools import wraps
> 
>     from pyspark.sql import DataFrame
> 
>     class BusinessModel:
>         @classmethod
>         def delegate(cls, a):
>             def _(*args, **kwargs):
>                 result = a(*args, **kwargs)
>                 # Re-wrap DataFrame results so chained calls stay in cls
>                 if isinstance(result, DataFrame):
>                     return cls(result)
>                 else:
>                     return result
> 
>             if callable(a):
>                 return wraps(a)(_)
>             else:
>                 return a
> 
>         def __init__(self, df):
>             self._df = df
> 
>         def __getattr__(self, name):
>             # Delegate everything else to the wrapped DataFrame
>             return BusinessModel.delegate(getattr(self._df, name))
> 
>         def with_price(self, price=42):
>             return self.selectExpr("*", f"{price} as price")
> 
> 
> 
> Yes, effectively the solution is very similar to this one. I believe
> that the advantage of doing it without hijacking the delegation with the
> decorator is that you can still maintain static typing.

You can maintain type checker compatibility here as well (it is easier
with stubs, but, if I recall correctly, you can do it with inline hints
too).
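
For instance, a rough sketch of what a stub for such a wrapper could look
like (the names follow the BusinessModel example above; this is just one
possible way to keep the checker happy, not a recommendation):

# business_model.pyi: hypothetical stub for the wrapper sketched above
from typing import Any

from pyspark.sql import DataFrame

class BusinessModel:
    def __init__(self, df: DataFrame) -> None: ...
    # Fallback for everything delegated to the wrapped DataFrame.
    def __getattr__(self, name: str) -> Any: ...
    # Methods you care about can be declared with precise return types:
    def selectExpr(self, *expr: str) -> BusinessModel: ...
    def with_price(self, price: float = ...) -> BusinessModel: ...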

> On the other
> hand (and this is probably a minor issue), when following this approach
> with the `isinstance` check for the casting, you might end up casting
> the results of `.summary()` and `.describe()`, which you probably still
> want to keep as "pure" DataFrames. If you see it from this perspective,
> "DynamicDataFrame" would be the boilerplate code that allows you to
> decide more granularly which methods you want to delegate.

You can do it with `__getattr__` as well. There are probably some edge
cases (especially when accessing columns with `.`), but it should still
be manageable.
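
E.g. something along these lines (just a sketch, building on the
BusinessModel above; the set of excluded methods is only an example):

class BusinessModel:
    # Hypothetical tweak to the __getattr__ above: skip re-wrapping for
    # methods whose results should stay plain DataFrames.
    _KEEP_AS_DATAFRAME = {"summary", "describe"}

    ...

    def __getattr__(self, name):
        attr = getattr(self._df, name)
        if name in self._KEEP_AS_DATAFRAME:
            return attr
        return BusinessModel.delegate(attr)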


Just to be clear ‒ I am not insisting that this is somehow a superior
solution (there are things that cannot be done through delegation).

> 
>     (BusinessModel(spark.createDataFrame([(1, "DEC")], ("id", "month")))
>         .select("id")
>         .with_price(0.0)
>         .select("price")
>         .show())
> 
> 
>     but it can be easily adjusted to handle more complex use cases,
>     including inheritance.
> 
> 
> 
>     On 12/29/21 12:54, Pablo Alcain wrote:
>     > Hey everyone! I'm re-sending this e-mail, now with a PR proposal
>     > (https://github.com/apache/spark/pull/35045 if you want to take a look
>     > at the code with a couple of examples). The proposed change includes
>     > only a new class that would extend the Python API without making any
>     > change to the underlying Scala code. The benefit would be that the new
>     > code only extends previous functionality without breaking any existing
>     > application code, allowing pyspark users to try it out and see if it
>     > turns out to be useful. Hyukjin Kwon (https://github.com/HyukjinKwon)
>     > commented that a drawback with this would be that, if we do this, it
>     > would be hard to deprecate the `DynamicDataFrame` API later. The other
>     > option, if we want this inheritance to be feasible, is to implement
>     > this "casting" directly in the `DataFrame` code, so for example it
>     > would change from
>     >
>     > def limit(self, num: int) -> "DataFrame":
>     >     jdf = self._jdf.limit(num)
>     >     return DataFrame(jdf, self.sql_ctx)
>     >
>     > to
>     >
>     > def limit(self, num: int) -> "DataFrame":
>     >     jdf = self._jdf.limit(num)
>     >     return self.__class__(jdf, self.sql_ctx)  # type(self) would work as well
>     >
>     > This approach would probably require similar changes in the Scala
>     > API in order to allow this kind of inheritance in Scala as well
>     > (unfortunately I'm not knowledgeable enough in Scala to figure out
>     > exactly what the changes would be).
>     >
>     > I wanted to gather your input on this idea, whether you think it
>     > can be helpful or not, and what would be the best strategy, in your
>     > opinion, to pursue it.
>     >
>     > Thank you very much!
>     > Pablo
>     >
>     > On Thu, Nov 4, 2021 at 9:44 PM Pablo Alcain
>     > <pablo.alc...@wildlifestudios.com> wrote:
>     >
>     >     tl;dr: a proposal for a pyspark "DynamicDataFrame" class that
>     >     would make it easier to inherit from while keeping dataframe
>     >     methods.
>     >
>     >     Hello everyone. We have been working for a long time with PySpark
>     >     and more specifically with DataFrames. In our pipelines we have
>     >     several tables, with specific purposes, that we usually load as
>     >     DataFrames. As you might expect, there are a handful of queries
>     >     and transformations per dataframe that are done many times, so we
>     >     thought of ways that we could abstract them:
>     >
>     >     1. Functions: using functions that take dataframes and return
>     >     them transformed. It had a couple of pitfalls: we had to manage
>     >     the namespaces carefully, and also the "chainability" didn't feel
>     >     very pyspark-y.
>     >     2. Monkeypatching DataFrame: we monkeypatched
>     >     (https://stackoverflow.com/questions/5626193/what-is-monkey-patching)
>     >     the DataFrame class with methods for the regularly done queries.
>     >     This one kept it pyspark-y, but there was no easy way to handle
>     >     segregated namespaces.
>     >     3. Inheritance: create the class `MyBusinessDataFrame`, inherit
>     >     from `DataFrame` and implement the methods there. This one solves
>     >     all the issues, but with a caveat: the chainable methods cast the
>     >     result explicitly to `DataFrame` (see e.g.
>     >     https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1910).
>     >     Therefore, every time you use one of the parent's methods you'd
>     >     have to re-cast to `MyBusinessDataFrame`, making the code
>     >     cumbersome.
>     >
>     >     In view of these pitfalls we decided to go for a slightly
>     >     different approach, inspired by #3: we created a class called
>     >     `DynamicDataFrame` that overrides the explicit call to `DataFrame`
>     >     as done in PySpark and instead casts dynamically to
>     >     `self.__class__` (see e.g.
>     >     https://gist.github.com/pabloalcain/de79938507ad2d823a866238b3c8a66e#file-dynamic_dataframe_minimal-py-L21).
>     >     This allows the fluent methods to always keep the same class,
>     >     making chainability as smooth as it is with pyspark dataframes.
>     >
>     >     As an example implementation, here's a link to a gist
>     >     (https://gist.github.com/pabloalcain/de79938507ad2d823a866238b3c8a66e)
>     >     that implements the `withColumn` and `select` methods dynamically
>     >     and shows the expected output.
>     >
>     >     I'm sharing this here in case you feel like this approach can be
>     >     useful for anyone else. In our case it greatly sped up the
>     >     development of abstraction layers and allowed us to write cleaner
>     >     code. One of the advantages is that it would simply be a "plugin"
>     >     over pyspark that does not modify any already existing code or
>     >     application interfaces.
>     >
>     >     If you think that this can be helpful, I can write a PR as a more
>     >     refined proof of concept.
>     >
>     >     Thanks!
>     >
>     >     Pablo
>     >
> 
> 
>     -- 
>     Best regards,
>     Maciej Szymkiewicz
> 
>     Web: https://zero323.net <https://zero323.net>
>     PGP: A30CEF0C31A501EC
> 


-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC
