This seems like a lot of trouble for a not-so-common use case that has
viable alternatives. Once you assume that the class is intended for
inheritance (which, arguably, we neither do nor imply at the moment), you're
even more restricted than we are right now, given the project policy and
the need to keep things synchronized across all languages.

On the Scala side, I would rather expect to see type classes than direct
inheritance, so this might be a dead feature from the start.

As for Python (sorry if I missed something in the preceding discussion), a
quite natural approach would be to wrap the DataFrame instance in your
business class and delegate calls to the wrapped object. A very naive
implementation could look like this:

from functools import wraps

from pyspark.sql import DataFrame


class BusinessModel:
    @classmethod
    def delegate(cls, attr):
        # Wrap a delegated attribute so that any DataFrame result is
        # re-wrapped in the business class.
        def _(*args, **kwargs):
            result = attr(*args, **kwargs)
            if isinstance(result, DataFrame):
                return cls(result)
            else:
                return result

        if callable(attr):
            return wraps(attr)(_)
        else:
            return attr

    def __init__(self, df):
        self._df = df

    def __getattr__(self, name):
        # Invoked only for attributes not defined on BusinessModel itself,
        # so regular methods like with_price are not delegated.
        return BusinessModel.delegate(getattr(self._df, name))

    def with_price(self, price=42):
        return self.selectExpr("*", f"{price} as price")


(BusinessModel(spark.createDataFrame([(1, "DEC")], ("id", "month")))
    .select("id")
    .with_price(0.0)
    .select("price")
    .show())


but it can be easily adjusted to handle more complex use cases, including
inheritance (see the sketch below).
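
For example, here is a minimal sketch of how inheritance could look on top
of the wrapper above, assuming an active `spark` session as in the example
above and assuming the delegation goes through type(self) instead of the
hard-coded class; the DynamicBusinessModel and SalesModel names and the
with_discount method are purely illustrative, not part of any proposal:

from pyspark.sql import functions as F


class DynamicBusinessModel(BusinessModel):
    def __getattr__(self, name):
        # type(self) (rather than a hard-coded class) makes delegated calls
        # re-wrap their results in the actual subclass.
        return type(self).delegate(getattr(self._df, name))


class SalesModel(DynamicBusinessModel):
    def with_discount(self, rate=0.1):
        # Purely illustrative domain-specific method.
        return self.withColumn("discounted", F.col("price") * (1 - rate))


(SalesModel(spark.createDataFrame([(1, "DEC", 42.0)], ("id", "month", "price")))
    .filter("id = 1")      # delegated to DataFrame, still returns a SalesModel
    .with_discount(0.25)
    .select("id", "discounted")
    .show())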



On 12/29/21 12:54, Pablo Alcain wrote:
> Hey everyone! I'm re-sending this e-mail, now with a PR proposal
> (https://github.com/apache/spark/pull/35045 if you want to take a look
> at the code with a couple of examples). The proposed change includes
> only a new class that would extend the Python API without any change
> to the underlying Scala code. The benefit would be that the new
> code only extends previous functionality without breaking any existing
> application code, allowing pyspark users to try it out and see if it
> turns out to be useful. Hyukjin Kwon
> <https://github.com/HyukjinKwon> commented that a drawback with this
> would be that, if we do this, it would be hard to deprecate the
> `DynamicDataFrame` API later. The other option, if we want this inheritance
> to be feasible, is to implement this "casting" directly in the
> `DataFrame` code, so, for example, `limit` would change from
> 
> def limit(self, num: int) -> "DataFrame":
>     jdf = self._jdf.limit(num)
>     return DataFrame(jdf, self.sql_ctx)
> 
> to
> 
> def limit(self, num: int) -> "DataFrame":
>     jdf = self._jdf.limit(num)
>     return self.__class__(jdf, self.sql_ctx) # type(self) would work as well
> 
> This approach would probably require similar changes on the Scala API
> in order to allow this kind of inheritance in Scala as well
> (unfortunately I'm not knowledgeable enough in Scala to figure out
> exactly what the changes would be).
> 
> I wanted to gather your input on this idea, whether you think it can be
> helpful or not, and what would be the best strategy, in your opinion, to
> pursue it.
> 
> Thank you very much!
> Pablo
> 
> On Thu, Nov 4, 2021 at 9:44 PM Pablo Alcain
> <pablo.alc...@wildlifestudios.com> wrote:
> 
>     tl;dr: a proposal for a pyspark "DynamicDataFrame" class that would
>     make it easier to inherit from it while keeping dataframe methods.
> 
>     Hello everyone. We have been working for a long time with PySpark
>     and more specifically with DataFrames. In our pipelines we have
>     several tables, with specific purposes, that we usually load as
>     DataFrames. As you might expect, there are a handful of queries and
>     transformations per dataframe that are done many times, so we
>     thought of ways that we could abstract them:
> 
>     1. Functions: using functions that take dataframes and return them
>     transformed. This approach had a couple of pitfalls: we had to manage the
>     namespaces carefully, and also the "chainability" didn't feel very
>     pyspark-y.
>     2. MonkeyPatching DataFrame: we monkeypatched
>     (https://stackoverflow.com/questions/5626193/what-is-monkey-patching)
>     methods with the regularly done queries inside the DataFrame class.
>     This one kept it pyspark-y, but there was no easy way to handle
>     segregated namespaces.
>     3. Inheritance: create the class `MyBusinessDataFrame`, inherit
>     from `DataFrame` and implement the methods there. This one solves
>     all the issues, but with a caveat: the chainable methods cast the
>     result explicitly to `DataFrame` (see
> https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1910
>     e.g.). Therefore, every time you use one of the parent's methods you'd
>     have to re-cast to `MyBusinessDataFrame`, making the code cumbersome.
> 
>     In view of these pitfalls we decided to go for a slightly different
>     approach, inspired by #3: We created a class called
>     `DynamicDataFrame` that overrides the explicit call to `DataFrame`
>     as done in PySpark, but instead cast dynamically to
>     `self.__class__` (see
> https://gist.github.com/pabloalcain/de79938507ad2d823a866238b3c8a66e#file-dynamic_dataframe_minimal-py-L21
>     e.g.). This allows the fluent methods to always keep the same class,
>     making chainability as smooth as it is with PySpark dataframes.
> 
>     As an example implementation, here's a link to a gist
>     (https://gist.github.com/pabloalcain/de79938507ad2d823a866238b3c8a66e)
>     that implements `withColumn` and `select` dynamically and shows
>     the expected output.
> 
>     I'm sharing this here in case you feel like this approach can be
>     useful for anyone else. In our case it greatly sped up the
>     development of abstraction layers and allowed us to write cleaner
>     code. One of the advantages is that it would simply be a "plugin"
>     over pyspark that does not modify already existing code or
>     application interfaces in any way.
> 
>     If you think that this can be helpful, I can write a PR as a more
>     refined proof of concept.
> 
>     Thanks!
> 
>     Pablo
> 


-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC
