This seems like a lot of trouble for a not-so-common use case that has viable alternatives. Once you assume that the class is intended for inheritance (which, arguably, we neither do nor imply at the moment), you're even more restricted than we are right now, given the project policy and the need to keep things synchronized across all languages.
On the Scala side, I would rather expect to see type classes than direct
inheritance, so this might be a dead feature from the start.
As for Python (sorry if I missed something in the preceding discussion),
a quite natural approach would be to wrap the DataFrame instance in your
business class and delegate calls to the wrapped object. A very naive
implementation could look like this:
from functools import wraps

from pyspark.sql import DataFrame


class BusinessModel:
    @classmethod
    def delegate(cls, a):
        def _(*args, **kwargs):
            result = a(*args, **kwargs)
            # re-wrap DataFrame results so chained calls stay in the business class
            if isinstance(result, DataFrame):
                return cls(result)
            else:
                return result
        if callable(a):
            return wraps(a)(_)
        else:
            return a

    def __init__(self, df):
        self._df = df

    def __getattr__(self, name):
        # delegate unknown attributes (select, selectExpr, show, ...) to the wrapped DataFrame
        return BusinessModel.delegate(getattr(self._df, name))

    def with_price(self, price=42):
        return self.selectExpr("*", f"{price} as price")


# `spark` is assumed to be an existing SparkSession, e.g. from the pyspark shell
(BusinessModel(spark.createDataFrame([(1, "DEC")], ("id", "month")))
    .select("id")
    .with_price(0.0)
    .select("price")
    .show())
but it can be easily adjusted to handle more complex use cases,
including inheritance (see the sketch below).
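For instance, a rough, untested sketch of the inheritance case (the
DynamicBusinessModel and PricedBusinessModel names are just placeholders):
resolving the class from the instance in __getattr__ is enough for
delegation to preserve subclasses.

class DynamicBusinessModel(BusinessModel):
    def __getattr__(self, name):
        # type(self) instead of BusinessModel, so subclasses keep their own type
        return type(self).delegate(getattr(self._df, name))


class PricedBusinessModel(DynamicBusinessModel):
    def with_discounted_price(self, rate=0.1):
        # assumes a price column, e.g. added earlier via with_price()
        return self.selectExpr("*", f"price * {1 - rate} as discounted_price")


(PricedBusinessModel(spark.createDataFrame([(1, "DEC")], ("id", "month")))
    .with_price(10.0)
    .with_discounted_price(0.1)
    .select("discounted_price")
    .show())
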
On 12/29/21 12:54, Pablo Alcain wrote:
> Hey everyone! I'm re-sending this e-mail, now with a PR proposal
> (https://github.com/apache/spark/pull/35045, if you want to take a look
> at the code with a couple of examples). The proposed change includes
> only a new class that extends the Python API, without any change to the
> underlying Scala code. The benefit would be that the new code only
> extends previous functionality without breaking any existing
> application code, allowing pyspark users to try it out and see if it
> turns out to be useful. Hyukjin Kwon commented that a drawback with this
> would be that, if we do this, it would be hard to deprecate the
> `DynamicDataFrame` API later. The other option, if we want this
> inheritance to be feasible, is to implement this "casting" directly in
> the `DataFrame` code, so for example it would change from
>
> def limit(self, num: int) -> "DataFrame":
>     jdf = self._jdf.limit(num)
>     return DataFrame(jdf, self.sql_ctx)
>
> to
>
> def limit(self, num: int) -> "DataFrame":
>     jdf = self._jdf.limit(num)
>     return self.__class__(jdf, self.sql_ctx)  # type(self) would work as well
>
> This approach would probably require similar changes on the Scala API
> in order to allow this kind of inheritance on Scala as well
> (unfortunately I'm not knowledgeable enough in Scala to figure out
> what the changes would be exactly).
>
> I wanted to gather your input on this idea, whether you think it can be
> helpful or not, and what would be the best strategy, in your opinion, to
> pursue it.
>
> Thank you very much!
> Pablo
>
> On Thu, Nov 4, 2021 at 9:44 PM Pablo Alcain <[email protected]> wrote:
>
> tl;dr: a proposal for a pyspark "DynamicDataFrame" class that would
> make it easier to inherit from it while keeping the chainable DataFrame
> methods.
>
> Hello everyone. We have been working for a long time with PySpark
> and more specifically with DataFrames. In our pipelines we have
> several tables, with specific purposes, that we usually load as
> DataFrames. As you might expect, there are a handful of queries and
> transformations per dataframe that are done many times, so we
> thought of ways that we could abstract them:
>
> 1. Functions: using functions that take dataframes and return them
> transformed. It had a couple of pitfalls: we had to manage the
> namespaces carefully, and also the "chainability" didn't feel very
> pyspark-y.
> 2. MonkeyPatching DataFrame: we monkeypatched
> (https://stackoverflow.com/questions/5626193/what-is-monkey-patching)
> methods with the regularly done queries inside the DataFrame class.
> This one kept it pyspark-y, but there was no easy way to handle
> segregated namespaces.
> 3. Inheritance: create the class `MyBusinessDataFrame`, inherit
> from `DataFrame` and implement the methods there. This one solves
> all the issues, but with a caveat: the chainable methods cast the
> result explicitly to `DataFrame` (see, e.g.,
> https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1910).
> Therefore, every time you use one of the parent's methods you'd
> have to re-cast to `MyBusinessDataFrame`, making the code cumbersome
> (see the sketch after this list).
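>
> For illustration, a rough sketch of that caveat (assuming a `spark`
> session as in the pyspark shell; the exact `DataFrame` constructor
> arguments depend on the Spark version):
>
> from pyspark.sql import DataFrame, functions as F
>
> class MyBusinessDataFrame(DataFrame):
>     def with_price(self, price=42):
>         return self.withColumn("price", F.lit(price))
>
> df = spark.createDataFrame([(1, "DEC")], ("id", "month"))
> bdf = MyBusinessDataFrame(df._jdf, df.sql_ctx)
> # select() comes back as a plain DataFrame, so the subclass has to be
> # rebuilt by hand before any business method can be chained again:
> plain = bdf.select("id", "month")
> bdf = MyBusinessDataFrame(plain._jdf, plain.sql_ctx)
> bdf.with_price(0.0).show()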
>
> In view of these pitfalls we decided to go for a slightly different
> approach, inspired by #3: we created a class called
> `DynamicDataFrame` that overrides the explicit cast to `DataFrame`
> done in PySpark and instead casts dynamically to
> `self.__class__` (see, e.g.,
> https://gist.github.com/pabloalcain/de79938507ad2d823a866238b3c8a66e#file-dynamic_dataframe_minimal-py-L21).
> This allows the fluent methods to always keep the same class,
> making chainability as smooth as it is with pyspark dataframes.
>
> As an example implementation, here's a link to a gist
> (https://gist.github.com/pabloalcain/de79938507ad2d823a866238b3c8a66e)
> that dynamically implements the `withColumn` and `select` methods and
> shows the expected output.
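>
> Roughly, going by that description (a sketch only, not the gist's actual
> code; constructor arguments depend on the Spark version), the overrides
> look like this:
>
> from pyspark.sql import DataFrame, functions as F
>
> class DynamicDataFrame(DataFrame):
>     def select(self, *cols):
>         # same as DataFrame.select, but wrap the result in the caller's class
>         return self.__class__(super().select(*cols)._jdf, self.sql_ctx)
>
>     def withColumn(self, colName, col):
>         return self.__class__(super().withColumn(colName, col)._jdf, self.sql_ctx)
>
> class MyBusinessDataFrame(DynamicDataFrame):
>     def with_price(self, price=42):
>         return self.withColumn("price", F.lit(price))
>
> df = spark.createDataFrame([(1, "DEC")], ("id", "month"))
> bdf = MyBusinessDataFrame(df._jdf, df.sql_ctx)
> # every intermediate result stays a MyBusinessDataFrame, no re-casting needed
> bdf.select("id").with_price(0.0).select("price").show()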
>
> I'm sharing this here in case you feel like this approach can be
> useful for anyone else. In our case it greatly sped up the
> development of abstraction layers and allowed us to write cleaner
> code. One of the advantages is that it would simply be a "plugin"
> over pyspark that does not modify any existing code or
> application interfaces in any way.
>
> If you think that this can be helpful, I can write a PR as a more
> refined proof of concept.
>
> Thanks!
>
> Pablo
>
--
Best regards,
Maciej Szymkiewicz
Web: https://zero323.net
PGP: A30CEF0C31A501EC
