This seems like a lot of trouble for a not-so-common use case that has viable alternatives. Once you assume that the class is intended for inheritance (which, arguably, we neither do nor imply at the moment), you're even more restricted than we are right now, given the project policy and the need to keep things synchronized across all languages.
On the Scala side, I would rather expect to see type classes than direct inheritance, so this might be a dead feature from the start. As for Python (sorry if I missed something in the preceding discussion), a quite natural approach would be to wrap the DataFrame instance in your business class and delegate calls to the wrapped object. A very naive implementation could look like this:

from functools import wraps

from pyspark.sql import DataFrame


class BusinessModel:
    @classmethod
    def delegate(cls, a):
        # Wrap callable attributes so that any DataFrame they return is
        # re-wrapped in the business class, keeping chained calls inside it.
        def _(*args, **kwargs):
            result = a(*args, **kwargs)
            if isinstance(result, DataFrame):
                return cls(result)
            else:
                return result
        if callable(a):
            return wraps(a)(_)
        else:
            return a

    def __init__(self, df):
        self._df = df

    def __getattr__(self, name):
        # Delegate unknown attributes to the wrapped DataFrame
        return BusinessModel.delegate(getattr(self._df, name))

    def with_price(self, price=42):
        return self.selectExpr("*", f"{price} as price")


(BusinessModel(spark.createDataFrame([(1, "DEC")], ("id", "month")))
    .select("id")
    .with_price(0.0)
    .select("price")
    .show())

but it can be easily adjusted to handle more complex use cases, including inheritance.
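For instance, one way to make it inheritance-friendly (a rough, untested sketch; `ExtendedModel` and `with_tax` are just made-up names) is to dispatch the delegation on type(self) instead of the fixed class:

class ExtendedModel(BusinessModel):
    def __getattr__(self, name):
        # type(self) instead of BusinessModel, so delegated calls wrap their
        # DataFrame results in the actual subclass rather than the base class
        return type(self).delegate(getattr(self._df, name))

    def with_tax(self, rate=0.21):
        return self.selectExpr("*", f"price * {rate} as tax")


(ExtendedModel(spark.createDataFrame([(1, "DEC")], ("id", "month")))
    .with_price(10.0)
    .with_tax()
    .select("price", "tax")
    .show())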
On 12/29/21 12:54, Pablo Alcain wrote:

> Hey everyone! I'm re-sending this e-mail, now with a PR proposal
> (https://github.com/apache/spark/pull/35045, if you want to take a look
> at the code with a couple of examples). The proposed change includes
> only a new class that would extend the Python API without any change
> to the underlying Scala code. The benefit would be that the new code
> only extends previous functionality without breaking any existing
> application code, allowing PySpark users to try it out and see if it
> turns out to be useful. Hyukjin Kwon (https://github.com/HyukjinKwon)
> commented that a drawback with this would be that, if we do this, it
> would be hard to deprecate the `DynamicDataFrame` API later. The other
> option, if we want this inheritance to be feasible, is to implement
> this "casting" directly in the `DataFrame` code, so for example it
> would change from
>
>     def limit(self, num: int) -> "DataFrame":
>         jdf = self._jdf.limit(num)
>         return DataFrame(jdf, self.sql_ctx)
>
> to
>
>     def limit(self, num: int) -> "DataFrame":
>         jdf = self._jdf.limit(num)
>         return self.__class__(jdf, self.sql_ctx)  # type(self) would work as well
>
> This approach would probably need similar changes on the Scala API in
> order to allow this kind of inheritance in Scala as well (unfortunately
> I'm not knowledgeable enough in Scala to figure out what the changes
> would be exactly).
>
> I wanted to gather your input on this idea, whether you think it can be
> helpful or not, and what would be the best strategy, in your opinion, to
> pursue it.
>
> Thank you very much!
> Pablo
>
> On Thu, Nov 4, 2021 at 9:44 PM Pablo Alcain
> <pablo.alc...@wildlifestudios.com> wrote:
>
> tl;dr: a proposal for a pyspark "DynamicDataFrame" class that would
> make it easier to inherit from it while keeping dataframe methods.
>
> Hello everyone. We have been working for a long time with PySpark
> and more specifically with DataFrames. In our pipelines we have
> several tables, with specific purposes, that we usually load as
> DataFrames. As you might expect, there are a handful of queries and
> transformations per dataframe that are done many times, so we
> thought of ways that we could abstract them:
>
> 1. Functions: using functions that take dataframes and return them
> transformed. It had a couple of pitfalls: we had to manage the
> namespaces carefully, and also the "chainability" didn't feel very
> pyspark-y.
> 2. Monkeypatching DataFrame: we monkeypatched
> (https://stackoverflow.com/questions/5626193/what-is-monkey-patching)
> methods with the regularly done queries inside the DataFrame class.
> This one kept it pyspark-y, but there was no easy way to handle
> segregated namespaces.
> 3. Inheritance: create the class `MyBusinessDataFrame`, inherit
> from `DataFrame` and implement the methods there. This one solves
> all the issues, but with a caveat: the chainable methods cast the
> result explicitly to `DataFrame` (see e.g.
> https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1910).
> Therefore, every time you use one of the parent's methods you'd
> have to re-cast to `MyBusinessDataFrame`, making the code cumbersome.
>
> In view of these pitfalls we decided to go for a slightly different
> approach, inspired by #3: we created a class called
> `DynamicDataFrame` that overrides the explicit cast to `DataFrame`
> done in PySpark and instead casts dynamically to `self.__class__`
> (see e.g.
> https://gist.github.com/pabloalcain/de79938507ad2d823a866238b3c8a66e#file-dynamic_dataframe_minimal-py-L21).
> This allows the fluent methods to always keep the same class,
> making chainability as smooth as it is with pyspark dataframes.
>
> As an example implementation, here's a link to a gist
> (https://gist.github.com/pabloalcain/de79938507ad2d823a866238b3c8a66e)
> that implements the `withColumn` and `select` methods dynamically,
> along with the expected output.
>
> I'm sharing this here in case you feel like this approach can be
> useful for anyone else. In our case it greatly sped up the
> development of abstraction layers and allowed us to write cleaner
> code. One of the advantages is that it would simply be a "plugin"
> over pyspark that does not modify in any way already existing code
> or application interfaces.
>
> If you think that this can be helpful, I can write a PR as a more
> refined proof of concept.
>
> Thanks!
>
> Pablo
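For reference, the pattern described above (overriding chainable methods to re-wrap results in `self.__class__`, as the linked gist does for `withColumn` and `select`) amounts to something like the following rough sketch; it is not the actual gist or PR code, and the class names below are only illustrative:

from pyspark.sql import DataFrame
from pyspark.sql.functions import lit


class DynamicDataFrame(DataFrame):
    def __init__(self, df):
        # Wrap an existing DataFrame, reusing its underlying JVM object
        super().__init__(df._jdf, df.sql_ctx)

    def select(self, *cols):
        # Re-wrap the result in the calling subclass instead of DataFrame
        return self.__class__(super().select(*cols))

    def withColumn(self, colName, col):
        return self.__class__(super().withColumn(colName, col))


class MyBusinessDataFrame(DynamicDataFrame):
    def with_price(self, price=42):
        return self.select("*", lit(price).alias("price"))


(MyBusinessDataFrame(spark.createDataFrame([(1, "DEC")], ("id", "month")))
    .select("id")
    .with_price(0.0)
    .select("price")
    .show())

Note that every chainable method still has to be overridden by hand here, which is exactly the boilerplate the proposal wants to centralize, either in a `DynamicDataFrame` base class or in `DataFrame` itself.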
--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC