pabloalcain opened a new pull request #35045:
URL: https://github.com/apache/spark/pull/35045
This is still missing typing support and a test suite for the API, but it is
a functional implementation that showcases the examples written down in the PR
and the JIRA ticket.
### What changes were proposed in this pull request?
It adds a new class, `DynamicDataFrame`, that lets users subclass
`DataFrame` without losing method chainability. The PR also includes the script
that autogenerates the `DynamicDataFrame` class code, in case any reviewer
wants to remove one of the methods inherited by default.
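The core idea can be sketched without pyspark: wrap each chainable method of a base class so that results returned as the plain base type are re-wrapped in the caller's subclass. The names below (`Base`, `DynamicBase`, `_rewrap`) are illustrative stand-ins, not the PR's actual implementation:

```python
# Pyspark-free sketch of the re-wrapping idea behind DynamicDataFrame.
# All names here are illustrative assumptions, not the PR's real code.
import functools


class Base:
    """Stand-in for DataFrame, with one chainable method."""
    def __init__(self, data):
        self.data = data

    def double(self):
        # Returns a plain Base, just like DataFrame methods return DataFrame
        return Base([x * 2 for x in self.data])


def _rewrap(method):
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        result = method(self, *args, **kwargs)
        # Re-wrap plain Base results in the caller's own subclass
        if isinstance(result, Base) and not isinstance(result, type(self)):
            result = type(self)(result.data)
        return result
    return wrapper


class DynamicBase(Base):
    pass


# Wrap the chainable methods once, at class-definition time
for name in ["double"]:
    setattr(DynamicBase, name, _rewrap(getattr(Base, name)))


class MyData(DynamicBase):
    def add_one(self):
        return type(self)([x + 1 for x in self.data])


d = MyData([1, 2]).double().add_one()
print(type(d).__name__, d.data)  # MyData [3, 5]
```

Chaining `double()` (an inherited method) into `add_one()` (a subclass method) keeps the subclass type throughout, which is the behavior the PR provides for `DataFrame`.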
### Why are the changes needed?
In typical development settings, tables with very different business
meanings are all mapped to the same `DataFrame` class. Inheriting from the
pyspark `DataFrame` class is cumbersome because of the chainable methods,
and it also makes it difficult to abstract frequently used queries. The proposal
is to provide a `DynamicDataFrame` that enables easy inheritance, retaining the
`DataFrame` methods without losing chainability, either for the newly defined
queries or for the usual dataframe ones.
In our experience, this allowed us to iterate much faster, generating
business-centric classes in a couple of lines of code.
### Does this PR introduce _any_ user-facing change?
Yes: it adds a new class. It doesn't change any already existing code.
### How was this patch tested?
There is no test suite yet, since we are not sure how to properly test
the new API (we are inclined to generate one test case per method, ensuring
that each one returns an object of the proper type). We add here two code examples:
#### Inheriting from DataFrame
This shows the typical issues encountered when inheriting from pyspark's
`DataFrame`. It works on the `master` branch as well.
```python
import pyspark
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

spark = pyspark.sql.SparkSession.builder.getOrCreate()


class Inventory(DataFrame):
    def __init__(self, df: DataFrame):
        super().__init__(df._jdf, df.sql_ctx)

    def update_prices(self, factor: float = 2.0):
        return self.withColumn("price", F.col("price") * factor)


base_dataframe = spark.createDataFrame(
    data=[["product_1", 2.0], ["product_2", 4.0]],
    schema=["name", "price"],
)
inventory = Inventory(base_dataframe)
inventory_updated = inventory.update_prices(2.0)
print("inventory_updated.show():")
inventory_updated.show()
print("But after one use of the query we have a plain dataframe again")
print(f"type(inventory_updated): {type(inventory_updated)}")
# This would raise an AttributeError:
# inventory_updated.update_prices(5.0)
print("The same happens when we use DataFrame methods")
expensive_inventory = inventory.filter(F.col("price") > 3.0)
print("expensive_inventory.show():")
expensive_inventory.show()
print(f"type(expensive_inventory): {type(expensive_inventory)}")
```
and its output:
```
inventory_updated.show():
+---------+-----+
| name|price|
+---------+-----+
|product_1| 4.0|
|product_2| 8.0|
+---------+-----+
But after one use of the query we have a plain dataframe again
type(inventory_updated): <class 'pyspark.sql.dataframe.DataFrame'>
The same happens when we use DataFrame methods
expensive_inventory.show():
+---------+-----+
| name|price|
+---------+-----+
|product_2| 4.0|
+---------+-----+
type(expensive_inventory): <class 'pyspark.sql.dataframe.DataFrame'>
```
#### Inheritance from DynamicDataFrame
This is what inheritance looks like with `DynamicDataFrame`; it runs on
this PR's branch:
```python
import pyspark
from pyspark.sql import DynamicDataFrame
from pyspark.sql import functions as F

spark = pyspark.sql.SparkSession.builder.getOrCreate()


class Inventory(DynamicDataFrame):
    def update_prices(self, factor: float = 2.0):
        return self.withColumn("price", F.col("price") * factor)


base_dataframe = spark.createDataFrame(
    data=[["product_1", 2.0], ["product_2", 4.0]],
    schema=["name", "price"],
)
print("Doing an inheritance mediated by DynamicDataFrame")
inventory = Inventory(base_dataframe)
inventory_updated = inventory.update_prices(2.0).update_prices(5.0)
print("inventory_updated.show():")
inventory_updated.show()
print("After multiple uses of the query we still have the desired type")
print(f"type(inventory_updated): {type(inventory_updated)}")
print("We can still use the usual dataframe methods")
expensive_inventory = inventory_updated.filter(F.col("price") > 25)
print("expensive_inventory.show():")
expensive_inventory.show()
print("And retain the desired type")
print(f"type(expensive_inventory): {type(expensive_inventory)}")
```
and its output:
```
Doing an inheritance mediated by DynamicDataFrame
inventory_updated.show():
+---------+-----+
| name|price|
+---------+-----+
|product_1| 20.0|
|product_2| 40.0|
+---------+-----+
After multiple uses of the query we still have the desired type
type(inventory_updated): <class '__main__.Inventory'>
We can still use the usual dataframe methods
expensive_inventory.show():
+---------+-----+
| name|price|
+---------+-----+
|product_2| 40.0|
+---------+-----+
And retain the desired type
type(expensive_inventory): <class '__main__.Inventory'>
```
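The testing strategy suggested above (one test case per chainable method, asserting that the returned object keeps the subclass type) could take roughly this shape. This is a self-contained, pyspark-free sketch; `Frame` and `MyFrame` are hypothetical stand-ins for `DataFrame` and a user subclass, not the real API:

```python
# Sketch of the proposed per-method type-preservation tests.
# Frame/MyFrame are illustrative stand-ins, not pyspark classes.
class Frame:
    """Stand-in base class with two chainable methods."""
    def filter(self):
        return type(self)()

    def select(self):
        return type(self)()


class MyFrame(Frame):
    pass


def test_methods_preserve_subclass_type():
    mdf = MyFrame()
    # One assertion per chainable method: the result must keep the subclass
    for method in ("filter", "select"):
        result = getattr(mdf, method)()
        assert isinstance(result, MyFrame), method


test_methods_preserve_subclass_type()
print("all chainable methods preserved the subclass type")
```

For the real API, the same loop could be generated over the list of methods that the `DynamicDataFrame` autogenerator wraps, so the test suite stays in sync with the generated class.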
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]