pabloalcain opened a new pull request #35045:
URL: https://github.com/apache/spark/pull/35045
This is still missing typing support and a test suite for the API, but it is
a functional implementation that showcases the examples written down in the PR
and the JIRA ticket.
### What changes were proposed in this pull request?
It adds a new class, `DynamicDataFrame`, that lets users subclass
`DataFrame` without losing method chainability. The PR also includes the script
that autogenerates the `DynamicDataFrame` class code, in case any reviewer
wants to remove one of the methods inherited by default.
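The core idea can be sketched without pyspark: wrap each chainable method of a base class so that results returned as the plain base type are re-wrapped in the caller's subclass. The names below (`Base`, `DynamicBase`, `_rewrap`) are illustrative stand-ins, not the PR's actual implementation:

```python
# Pyspark-free sketch of the re-wrapping idea behind DynamicDataFrame.
# All names here are illustrative assumptions, not the PR's real code.
import functools


class Base:
    """Stand-in for DataFrame, with one chainable method."""
    def __init__(self, data):
        self.data = data

    def double(self):
        # Returns a plain Base, just like DataFrame methods return DataFrame
        return Base([x * 2 for x in self.data])


def _rewrap(method):
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        result = method(self, *args, **kwargs)
        # Re-wrap plain Base results in the caller's own subclass
        if isinstance(result, Base) and not isinstance(result, type(self)):
            result = type(self)(result.data)
        return result
    return wrapper


class DynamicBase(Base):
    pass


# Wrap the chainable methods once, at class-definition time
for name in ["double"]:
    setattr(DynamicBase, name, _rewrap(getattr(Base, name)))


class MyData(DynamicBase):
    def add_one(self):
        return type(self)([x + 1 for x in self.data])


d = MyData([1, 2]).double().add_one()
print(type(d).__name__, d.data)  # MyData [3, 5]
```

Chaining `double()` (an inherited method) into `add_one()` (a subclass method) keeps the subclass type throughout, which is the behavior the PR provides for `DataFrame`.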
### Why are the changes needed?
In typical development settings, tables with very different business
meanings are all mapped to the same `DataFrame` class. Inheriting from the
pyspark `DataFrame` class is cumbersome because of the chainable methods,
and it also makes it difficult to abstract frequently used queries. The proposal
is to provide a `DynamicDataFrame` that enables easy inheritance, retaining the
`DataFrame` methods without losing chainability, either for the newly defined
queries or for the usual dataframe ones.
In our experience, this allowed us to iterate much faster, generating
business-centric classes in a couple of lines of code.
### Does this PR introduce _any_ user-facing change?
Yes: it adds a new class. It doesn't change any already existing code.
### How was this patch tested?
There is no test suite yet, since we are not sure how to properly test
the new API (we are inclined to generate one test case per method, ensuring
that each one returns an object of the proper type). We add here two code examples:
#### Inheriting from DataFrame
This shows the typical issues encountered when inheriting from pyspark's
`DataFrame`. It works on the `master` branch as well.
```python
import pyspark
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

spark = pyspark.sql.SparkSession.builder.getOrCreate()


class Inventory(DataFrame):
    def __init__(self, df: DataFrame):
        super().__init__(df._jdf, df.sql_ctx)

    def update_prices(self, factor: float = 2.0):
        return self.withColumn("price", F.col("price") * factor)


base_dataframe = spark.createDataFrame(
    data=[["product_1", 2.0], ["product_2", 4.0]],
    schema=["name", "price"],
)
inventory = Inventory(base_dataframe)
inventory_updated = inventory.update_prices(2.0)
print("inventory_updated.show():")
inventory_updated.show()
print("But after one use of the query we have a plain dataframe again")
print(f"type(inventory_updated): {type(inventory_updated)}")
# This would raise an AttributeError:
# inventory_updated.update_prices(5.0)
print("The same happens when we use DataFrame methods")
expensive_inventory = inventory.filter(F.col("price") > 3.0)
print("expensive_inventory.show():")
expensive_inventory.show()
print(f"type(expensive_inventory): {type(expensive_inventory)}")
```
and its output:
```
inventory_updated.show():
+---------+-----+
| name|price|
+---------+-----+
|product_1| 4.0|
|product_2| 8.0|
+---------+-----+
But after one use of the query we have a plain dataframe again
type(inventory_updated): <class 'pyspark.sql.dataframe.DataFrame'>
The same happens when we use DataFrame methods
expensive_inventory.show():
+---------+-----+
| name|price|
+---------+-----+
|product_2| 4.0|
+---------+-----+
type(expensive_inventory): <class 'pyspark.sql.dataframe.DataFrame'>
```
#### Inheritance from DynamicDataFrame
This is what inheritance looks like with `DynamicDataFrame`; it runs on
this PR's branch:
```python
import pyspark
from pyspark.sql import DynamicDataFrame
from pyspark.sql import functions as F

spark = pyspark.sql.SparkSession.builder.getOrCreate()


class Inventory(DynamicDataFrame):
    def update_prices(self, factor: float = 2.0):
        return self.withColumn("price", F.col("price") * factor)


base_dataframe = spark.createDataFrame(
    data=[["product_1", 2.0], ["product_2", 4.0]],
    schema=["name", "price"],
)
print("Doing an inheritance mediated by DynamicDataFrame")
inventory = Inventory(base_dataframe)
inventory_updated = inventory.update_prices(2.0).update_prices(5.0)
print("inventory_updated.show():")
inventory_updated.show()
print("After multiple uses of the query we still have the desired type")
print(f"type(inventory_updated): {type(inventory_updated)}")
print("We can still use the usual dataframe methods")
expensive_inventory = inventory_updated.filter(F.col("price") > 25)
print("expensive_inventory.show():")
expensive_inventory.show()
print("And retain the desired type")
print(f"type(expensive_inventory): {type(expensive_inventory)}")
```
and its output:
```
Doing an inheritance mediated by DynamicDataFrame
inventory_updated.show():
+---------+-----+
| name|price|
+---------+-----+
|product_1| 20.0|
|product_2| 40.0|
+---------+-----+
After multiple uses of the query we still have the desired type
type(inventory_updated): <class '__main__.Inventory'>
We can still use the usual dataframe methods
expensive_inventory.show():
+---------+-----+
| name|price|
+---------+-----+
|product_2| 40.0|
+---------+-----+
And retain the desired type
type(expensive_inventory): <class '__main__.Inventory'>
```
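The testing strategy suggested above (one test case per chainable method, asserting that the returned object keeps the subclass type) could take roughly this shape. This is a self-contained, pyspark-free sketch; `Frame` and `MyFrame` are hypothetical stand-ins for `DataFrame` and a user subclass, not the real API:

```python
# Sketch of the proposed per-method type-preservation tests.
# Frame/MyFrame are illustrative stand-ins, not pyspark classes.
class Frame:
    """Stand-in base class with two chainable methods."""
    def filter(self):
        return type(self)()

    def select(self):
        return type(self)()


class MyFrame(Frame):
    pass


def test_methods_preserve_subclass_type():
    mdf = MyFrame()
    # One assertion per chainable method: the result must keep the subclass
    for method in ("filter", "select"):
        result = getattr(mdf, method)()
        assert isinstance(result, MyFrame), method


test_methods_preserve_subclass_type()
print("all chainable methods preserved the subclass type")
```

For the real API, the same loop could be generated over the list of methods that the `DynamicDataFrame` autogenerator wraps, so the test suite stays in sync with the generated class.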
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]