aa1371 opened a new pull request #35083:
URL: https://github.com/apache/spark/pull/35083


   JIRA: https://issues.apache.org/jira/browse/SPARK-37798
   
   pandas currently supports a `how="cross"` merge, which returns the cartesian 
product of the left and right tables. In Spark, the same result can be achieved with 
`pyspark.sql.DataFrame.join(..., on=None, how="inner")`.
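   To make the cartesian-product semantics concrete, here is a minimal sketch in plain pandas (not Spark), assuming pandas >= 1.2 so that `how="cross"` is available. It compares the built-in cross merge with the older constant-key idiom; the `_key` column name is a hypothetical helper used only for this illustration.

```python
import pandas as pd

left = pd.DataFrame({"name": ["Bill", "Mary", "Ted"], "age": [23, 33, 36]})
right = pd.DataFrame({"job": ["President", "Senator"], "min_age": [35, 30]})

# Built-in cross merge: every left row is paired with every right row.
cross = left.merge(right, how="cross")

# Constant-key idiom: give both sides the same dummy key, join on it,
# then drop it. "_key" is a hypothetical name chosen for this sketch.
via_key = (
    left.assign(_key=1)
    .merge(right.assign(_key=1), on="_key")
    .drop(columns="_key")
)

# The cross product has len(left) * len(right) rows.
print(len(cross), len(left) * len(right))
```

   Both frames should contain the same six row pairs, which is the result an unconditioned inner join is expected to produce.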
   
   Additionally, I am currently adding conditional merging to 
the pandas API (see PR here: https://github.com/pandas-dev/pandas/pull/42964). 
This is much easier to achieve in Spark, since the functionality is already 
available and we can trivially expose it in the pyspark pandas API. Given the 
demand for this functionality (numerous Stack Overflow questions and pandas 
issues either asking how to do this or asking questions that it would solve), 
I think it would be worth adding even before it makes it into the core pandas API.
   
   These changes will be purely incremental on top of the existing API, and 
will be completely backwards compatible.
   
   Still need to add tests and docstring examples.
   
   
   **Examples:**
   
   **Example DFs:**
   ```
   >>> df1 = pd.DataFrame([['Bill', 23], ['Mary', 33], ['Ted', 36]],
   ...                    columns=['name', 'age'])
   >>> df2 = pd.DataFrame([['President', 35], ['Senator', 30]],
   ...                    columns=['job', 'min_age'])
   >>> df1
      name  age
   0  Bill   23
   1  Mary   33
   2   Ted   36
   
   >>> df2
            job  min_age
   0  President       35
   1    Senator       30
   ```
   
   **Cross Merge Example:**
   ```
   >>> df1.merge(df2, how="cross")
      name  age        job  min_age
   0  Bill   23  President       35
   1  Bill   23    Senator       30
   2  Mary   33  President       35
   3  Mary   33    Senator       30
   4   Ted   36  President       35
   5   Ted   36    Senator       30
   ```
   
   **Conditional Merge Example:**
   ```
   >>> df1.merge(df2, on=lambda left, right: left.age >= right.min_age)
      name  age        job  min_age
   0  Mary   33    Senator       30
   1   Ted   36  President       35
   2   Ted   36    Senator       30
   ```
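   For comparison, the conditional merge above can be approximated in current pandas by a cross merge followed by a boolean filter. This is only a sketch of the intended semantics, not the implementation in this PR: `conditional_merge` is a hypothetical helper, and it assumes pandas >= 1.2 and that the two frames have no column names in common, so the combined frame can stand in for both `left` and `right` in the callable.

```python
import pandas as pd


def conditional_merge(left, right, on):
    """Hypothetical helper: cross merge, then keep pairs where `on` holds.

    Assumes `left` and `right` share no column names, so the merged frame
    can be passed to `on` as both arguments.
    """
    pairs = left.merge(right, how="cross")  # all len(left) * len(right) pairs
    mask = on(pairs, pairs)                 # boolean Series, one entry per pair
    return pairs[mask].reset_index(drop=True)


df1 = pd.DataFrame([["Bill", 23], ["Mary", 33], ["Ted", 36]],
                   columns=["name", "age"])
df2 = pd.DataFrame([["President", 35], ["Senator", 30]],
                   columns=["job", "min_age"])

result = conditional_merge(
    df1, df2, on=lambda left, right: left.age >= right.min_age
)
print(result)
```

   Note that this approach materializes the full cartesian product before filtering, so it is only practical for small inputs; pushing the condition into the join itself avoids that intermediate result.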
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


