HyukjinKwon opened a new pull request, #46129:
URL: https://github.com/apache/spark/pull/46129
### What changes were proposed in this pull request?
This PR proposes to have a parent `pyspark.sql.DataFrame` class which
`pyspark.sql.connect.dataframe.DataFrame` and
`pyspark.sql.classic.dataframe.DataFrame` inherit.
**Before**
1. `pyspark.sql.DataFrame` (Spark Classic)
- docstrings
- Spark Classic logic
2. `pyspark.sql.connect.dataframe.DataFrame` (Spark Connect)
- Spark Connect logic
3. Users can only see the type hints from `pyspark.sql.DataFrame`.
**After**
1. `pyspark.sql.DataFrame` (Common)
- docstrings
- Support classmethod usages (dispatch to either Spark Connect or Spark
Classic)
2. `pyspark.sql.classic.dataframe.DataFrame` (Spark Classic)
- Spark Classic logic
3. `pyspark.sql.connect.dataframe.DataFrame` (Spark Connect)
- Spark Connect logic
4. Users can only see the type hints from `pyspark.sql.DataFrame`.
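The "After" layout can be illustrated with a minimal, PySpark-free sketch. It is not the code in this PR: `dispatch_to_subclass`, `ClassicDataFrame`, `ConnectDataFrame`, and the `rows`/`plan` attributes are hypothetical names chosen for illustration. It only shows how a common parent that carries docstrings can forward a class-style call to whichever implementation matches the concrete type of the first argument.

```python
import functools


def dispatch_to_subclass(method):
    """Hypothetical helper: forward a call made through the parent class to
    the implementation defined on the concrete type of the first argument."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        impl = getattr(type(self), method.__name__)
        if impl is wrapper:
            # The concrete type did not override the method.
            raise NotImplementedError(method.__name__)
        return impl(self, *args, **kwargs)

    return wrapper


class DataFrame:
    """Common parent: docstrings and dispatch only, no backend logic."""

    @dispatch_to_subclass
    def union(self, other):
        """Return a new DataFrame containing the rows of both inputs."""
        ...


class ClassicDataFrame(DataFrame):
    """Stand-in for the Spark Classic implementation (assumption)."""

    def __init__(self, rows):
        self.rows = list(rows)

    def union(self, other):
        return ClassicDataFrame(self.rows + other.rows)


class ConnectDataFrame(DataFrame):
    """Stand-in for the Spark Connect implementation (assumption)."""

    def __init__(self, plan):
        self.plan = plan

    def union(self, other):
        return ConnectDataFrame(("union", self.plan, other.plan))


# Calling the method through the parent class reaches the Connect stand-in.
df = ConnectDataFrame(("range", 10))
result = DataFrame.union(df, df)
```

In this sketch, `isinstance(df, DataFrame)` is true for both stand-ins because they inherit from the common parent, and `DataFrame.union(df, df)` never touches backend-specific attributes of the other implementation.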
### Why are the changes needed?
This fixes two issues with the current structure in Spark Connect:
1. Support calling regular methods through the class (classmethod-style usage), e.g.,
```python
from pyspark.sql import DataFrame
df = spark.range(10)
DataFrame.union(df, df)
```
**Before**
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/dataframe.py", line 4809, in union
return DataFrame(self._jdf.union(other._jdf), self.sparkSession)
^^^^^^^^^
File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1724, in __getattr__
raise PySparkAttributeError(
pyspark.errors.exceptions.base.PySparkAttributeError:
[JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jdf` is not supported in Spark
Connect as it depends on the JVM. If you need to use this attribute, do not use
Spark Connect when creating your session. Visit
https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession
for creating regular Spark Session in detail.
```
**After**
```
DataFrame[id: bigint]
```
2. Support `isinstance` checks, e.g.,
```python
from pyspark.sql import DataFrame
isinstance(spark.range(1), DataFrame)
```
**Before**
```
False
```
**After**
```
True
```
### Does this PR introduce _any_ user-facing change?
Yes, as described above.
### How was this patch tested?
Manually tested; CI should verify the changes.
### Was this patch authored or co-authored using generative AI tooling?
No.