HyukjinKwon opened a new pull request, #43784:
URL: https://github.com/apache/spark/pull/43784
### What changes were proposed in this pull request?
This PR proposes to support using Python Data Sources with SQL, SparkR, and
all other existing combinations by wrapping the Python Data Source in the DSv1
interface.
The approach is as follows:
1. PySpark registers a Python Data Source under its short name.
2. Later, when the Data Source is looked up, the JVM dynamically creates a
class that inherits the DSv1 class and carries the same short name as
registered in Python.
3. The generated class invokes the Python Data Source, so it works wherever
DSv1 works, including SparkR, Scala, and all SQL code paths.
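The dynamic class in step 2 is generated on the JVM side in Scala. The following plain-Python sketch (all names hypothetical, not the actual implementation) only illustrates the idea of minting a per-source wrapper class at lookup time:

```python
# Hedged sketch: the real wrapper is a Scala class generated on the JVM side.
# This plain-Python analogue only shows dynamically creating a subclass whose
# short name matches the name registered from Python.
class DataSourceV1Base:
    """Stand-in for the DSv1 interface (hypothetical)."""

    def short_name(self):
        raise NotImplementedError


def make_wrapper(short_name):
    # Mint a new subclass at lookup time that reports the registered name.
    return type(
        f"PythonWrapped_{short_name}",
        (DataSourceV1Base,),
        {"short_name": lambda self: short_name},
    )


Wrapper = make_wrapper("test")
print(Wrapper.__name__)        # PythonWrapped_test
print(Wrapper().short_name())  # test
```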
For example:
```python
from pyspark.sql.datasource import DataSource, DataSourceReader


class SimpleDataSourceReader(DataSourceReader):
    def __init__(self, paths, options):
        self.paths = paths
        self.options = options

    def partitions(self):
        return iter(self.paths)

    def read(self, path):
        yield (path, 1)


class SimpleDataSource(DataSource):
    @classmethod
    def name(cls) -> str:
        return "test"

    def schema(self) -> str:
        return "id STRING, value INT"

    def reader(self, schema):
        return SimpleDataSourceReader(self.paths, self.options)
```
```python
spark.dataSource.register(SimpleDataSource)
spark.sql("CREATE TABLE tblA USING test")
spark.sql("SELECT value FROM tblA").show()
```
results in:
```
+-----+
|value|
+-----+
| 1|
+-----+
```
_There are limitations and follow-ups to address:_
1. The dynamically generated class name should be changed from
`org.apache.spark.sql.execution.datasources.PythonTableScan` to something that
maps to the individual Python Data Source, so the classes are not confused
with one another.
2. Every load of a Python Data Source creates a new dynamically generated
class.
3. If you save this table and then restart the driver, the table cannot be
loaded because the dynamically generated class no longer exists. We should
figure out a way to reload them.
4. Multiple paths are not supported (inherited from DSv1).
5. Using a DSv1 wrapper might block implementing a commit protocol in the
Python Data Source.
### Why are the changes needed?
So that Python Data Sources can be used everywhere else, including SparkR and
Scala.
### Does this PR introduce _any_ user-facing change?
Yes. Users can register their Python Data Sources and use them in SQL,
SparkR, etc.
### How was this patch tested?
Unit tests were added, and the change was manually tested.
### Was this patch authored or co-authored using generative AI tooling?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]