HyukjinKwon opened a new pull request, #44233: URL: https://github.com/apache/spark/pull/44233
### What changes were proposed in this pull request?

This PR is another approach to https://github.com/apache/spark/pull/43784. It proposes to make Python Data Sources usable with SQL (in favour of https://github.com/apache/spark/pull/43949), SparkR, and all other existing combinations by wrapping the Python Data Source in the DSv2 interface (while still using the `V1Table` interface).

The approach is: one Python Data Source wrapper looks up Python data sources.

Self-contained working example:

```python
from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition


class TestDataSourceReader(DataSourceReader):
    def __init__(self, options):
        self.options = options

    def partitions(self):
        return [InputPartition(i) for i in range(3)]

    def read(self, partition):
        yield partition.value, str(partition.value)


class TestDataSource(DataSource):
    @classmethod
    def name(cls):
        return "test"

    def schema(self):
        return "x INT, y STRING"

    def reader(self, schema) -> "DataSourceReader":
        return TestDataSourceReader(self.options)
```

```python
spark.dataSource.register(TestDataSource)
sql("CREATE TABLE tblA USING test")
sql("SELECT * from tblA").show()
```

results in:

```
+---+---+
|  x|  y|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
+---+---+
```

_There are limitations and followups to make:_

1. Statically loading Python Data Sources is still not supported (SPARK-45917)

### Why are the changes needed?

So that Python Data Sources can be used in all other places, including SparkR and Scala.

### Does this PR introduce _any_ user-facing change?

Yes. Users can register their Python Data Sources and use them in SQL, SparkR, etc.

### How was this patch tested?

Unit tests were added, and the change was tested manually.

### Was this patch authored or co-authored using generative AI tooling?

No.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
