HyukjinKwon opened a new pull request, #44233:
URL: https://github.com/apache/spark/pull/44233

   ### What changes were proposed in this pull request?
   
   This PR is an alternative approach to https://github.com/apache/spark/pull/43784.
   It proposes making Python Data Sources usable from SQL (in favour of
   https://github.com/apache/spark/pull/43949), SparkR, and all other existing
   combinations by wrapping the Python Data Source in the DSv2 interface (while
   still using the `V1Table` interface).
   
   The approach is that a single Python Data Source wrapper looks up Python data sources.
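
As a purely conceptual illustration of the "one wrapper looks up data sources" idea, here is a minimal sketch in plain Python. The names `PythonDataSourceRegistry`, `PythonDataSourceWrapper`, and `ExampleSource` are hypothetical and are not taken from this PR or from Spark; the sketch only assumes that each data source reports a short name via `name()`, as in the example in this description.

```python
class PythonDataSourceRegistry:
    """Hypothetical name-to-class registry; not Spark's actual implementation."""

    def __init__(self):
        self._sources = {}

    def register(self, source_cls):
        # Key each data source class by the short name it reports via name().
        self._sources[source_cls.name()] = source_cls

    def lookup(self, short_name):
        if short_name not in self._sources:
            raise KeyError(f"Python data source not registered: {short_name}")
        return self._sources[short_name]


class PythonDataSourceWrapper:
    """Hypothetical single wrapper that resolves any registered source by name."""

    def __init__(self, registry):
        self._registry = registry

    def resolve(self, short_name, options):
        # One wrapper serves every source: look up the class, then instantiate it.
        source_cls = self._registry.lookup(short_name)
        return source_cls(options)


class ExampleSource:
    """Hypothetical data source used only to exercise the sketch."""

    def __init__(self, options):
        self.options = options

    @classmethod
    def name(cls):
        return "example"


registry = PythonDataSourceRegistry()
registry.register(ExampleSource)
wrapper = PythonDataSourceWrapper(registry)
source = wrapper.resolve("example", {"key": "value"})
print(type(source).__name__, source.options)
```

The point of the design is that callers only ever need the one wrapper plus a short name, which is what lets a `USING <name>` clause reach a data source defined in Python.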
   
   Self-contained working example:
   
   ```python
   from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition
   
   class TestDataSourceReader(DataSourceReader):
       def __init__(self, options):
           self.options = options
       def partitions(self):
           return [InputPartition(i) for i in range(3)]
       def read(self, partition):
           yield partition.value, str(partition.value)
   
   class TestDataSource(DataSource):
       @classmethod
       def name(cls):
           return "test"
       def schema(self):
           return "x INT, y STRING"
       def reader(self, schema) -> "DataSourceReader":
           return TestDataSourceReader(self.options)
   ```
   
   ```python
   spark.dataSource.register(TestDataSource)
   spark.sql("CREATE TABLE tblA USING test")
   spark.sql("SELECT * FROM tblA").show()
   ```
   
   results in:
   
   ```
   +---+---+
   |  x|  y|
   +---+---+
   |  0|  0|
   |  1|  1|
   |  2|  2|
   +---+---+
   ```
   
   _There are limitations and followups to make:_
   
   1. Statically loading Python Data Sources is still not supported (SPARK-45917).
   
   ### Why are the changes needed?
   
   So that Python Data Sources can be used everywhere else, including SparkR
   and Scala.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. Users can register their Python Data Sources and use them in SQL,
   SparkR, etc.
   
   ### How was this patch tested?
   
   Unit tests were added, and the change was also tested manually.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.



