HyukjinKwon opened a new pull request, #43784:
URL: https://github.com/apache/spark/pull/43784
### What changes were proposed in this pull request?
This PR proposes to support using Python Data Sources with SQL, SparkR, and
all other existing combinations by wrapping the Python Data Source in the DSv1
interface.
The approach is as follows:
1. PySpark registers a Python Data Source under its short name.
2. Later, when the Data Source is looked up, the JVM dynamically creates a
class that inherits the DSv1 class and carries the same short name as
registered in Python.
3. The generated class invokes the Python Data Source, so it works wherever
DSv1 works, including SparkR, Scala, and all SQL code paths.
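The dynamic class in step 2 is generated on the JVM side in Scala. The following plain-Python sketch (all names hypothetical, not the actual implementation) only illustrates the idea of minting a per-source wrapper class at lookup time:

```python
# Hedged sketch: the real wrapper is a Scala class generated on the JVM side.
# This plain-Python analogue only shows dynamically creating a subclass whose
# short name matches the name registered from Python.
class DataSourceV1Base:
    """Stand-in for the DSv1 interface (hypothetical)."""

    def short_name(self):
        raise NotImplementedError


def make_wrapper(short_name):
    # Mint a new subclass at lookup time that reports the registered name.
    return type(
        f"PythonWrapped_{short_name}",
        (DataSourceV1Base,),
        {"short_name": lambda self: short_name},
    )


Wrapper = make_wrapper("test")
print(Wrapper.__name__)        # PythonWrapped_test
print(Wrapper().short_name())  # test
```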
For example:
```python
from pyspark.sql.datasource import DataSource, DataSourceReader


class SimpleDataSourceReader(DataSourceReader):
    def __init__(self, paths, options):
        self.paths = paths
        self.options = options

    def partitions(self):
        return iter(self.paths)

    def read(self, path):
        yield (path, 1)


class SimpleDataSource(DataSource):
    @classmethod
    def name(cls) -> str:
        return "test"

    def schema(self) -> str:
        return "id STRING, value INT"

    def reader(self, schema):
        return SimpleDataSourceReader(self.paths, self.options)
```
```python
spark.dataSource.register(SimpleDataSource)
spark.sql("CREATE TABLE tblA USING test")
spark.sql("SELECT value FROM tblA").show()
```
results in:
```
+-----+
|value|
+-----+
| 1|
+-----+
```
_There are limitations and follow-ups to address:_
1. The dynamically generated class name should be changed from
`org.apache.spark.sql.execution.datasources.PythonTableScan` to something that
maps to the individual Python Data Source, so the classes are not confused
with one another.
2. Every load of a Python Data Source creates a new dynamically generated
class.
3. If you save this table and then restart the driver, the table cannot be
loaded because the dynamically generated class no longer exists. We should
figure out a way to reload them.
4. Multiple paths are not supported (inherited from DSv1).
5. Using a DSv1 wrapper might block implementing a commit protocol in the
Python Data Source.
### Why are the changes needed?
So that Python Data Sources can be used everywhere else, including SparkR and
Scala.
### Does this PR introduce _any_ user-facing change?
Yes. Users can register their Python Data Sources and use them in SQL,
SparkR, etc.
### How was this patch tested?
Unit tests were added, and the change was manually tested.
### Was this patch authored or co-authored using generative AI tooling?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]