[
https://issues.apache.org/jira/browse/SPARK-47793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jungtaek Lim reassigned SPARK-47793:
------------------------------------
Assignee: Chaoqin Li
> Implement SimpleDataSourceStreamReader for python streaming data source
> -----------------------------------------------------------------------
>
> Key: SPARK-47793
> URL: https://issues.apache.org/jira/browse/SPARK-47793
> Project: Spark
> Issue Type: New Feature
> Components: PySpark, SS
> Affects Versions: 3.5.1
> Reporter: Chaoqin Li
> Assignee: Chaoqin Li
> Priority: Major
> Labels: pull-request-available
>
> SimpleDataSourceStreamReader is a simplified version of the DataStreamReader
> interface.
> # It doesn’t require developers to reason about data partitioning.
> # It doesn’t require getting the latest offset before reading data.
> There are 3 functions that needs to be defined
> 1. Read data and return the end offset.
> _def read(self, start: Offset) -> (Iterator[Tuple], Offset)_
> 2. Read data between start and end offset, this is required for exactly once
> read.
> _def read2(self, start: Offset, end: Offset) -> Iterator[Tuple]_
> 3. initial start offset of the streaming query.
> def initialOffset() -> dict
> Implementation: Wrap the SimpleDataSourceStreamReader instance in a
> DataSourceStreamReader internally and make the prefetching and caching
> transparent to the data source developer. The record prefetched in python
> process will be sent to JVM as arrow record batches.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]