rangadi opened a new pull request, #40586:
URL: https://github.com/apache/spark/pull/40586
### What changes were proposed in this pull request?
This adds core streaming API support for Spark Connect. With this, we can
run majority of streaming queries. All the sources and syncs are supported.
Most of the aggregations are supported.
Examples of features that are not yet supported: APIs that run user codes
like streaming listener, `foreachBatch()` API etc.
The remaining missing APIs will be added soon.
How to try it in local mode (`./bin/pyspark --remote "local[*]"`):
```
>>>
>>> query = (
... spark
... .readStream
... .format("rate")
... .option("numPartitions", "1")
... .load()
... .withWatermark("timestamp", "1 minute")
... .groupBy(window("timestamp", "10 seconds"))
... .count() # count for each 10 sedonds.
... .writeStream
... .format("memory")
... .queryName("rate_table")
... .trigger(processingTime="10 seconds")
... .start()
... )
>>>
>>> query.isActive
True
>>>
>>> >>> spark.sql("select window.start, count from rate_table").show()
+-------------------+-----+
| start|count|
+-------------------+-----+
|2023-03-11 22:45:40| 6|
|2023-03-11 22:45:50| 10|
+-------------------+-----+
```
### Does this PR introduce _any_ user-facing change?
This is needed to run streaming queries over Spark Connect.
### How was this patch tested?
- Manually tested.
- We will enable most of streaming python tests in follow up PRs.
-
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]