ericm-db opened a new pull request, #53898:
URL: https://github.com/apache/spark/pull/53898
### What changes were proposed in this pull request?
This PR adds a `name()` method to the `DataStreamReader` class in Classic
PySpark. The method lets users assign a name to a streaming source. That name
is used in checkpoint metadata, which keeps checkpoint locations stable as
sources evolve.
Changes include:
- Add `name()` method to `DataStreamReader` in
`python/pyspark/sql/streaming/readwriter.py`
- Add comprehensive test suite in
`python/pyspark/sql/tests/streaming/test_streaming_reader_name.py`
- Update compatibility test to mark `name` as currently missing from Connect
(until the Connect PR merges)
The method validates that its `source_name` argument contains only ASCII
letters, digits, and underscores, raising `PySparkTypeError` or
`PySparkValueError` for invalid inputs.
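
For illustration only, the validation rule described above could be sketched roughly as follows. This is not the code from the PR: `validate_source_name` is a hypothetical helper, and the built-in `TypeError` and `ValueError` stand in for `PySparkTypeError` and `PySparkValueError` so that the snippet runs without a Spark installation. The split between the two error kinds (non-string input versus a string with disallowed characters) is also an assumption.

```python
import re

# ASCII letters, digits, and underscores only; at least one character.
# The explicit character class avoids matching non-ASCII word characters.
_VALID_SOURCE_NAME = re.compile(r"[A-Za-z0-9_]+")


def validate_source_name(source_name):
    """Hypothetical helper: return source_name if valid, otherwise raise."""
    # Assumed mapping: non-string input (None, int, ...) is a type error.
    if not isinstance(source_name, str):
        raise TypeError(
            f"source_name must be a str, got {type(source_name).__name__}"
        )
    # Assumed mapping: empty strings and disallowed characters are value errors.
    if not _VALID_SOURCE_NAME.fullmatch(source_name):
        raise ValueError(
            f"invalid source name {source_name!r}: only ASCII letters, "
            "digits, and underscores are allowed"
        )
    return source_name
```

Under these assumptions, `validate_source_name("my_source")` returns the name unchanged, while names such as `"my-source"` or `""` raise `ValueError` and `None` raises `TypeError`.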
### Why are the changes needed?
This brings Classic PySpark to feature parity with the Scala/Java API for
streaming source naming. The `name()` method is essential for:
1. Identifying sources in checkpoint metadata
2. Enabling stable checkpoint locations during source evolution
3. Providing consistency across Classic and Connect implementations
### Does this PR introduce _any_ user-facing change?
Yes. Users can now call `.name()` on `DataStreamReader` in Classic PySpark:
```python
spark.readStream.format("parquet").name("my_source").load("/path")
```
### How was this patch tested?
- Added comprehensive unit tests in `test_streaming_reader_name.py` covering:
  - Valid name patterns (letters, digits, underscores)
  - Invalid names (hyphens, spaces, dots, special characters, empty strings,
    None, wrong types)
  - Method chaining
  - Different data formats (parquet, json)
  - Integration with streaming queries
- Updated compatibility tests to account for the current state where Classic
has `name` but Connect doesn't yet
### Was this patch authored or co-authored using generative AI tooling?
No.