PorridgeSwim opened a new pull request, #56459: URL: https://github.com/apache/spark/pull/56459
### What changes were proposed in this pull request?

Re-lands [SPARK-56975](https://issues.apache.org/jira/browse/SPARK-56975) (#56017, reverted in #56189) as a complete, config-gated change. `DataStreamReader.table()` now rejects a user-specified schema set via `.schema(...)` instead of silently ignoring it:

- New internal conf `spark.sql.streaming.disallowUserSpecifiedSchemaInTable.enabled` (default `true`); setting it to `false` restores the previous silent-ignore behavior.
- New error condition `STREAMING_USER_SPECIFIED_SCHEMA_NOT_ALLOWED_IN_TABLE` (sqlState `0A000`) whose message instructs the user to remove the `.schema(...)` call or flip the conf. The error helper lives in `CompilationErrors` (sql/api) so both Classic and the Connect Scala client share it.
- **Classic**: check in `classic/DataStreamReader.table()`. Classic PySpark is covered automatically via the JVM.
- **Connect**: the user-specified schema is dropped client-side (`Read.NamedTable` proto has no schema field, so the server never sees it), so the check lives in the Scala and Python Connect clients. The conf is fetched from the server only when a schema was actually set, so the common path adds no extra RPC.
- Migration guide entry in `docs/streaming/ss-migration-guide.md` ("Upgrading from Structured Streaming 4.2 to 5.0").

This supersedes the unmergeable follow-up #56153.

### Why are the changes needed?

Users can write `readStream.schema(s).table(name)`, see a working query, and reasonably assume `s` took effect - when in fact it is dropped and the catalog schema is used. Since catalog tables declare their own schemas, specifying one via `.schema()` is always a misconfiguration. Other `DataStreamReader` paths already surface this (`.load()` throws for unsupported providers; `.changes()` calls `assertNoSpecifiedSchema()`); `.table()` was the silent exception. The original fix (#56017) was reverted because it was an ungated behavior-breaking change.
This re-land addresses the review feedback: the new behavior is gated behind a conf that can be turned off, and the error message tells the user exactly how to fix or opt out.

### Does this PR introduce _any_ user-facing change?

Yes. Calling `DataStreamReader.schema(...)` before `DataStreamReader.table(tableName)` now throws an `AnalysisException` with error condition `STREAMING_USER_SPECIFIED_SCHEMA_NOT_ALLOWED_IN_TABLE` by default, in both Spark Classic and Spark Connect (Scala and Python). Previously the user-specified schema was silently ignored. Setting `spark.sql.streaming.disallowUserSpecifiedSchemaInTable.enabled` to `false` restores the previous behavior. Compared to released Spark versions this is a new behavior change; within master it re-lands a previously reverted change, now config-gated.

### How was this patch tested?

New/updated tests:

- `DataStreamTableAPISuite`: rejection with the new error condition, and silent-ignore behavior when the conf is off (in both cases the query runs fine with the catalog schema once `.schema(...)` is removed/ignored).
- `ClientStreamingQuerySuite` (Connect Scala E2E): rejection by default and catalog schema used when the conf is off.
- `pyspark.sql.tests.streaming.test_streaming`: same coverage for Python; runs against Classic and against Connect via `test_parity_streaming`.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-fable-5)

This pull request and its description were written by Isaac.
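
The Connect item in the proposed changes describes a client-side check that consults the server conf only when a schema was actually set. The control flow can be sketched in plain Python as below. This is a toy model under stated assumptions, not the code in this patch: `ToyStreamReader`, `CountingConf`, and `SchemaNotAllowedError` are hypothetical names, the returned dict merely stands in for a `Read.NamedTable` plan, and the error text is a paraphrase. Only the conf key and the error condition string are taken from the description.

```python
from typing import Callable, Dict, Optional

# Both strings below are quoted from the PR description.
CONF_KEY = "spark.sql.streaming.disallowUserSpecifiedSchemaInTable.enabled"
ERROR_CONDITION = "STREAMING_USER_SPECIFIED_SCHEMA_NOT_ALLOWED_IN_TABLE"


class SchemaNotAllowedError(Exception):
    """Hypothetical stand-in for the AnalysisException the PR describes."""

    def __init__(self, table_name: str) -> None:
        # Paraphrased message; the real wording is defined by the patch.
        super().__init__(
            f"[{ERROR_CONDITION}] remove the .schema(...) call before "
            f".table('{table_name}'), or set {CONF_KEY} to false"
        )
        self.condition = ERROR_CONDITION


class CountingConf:
    """Hypothetical conf fetcher that counts lookups (each models one RPC)."""

    def __init__(self, values: Dict[str, str]) -> None:
        self.values = values
        self.calls = 0

    def __call__(self, key: str) -> str:
        self.calls += 1
        # The PR states that the conf defaults to true.
        return self.values.get(key, "true")


class ToyStreamReader:
    """Hypothetical model of a Connect client reader; not Spark's class."""

    def __init__(self, get_conf: Callable[[str], str]) -> None:
        self._get_conf = get_conf
        self._schema: Optional[str] = None

    def schema(self, schema: str) -> "ToyStreamReader":
        self._schema = schema
        return self

    def table(self, name: str) -> Dict[str, str]:
        # Consult the conf only when a schema was set, so the common
        # path (no .schema(...) call) performs no conf lookup at all.
        if self._schema is not None:
            if self._get_conf(CONF_KEY).lower() == "true":
                raise SchemaNotAllowedError(name)
        # The plan carries only the table name; any schema is dropped here.
        return {"named_table": name}
```

With the conf left at its default, `ToyStreamReader(conf).schema("id INT").table("t")` raises `SchemaNotAllowedError` after exactly one lookup. Without a `.schema(...)` call, `conf.calls` stays at 0. With the conf set to `"false"`, the call succeeds and the schema is silently dropped, which models the previous behavior.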
