PorridgeSwim opened a new pull request, #56459:
URL: https://github.com/apache/spark/pull/56459

   ### What changes were proposed in this pull request?
   
   Re-lands [SPARK-56975](https://issues.apache.org/jira/browse/SPARK-56975) 
(#56017, reverted in #56189) as a complete, config-gated change. 
`DataStreamReader.table()` now rejects a user-specified schema set via 
`.schema(...)` instead of silently ignoring it:
   
   - New internal conf 
`spark.sql.streaming.disallowUserSpecifiedSchemaInTable.enabled` (default 
`true`); setting it to `false` restores the previous silent-ignore behavior.
   - New error condition `STREAMING_USER_SPECIFIED_SCHEMA_NOT_ALLOWED_IN_TABLE` 
(sqlState `0A000`) whose message instructs the user to remove the 
`.schema(...)` call or flip the conf. The error helper lives in 
`CompilationErrors` (sql/api) so both Classic and the Connect Scala client 
share it.
   - **Classic**: check in `classic/DataStreamReader.table()`. Classic PySpark 
is covered automatically via the JVM.
   - **Connect**: the `Read.NamedTable` proto has no schema field, so the
user-specified schema is dropped client-side and the server never sees it. The
check therefore lives in the Scala and Python Connect clients. The conf is
fetched from the server only when a schema was actually set, so the common path
adds no extra RPC.
   - Migration guide entry in `docs/streaming/ss-migration-guide.md` 
("Upgrading from Structured Streaming 4.2 to 5.0").
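
The Connect-side gating described in the list above can be modeled with a small, self-contained Python sketch. It is a toy model and not the actual client code: `FakeServer`, `SketchStreamReader`, `SchemaNotAllowedError`, the dict standing in for the plan, and the wording of the error message are all hypothetical. Only the conf key and the error condition name are taken from this PR. The point it illustrates is that the conf lookup happens only when `.schema(...)` was called.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Names taken from the PR description.
CONF_KEY = "spark.sql.streaming.disallowUserSpecifiedSchemaInTable.enabled"
ERROR_CONDITION = "STREAMING_USER_SPECIFIED_SCHEMA_NOT_ALLOWED_IN_TABLE"


class SchemaNotAllowedError(Exception):
    """Hypothetical stand-in for the AnalysisException Spark raises."""


@dataclass
class FakeServer:
    """Hypothetical server stub; each conf lookup counts as one RPC."""
    confs: Dict[str, str] = field(default_factory=lambda: {CONF_KEY: "true"})
    rpc_count: int = 0

    def get_conf(self, key: str) -> str:
        self.rpc_count += 1
        return self.confs[key]


@dataclass
class SketchStreamReader:
    """Toy model of a Connect-style stream reader (not the real class)."""
    server: FakeServer
    user_schema: Optional[str] = None

    def schema(self, ddl: str) -> "SketchStreamReader":
        self.user_schema = ddl
        return self

    def table(self, name: str) -> dict:
        # Consult the server only if a schema was set, so the common
        # path (no .schema(...) call) costs no extra round trip.
        if self.user_schema is not None:
            if self.server.get_conf(CONF_KEY).lower() == "true":
                # Illustrative message text, not Spark's actual wording.
                raise SchemaNotAllowedError(
                    f"[{ERROR_CONDITION}] remove the .schema(...) call "
                    f"or set {CONF_KEY} to false"
                )
        # The plan carries only the table name; the schema is never sent.
        return {"named_table": name}
```

In this model, `SketchStreamReader(server).table("t")` leaves `server.rpc_count` untouched, while chaining `.schema("id INT")` first triggers exactly one lookup and then either raises or falls through, depending on the conf value.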
   
   This supersedes the unmergeable follow-up #56153.
   
   ### Why are the changes needed?
   
   Users can write `readStream.schema(s).table(name)`, see a working query, and
reasonably assume `s` took effect, when in fact it is dropped and the catalog
schema is used. Since catalog tables declare their own schemas, specifying one
via `.schema()` is always a misconfiguration. Other `DataStreamReader` paths
already surface this (`.load()` throws for unsupported providers; `.changes()`
calls `assertNoSpecifiedSchema()`); `.table()` was the one path that stayed
silent.
   
   The original fix (#56017) was reverted because it was an ungated 
behavior-breaking change. This re-land addresses the review feedback: the new 
behavior is gated behind a conf that can be turned off, and the error message 
tells the user exactly how to fix or opt out.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. Calling `DataStreamReader.schema(...)` before 
`DataStreamReader.table(tableName)` now throws an `AnalysisException` with 
error condition `STREAMING_USER_SPECIFIED_SCHEMA_NOT_ALLOWED_IN_TABLE` by 
default, in both Spark Classic and Spark Connect (Scala and Python). Previously 
the user-specified schema was silently ignored. Setting 
`spark.sql.streaming.disallowUserSpecifiedSchemaInTable.enabled` to `false` 
restores the previous behavior. Compared to released Spark versions this is a 
new behavior change; within master it re-lands a previously reverted change, 
now config-gated.
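
For existing PySpark code, the two ways out described above might look like the following sketch. The helper names `read_table_stream` and `read_table_stream_legacy` are hypothetical, and the sketch assumes the conf can be set on the session at runtime through `spark.conf.set`. The `spark` argument is expected to be an existing `SparkSession`.

```python
# Conf key taken from the PR description.
CONF_KEY = "spark.sql.streaming.disallowUserSpecifiedSchemaInTable.enabled"


def read_table_stream(spark, table_name):
    """Preferred fix: drop .schema(...) so the catalog schema is used."""
    return spark.readStream.table(table_name)


def read_table_stream_legacy(spark, table_name, schema):
    """Opt out: turn the check off, then keep the old call chain.

    The user-specified schema is still ignored here; this only
    restores the previous silent-ignore behavior.
    """
    spark.conf.set(CONF_KEY, "false")
    return spark.readStream.schema(schema).table(table_name)
```

The first helper is the safer choice, since the schema passed in the second one has no effect on the resulting stream either way.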
   
   ### How was this patch tested?
   
   New/updated tests:
   - `DataStreamTableAPISuite`: rejection with the new error condition by
default, and silent-ignore behavior when the conf is off. In both cases the
query runs with the catalog schema once `.schema(...)` is removed or ignored.
   - `ClientStreamingQuerySuite` (Connect Scala E2E): rejection by default and 
catalog schema used when the conf is off.
   - `pyspark.sql.tests.streaming.test_streaming`: same coverage for Python; 
runs against Classic and against Connect via `test_parity_streaming`.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (claude-fable-5)
   
   This pull request and its description were written by Isaac.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

