SidGanesh41053 opened a new pull request, #53395: URL: https://github.com/apache/spark/pull/53395
### What changes were proposed in this pull request?

This PR improves the error message when users attempt to use HDFS with `TransformWithState` and multiple column families.

- Ticket: https://issues.apache.org/jira/browse/SPARK-51376

**Changes:**

1. Added a new error class `UNSUPPORTED_FEATURE.STATE_STORE_MULTIPLE_COLUMN_FAMILIES_HDFS` with a specific error message
2. Updated `HDFSBackedStateStoreProvider` to throw the new HDFS-specific error instead of the generic error in 4 locations:
   - `createColFamilyIfAbsent` method
   - `assertUseOfDefaultColFamily` method
   - `init` method
   - `getChangeDataReader` method
3. Updated the test in `ValueStateSuite` to expect the new error class and exception type

**Before:**
- Error: `UNSUPPORTED_FEATURE.STATE_STORE_MULTIPLE_COLUMN_FAMILIES`
- Message: "Creating multiple column families with <stateStoreProvider> is not supported."
- Problem: Does not indicate that HDFS is the issue or provide guidance on how to fix it

**After:**
- Error: `UNSUPPORTED_FEATURE.STATE_STORE_MULTIPLE_COLUMN_FAMILIES_HDFS`
- Message: "HDFS is not supported with TransformWithState when using multiple column families. Please use RocksDBStateStoreProvider."
- Solution: Clearly states the HDFS limitation and recommends RocksDB as the alternative

### Why are the changes needed?

The current error message when using HDFS with `TransformWithState` and multiple column families is unhelpful:

- It doesn't explicitly state that HDFS is not supported
- It doesn't provide guidance on how to resolve the issue
- Users are left confused about what to do next

This change provides a clear, actionable error message that:

- Explicitly identifies HDFS as the unsupported state store provider
- Recommends using RocksDB as the solution
- Improves user experience by reducing confusion and support requests

### Does this PR introduce _any_ user-facing change?

**Yes.**

**Previous behavior:** When using HDFS with `TransformWithState` and multiple column families, users would see:

```
UNSUPPORTED_FEATURE.STATE_STORE_MULTIPLE_COLUMN_FAMILIES
Creating multiple column families with <stateStoreProvider> is not supported.
```

**New behavior:** Users will now see:

```
UNSUPPORTED_FEATURE.STATE_STORE_MULTIPLE_COLUMN_FAMILIES_HDFS
HDFS is not supported with TransformWithState when using multiple column families. Please use RocksDBStateStoreProvider.
```

This is a user-facing change that improves error-message clarity and provides actionable guidance.

### How was this patch tested?

1. **Unit test updated**: The existing test `colFamily with HDFSBackedStateStoreProvider should fail` in `ValueStateSuite.scala` was updated to:
   - Expect the new exception type `StateStoreHDFSMultipleColumnFamiliesNotSupportedException`
   - Verify the new error class `UNSUPPORTED_FEATURE.STATE_STORE_MULTIPLE_COLUMN_FAMILIES_HDFS`
   - Verify the error message matches the expected text
2. **Manual verification**: The error message can be verified in spark-shell:

```scala
import org.apache.spark.sql.execution.streaming.state.StateStoreErrors

val ex = StateStoreErrors.hdfsMultipleColumnFamiliesNotSupported()
println(ex.getMessage)
// Output: HDFS is not supported with TransformWithState when using multiple column families. Please use RocksDBStateStoreProvider.
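
// The structured error class can be checked as well. This is a hedged sketch:
// it assumes the new StateStoreHDFSMultipleColumnFamiliesNotSupportedException
// follows the existing StateStore*Exception convention of implementing
// SparkThrowable, whose getErrorClass accessor returns the error class name
// (deprecated in favor of getCondition on recent master).
println(ex.getErrorClass)
// Expected: UNSUPPORTED_FEATURE.STATE_STORE_MULTIPLE_COLUMN_FAMILIES_HDFS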
```

First, build the modified modules:

```
./build/mvn -Dscalastyle.skip=true -Dcheckstyle.skip=true package
```

Second, bring up spark-shell with the following command:

```
./bin/spark-shell --conf spark.driver.extraClassPath=$(pwd)/common/utils/target/original-spark-common-utils_2.13-4.0.1.jar:$(pwd)/sql/core/target/original-spark-sql_2.13-4.0.1.jar \
  --conf spark.executor.extraClassPath=$(pwd)/common/utils/target/original-spark-common-utils_2.13-4.0.1.jar:$(pwd)/sql/core/target/original-spark-sql_2.13-4.0.1.jar
```

This is needed because the build writes updated JARs to `sql/core/target/` and `common/utils/target/`, but a vanilla spark-shell never sees the updated classes unless it is pointed at them. Both the driver and executor class paths are set so that the Scala code run in spark-shell loads the standard distribution JARs already bundled under `assembly/` alongside the modified JARs produced by the build.

### Was this patch authored or co-authored using generative AI tooling?

Yes (mainly for help with syntax, not with the logic). This PR does not change any state store functionality; only the user-facing error message is updated.
