SidGanesh41053 opened a new pull request, #53395: URL: https://github.com/apache/spark/pull/53395
### What changes were proposed in this pull request?

This PR improves the error message when users attempt to use HDFS with `TransformWithState` and multiple column families.

- Ticket: https://issues.apache.org/jira/browse/SPARK-51376

**Changes:**

1. Added a new error class `UNSUPPORTED_FEATURE.STATE_STORE_MULTIPLE_COLUMN_FAMILIES_HDFS` with a specific error message
2. Updated `HDFSBackedStateStoreProvider` to throw the new HDFS-specific error instead of the generic error in 4 locations:
   - `createColFamilyIfAbsent` method
   - `assertUseOfDefaultColFamily` method
   - `init` method
   - `getChangeDataReader` method
3. Updated the test in `ValueStateSuite` to expect the new error class and exception type

**Before:**
- Error: `UNSUPPORTED_FEATURE.STATE_STORE_MULTIPLE_COLUMN_FAMILIES`
- Message: "Creating multiple column families with <stateStoreProvider> is not supported."
- Problem: Does not indicate that HDFS is the issue or provide guidance on how to fix it

**After:**
- Error: `UNSUPPORTED_FEATURE.STATE_STORE_MULTIPLE_COLUMN_FAMILIES_HDFS`
- Message: "HDFS is not supported with TransformWithState when using multiple column families. Please use RocksDBStateStoreProvider."
- Solution: Clearly states the HDFS limitation and recommends RocksDB as the alternative

### Why are the changes needed?

The current error message when using HDFS with `TransformWithState` and multiple column families is unhelpful:

- It doesn't explicitly state that HDFS is not supported
- It doesn't provide guidance on how to resolve the issue
- Users are left confused about what to do next

This change provides a clear, actionable error message that:

- Explicitly identifies HDFS as the unsupported state store provider
- Recommends using RocksDB as the solution
- Improves user experience by reducing confusion and support requests

### Does this PR introduce _any_ user-facing change?

**Yes.**

**Previous behavior:** When using HDFS with `TransformWithState` and multiple column families, users would see:

```
UNSUPPORTED_FEATURE.STATE_STORE_MULTIPLE_COLUMN_FAMILIES
Creating multiple column families with <stateStoreProvider> is not supported.
```

**New behavior:** Users will now see:

```
UNSUPPORTED_FEATURE.STATE_STORE_MULTIPLE_COLUMN_FAMILIES_HDFS
HDFS is not supported with TransformWithState when using multiple column families. Please use RocksDBStateStoreProvider.
```

This is a user-facing change that improves error-message clarity and provides actionable guidance.

### How was this patch tested?

1. **Unit test updated**: The existing test `colFamily with HDFSBackedStateStoreProvider should fail` in `ValueStateSuite.scala` was updated to:
   - Expect the new exception type `StateStoreHDFSMultipleColumnFamiliesNotSupportedException`
   - Verify the new error class `UNSUPPORTED_FEATURE.STATE_STORE_MULTIPLE_COLUMN_FAMILIES_HDFS`
   - Verify the error message matches the expected text
2. **Manual verification**: The error message can be verified in spark-shell:

```scala
import org.apache.spark.sql.execution.streaming.state.StateStoreErrors

val ex = StateStoreErrors.hdfsMultipleColumnFamiliesNotSupported()
println(ex.getMessage)
// Output: HDFS is not supported with TransformWithState when using multiple column families. Please use RocksDBStateStoreProvider.
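
// The structured error class can be checked as well. This is a hedged sketch:
// it assumes the new StateStoreHDFSMultipleColumnFamiliesNotSupportedException
// follows the existing StateStore*Exception convention of implementing
// SparkThrowable, whose getErrorClass accessor returns the error class name
// (deprecated in favor of getCondition on recent master).
println(ex.getErrorClass)
// Expected: UNSUPPORTED_FEATURE.STATE_STORE_MULTIPLE_COLUMN_FAMILIES_HDFS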
```

First, build the modified modules:

```
./build/mvn -Dscalastyle.skip=true -Dcheckstyle.skip=true package
```

Second, bring up spark-shell with the following command:

```
./bin/spark-shell --conf spark.driver.extraClassPath=$(pwd)/common/utils/target/original-spark-common-utils_2.13-4.0.1.jar:$(pwd)/sql/core/target/original-spark-sql_2.13-4.0.1.jar \
  --conf spark.executor.extraClassPath=$(pwd)/common/utils/target/original-spark-common-utils_2.13-4.0.1.jar:$(pwd)/sql/core/target/original-spark-sql_2.13-4.0.1.jar
```

This is needed because the build writes updated JARs to `sql/core/target/` and `common/utils/target/`, but a vanilla spark-shell never sees the updated classes unless it is pointed at them. Both the driver and executor class paths are set so that the Scala code run in spark-shell loads the standard distribution JARs already bundled under `assembly/` alongside the modified JARs produced by the build.

### Was this patch authored or co-authored using generative AI tooling?

Yes (mainly for help with syntax, not with the logic). This PR does not change any state store functionality; only the user-facing error message is updated.
