Adam Binford created SPARK-43951:
------------------------------------

             Summary: RocksDB state store can become corrupt on task retries
                 Key: SPARK-43951
                 URL: https://issues.apache.org/jira/browse/SPARK-43951
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: Adam Binford


A couple of our streaming jobs have failed since upgrading to Spark 3.4 with an 
error such as:

org.rocksdb.RocksDBException: Mismatch in unique ID on table file ###. 
Expected: [###,###} Actual\{###,###} in file ..../MANIFEST-####

This is due to the change from 
[https://github.com/facebook/rocksdb/commit/6de7081cf37169989e289a4801187097f0c50fae]
 that enabled unique ID checks by default, and I finally tracked down the exact 
sequence of steps that leads to this failure in the way RocksDB state store is 
used.
 # A task fails after uploading the checkpoint to HDFS. Lets say it uploaded 
11.zip to version 11 of the table, but the task failed before it could finish 
after successfully uploading the checkpoint.
 # The same task is retried and goes back to load version 10 of the table as 
expected.
 # Cleanup/maintenance is called for this partition, which looks in HDFS for 
persisted versions and sees up through version 11 since that zip file was 
successfully uploaded on the previous task.
 # As part of resolving what SST files are part of each table version, 
versionToRocksDBFiles.put(version, newResolvedFiles) is called for version 11 
with its SST files that were uploaded in the first failed task.
 # The second attempt at the task commits and goes to sync its checkpoint to 
HDFS.
 # versionToRocksDBFiles contains the SST files to upload from step 4, and 
these files are considered "the same" as what's in the local working dir 
because the name and file size match.
 # No SST files are uploaded because they matched above, but in reality the 
unique ID inside the SST files is different (presumably this is just randomly 
generated and inserted into each SST file?), it just doesn't affect the size.
 # A new METADATA file is uploaded which has the new unique IDs listed inside.
 # When version 11 of the table is read during the next batch, the unique IDs 
in the METADATA file don't match the unique IDS in the SST files, which causes 
the exception.

 

This is basically a ticking time bomb for anyone using RocksDB. Thoughts on 
possible fixes would be:
 * Disable unique ID verification. I don't currently see a binding for this in 
the RocksDB java wrapper, so that would probably have to be added first.
 * Disable checking if files are already uploaded with the same size, and just 
always upload SST files no matter what.
 * Update the "same file" check to also be able to do some kind of CRC 
comparison or something like that.
 * Update the mainteance/cleanup to not update the versionToRocksDBFiles map.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to