Adam Binford created SPARK-43951:
------------------------------------
Summary: RocksDB state store can become corrupt on task retries
Key: SPARK-43951
URL: https://issues.apache.org/jira/browse/SPARK-43951
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.4.0
Reporter: Adam Binford
A couple of our streaming jobs have failed since upgrading to Spark 3.4 with an
error such as:
org.rocksdb.RocksDBException: Mismatch in unique ID on table file ###.
Expected: [###,###} Actual\{###,###} in file ..../MANIFEST-####
This is due to the change from
[https://github.com/facebook/rocksdb/commit/6de7081cf37169989e289a4801187097f0c50fae]
that enabled unique ID checks by default, and I finally tracked down the exact
sequence of steps that leads to this failure in the way RocksDB state store is
used.
# A task fails after uploading the checkpoint to HDFS. Lets say it uploaded
11.zip to version 11 of the table, but the task failed before it could finish
after successfully uploading the checkpoint.
# The same task is retried and goes back to load version 10 of the table as
expected.
# Cleanup/maintenance is called for this partition, which looks in HDFS for
persisted versions and sees up through version 11 since that zip file was
successfully uploaded on the previous task.
# As part of resolving what SST files are part of each table version,
versionToRocksDBFiles.put(version, newResolvedFiles) is called for version 11
with its SST files that were uploaded in the first failed task.
# The second attempt at the task commits and goes to sync its checkpoint to
HDFS.
# versionToRocksDBFiles contains the SST files to upload from step 4, and
these files are considered "the same" as what's in the local working dir
because the name and file size match.
# No SST files are uploaded because they matched above, but in reality the
unique ID inside the SST files is different (presumably this is just randomly
generated and inserted into each SST file?), it just doesn't affect the size.
# A new METADATA file is uploaded which has the new unique IDs listed inside.
# When version 11 of the table is read during the next batch, the unique IDs
in the METADATA file don't match the unique IDS in the SST files, which causes
the exception.
This is basically a ticking time bomb for anyone using RocksDB. Thoughts on
possible fixes would be:
* Disable unique ID verification. I don't currently see a binding for this in
the RocksDB java wrapper, so that would probably have to be added first.
* Disable checking if files are already uploaded with the same size, and just
always upload SST files no matter what.
* Update the "same file" check to also be able to do some kind of CRC
comparison or something like that.
* Update the mainteance/cleanup to not update the versionToRocksDBFiles map.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]