Hello structured streaming experts!

We are getting SST FileNotFound state store corruption errors.

The root cause is a race condition where two different executors are doing
cleanup of the state store at the same time.  Both write the version of the
state zip file to DFS.

The first executor enters maintenance, writes SST files and writes the
636.zip file.

Concurrently the second executor enters maintenance, writes SST files and
writes 636.zip.

The SST files in dfs are written almost simultaneously and are:
   - 000867-4695ff6e-d69d-4792-bcd6-191b57eadb9d.sst   <-- from one executor
   - 000867-a71cb8b2-9ed8-4ec1-82e2-a406dd1fb949.sst   <-- from other
executor

The problem occurs during orphan file deletion (see this PR
<https://github.com/apache/spark/pull/39897/files>). The executor that lost
the race to write 636.zip decides that it will delete the SST file that is
actually referenced in 636.zip.

Currently the maintenance task does have some protection for files of
ongoing tasks. This is from the comments in RocksDBFileManager: "only
delete orphan files that are older than all tracked files when there are at
least 2 versions".

In this case the file that is being deleted is indeed older but only by a
small amount of time.

Instead of simply older, should there be some padding to allow for
maintenance being executed simultaneously on two executors?  Something like
at least 60s older than the oldest tracked file.

This should help avoid state store corruption at the expense of some
storage space in the DFS.

Thanks for any help or recommendations here.

Reply via email to