ableegoldman opened a new pull request #8962: URL: https://github.com/apache/kafka/pull/8962
Two more edge cases I found producing extra TaskCorruptedExceptions while playing around with the failing eos-beta upgrade test (sadly these are unrelated problems, as the test still fails with these fixes in place).

1. Need to write the checkpoint when recycling a standby: although we do preserve the changelog offsets when recycling a task, and should therefore write the offsets when the new task is itself closed, we do NOT write the checkpoint for uninitialized tasks. So if the new task is ultimately closed before it gets out of the CREATED state, the offsets will not be written and we can get a TaskCorruptedException.

2. With the change in task locking to address some Windows-related nonsense (am I remembering that correctly?), we no longer delete entire task directories but just clear the inner state. With EOS, during initialization we check whether the state directory is non-empty and the checkpoint is missing, and throw a TaskCorruptedException if so. But just opening a rocksdb store creates a `rocksdb` base dir in the task directory, so the `taskDirIsEmpty` check always fails and we always throw TaskCorruptedException even if there's nothing there.

We can fix 2. for rocksdb specifically, but this might still cause a headache for users of custom stores. Note that it's not a correctness issue, just an annoyance, so my take is that we should avoid large last-minute changes and just fix for rocksdb in 2.6. Then we can consider a more holistic fix going forward.

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
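
To illustrate the second case in the description, here is a minimal sketch of an emptiness check that ignores an empty `rocksdb` base dir. It is not the code from this PR: the class name `TaskDirCheck`, the helper `isEmptyDir`, and the `.lock` file name are assumptions made for illustration, and only `taskDirIsEmpty` and the `rocksdb` dir name come from the description.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class TaskDirCheck {
    // Assumed names, for illustration only; not taken from the Kafka source.
    static final String LOCK_FILE = ".lock";
    static final String ROCKSDB_BASE_DIR = "rocksdb";

    /** Returns true if the given directory has no entries. */
    static boolean isEmptyDir(Path dir) throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            return !entries.iterator().hasNext();
        }
    }

    /**
     * Treats the task directory as empty when it holds nothing except the
     * lock file and, optionally, an empty "rocksdb" base directory.
     */
    static boolean taskDirIsEmpty(Path taskDir) throws IOException {
        if (!Files.isDirectory(taskDir)) {
            return true; // nothing on disk at all
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(taskDir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                if (name.equals(LOCK_FILE)) {
                    continue; // the lock file is not state
                }
                if (name.equals(ROCKSDB_BASE_DIR)
                        && Files.isDirectory(entry)
                        && isEmptyDir(entry)) {
                    continue; // base dir created by opening a store, no data yet
                }
                return false; // anything else counts as leftover state
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path taskDir = Files.createTempDirectory("task-0_0");
        Path rocksDir = Files.createDirectory(taskDir.resolve(ROCKSDB_BASE_DIR));
        System.out.println(taskDirIsEmpty(taskDir)); // only an empty rocksdb dir

        Files.createDirectory(rocksDir.resolve("my-store"));
        System.out.println(taskDirIsEmpty(taskDir)); // a store dir now exists
    }
}
```

A check like this only covers rocksdb. A custom store that creates its own base dir on open would still make the directory look non-empty, which is the remaining annoyance the description mentions.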