divijvaidya commented on PR #14242: URL: https://github.com/apache/kafka/pull/14242#issuecomment-1778956865
Hey @ocadaruma This is an important change, so I despite low engagement on this from me, I do believe that this change is critical. Let's think carefully about this. I would like to discuss the changes individually. Let's start with (1) The side-effect of moving producer snapshot flush to async thread is that it while earlier it was guaranteed that producer snapshot is present and consistent when segment (and others) are flushed. If for some reason, the producer fsync failed, we would not have scheduled a flush for segment and friends. But now, since we are flushing snapshot async & quietly, it might be possible that we have segment and indexes on disk but we don't have a producer snapshot. This is ok for server restart, because on restart, we will rebuild the snapshot by scanning last few segments. This is ok even if the server doesn't restart because we will be using in-memory value of producer state and we might take another snapshot associated with next segment later. To summarize, Kafka does not expect producer snapshot on disk to be strongly consistent with rest of files such as log segment and transaction index. @jolshan (as an expert on trx index and producer snapshot) do you agree with this statement? If we agree on this, then (1) is a safe change IMO. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
