prashantwason commented on PR #9545: URL: https://github.com/apache/hudi/pull/9545#issuecomment-1702595033
@nsivabalan Did you consider adding a new command block type that could work as a commit marker?

Let's assume commit C1 needs to add 2 log blocks to a log file, and that the log file already has the following content (I am assuming appends are enabled on the log file for simplicity, but this should work with appends disabled too).

Current log file: [log_block_c0_1]

Commit C1 will then add 2 log blocks, resulting in:

Current log file: [log_block_c0_1, log_block_c1_1, log_block_c1_2]

The issue you have is that Spark stage retries can lead to repeated writes of log_block_c1_1 and log_block_c1_2.

Let's assume that all writes of log blocks must end with a valid commit command block:

log file: [log_block_c0_1, COMMIT_COMMAND_BLOCK_C0, log_block_c1_1, log_block_c1_2, COMMIT_COMMAND_BLOCK_C1]

If a valid commit command block is not found, then the preceding blocks are not valid. If multiple COMMIT_COMMAND_BLOCK_XX blocks are found, then the reader can choose the last one.

This idea is similar to how databases use START_COMMIT and END_COMMIT markers in a WAL (write-ahead log).
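To make the two reader rules concrete, here is a minimal sketch in Python. It is not Hudi code: `Block`, `valid_blocks`, and the `is_commit_marker` flag are hypothetical names invented for illustration, and the sketch assumes that every write attempt (including a retried one) appends its data blocks followed by its own commit marker. Under that assumption, "choose the last one" means keeping only the data blocks that sit directly before the last marker seen for a given commit.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for a log block; not a Hudi class."""
    commit: str                     # commit the block belongs to, e.g. "c1"
    payload: str = ""               # data payload; unused for commit markers
    is_commit_marker: bool = False  # True for a COMMIT_COMMAND_BLOCK


def valid_blocks(log: List[Block]) -> List[Block]:
    """Return the data blocks a reader would treat as valid.

    Rule 1: data blocks not followed by a commit marker for their
            commit are dropped.
    Rule 2: if a commit has several markers (e.g. after a stage retry),
            only the blocks covered by the last marker are kept.
    """
    committed: Dict[str, List[Block]] = {}
    pending: List[Block] = []  # data blocks seen since the previous marker
    for block in log:
        if not block.is_commit_marker:
            pending.append(block)
            continue
        # A later marker for the same commit overwrites the earlier attempt.
        committed[block.commit] = [b for b in pending if b.commit == block.commit]
        pending = []
    # Anything still pending here has no commit marker and is ignored.
    return [b for blocks in committed.values() for b in blocks]


# C1 is written twice because of a retry; C2 never gets its marker.
example_log = [
    Block("c0", "a"), Block("c0", is_commit_marker=True),
    Block("c1", "b"), Block("c1", "c"), Block("c1", is_commit_marker=True),
    Block("c1", "b"), Block("c1", "c"), Block("c1", is_commit_marker=True),
    Block("c2", "d"),
]
print([b.payload for b in valid_blocks(example_log)])  # ['a', 'b', 'c']
```

One limitation of this sketch: a partially written attempt that has no marker, followed directly by a complete attempt for the same commit, would not be told apart. The comment above does not say how that case should be handled, so the sketch leaves it open.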
