nbalajee opened a new pull request, #9035: URL: https://github.com/apache/hudi/pull/9035
…retries

### Change Logs

During Spark stage retries, the Spark driver may already have all the information it needs to reconcile the commit and proceed with the next steps, while a stray executor is still writing to a data file and completes later (after the reconcile step, but before the JVM exits). The extra files left on the dataset are excluded from the commit reconciliation step, so they can show up in query engines as duplicate records, which is a data quality issue. This change introduces completion markers, which aim to prevent the dataset from experiencing data quality issues in such corner cases. A planned future change would prevent the second and subsequent tasks/executors from creating additional files (with a different write token) and instead reuse the successfully completed files.

### Impact

Improved reliability and data quality when infrastructure-related failures result in stage/task retries.

### Risk level (write none, low medium or high below)

Low/Medium: This change has been in production at Uber for about a year.

### Documentation Update

- ENFORCE_COMPLETION_MARKER_CHECKS: Configures whether to fail the job or continue with retries when an already completed file is being retried. With a planned change, this would allow the second or a subsequent attempt to create a file to succeed by using the previously created copy of the data.
- ENFORCE_FINALIZE_WRITE_CHECK: Configures whether to fail the job if the commit reconciliation step has already completed and the write stage is retried (say, a block from the writeStatus RDD is found to be lost when iterating over write statuses for the record index update). When a single executor writes to multiple files (data spilling over to more than one file) and the stage fails after reconciliation, the result is a data quality issue. This flag helps by failing the job instead of creating an incorrect commit.
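To make the behavior of the two flags described above easier to picture, here is a minimal, language-neutral sketch in Python. It is not the code in this PR: the `WriteTracker` class, the exception names, the in-memory "storage", and the file naming are all hypothetical, and the only things taken from the description are the two flag names and the fail-or-continue semantics attributed to them.

```python
from dataclasses import dataclass, field


class CompletedFileRetriedError(RuntimeError):
    """A retry targeted a file that already has a completion marker."""


class WriteAfterFinalizeError(RuntimeError):
    """The write stage was retried after commit reconciliation finished."""


@dataclass
class WriteTracker:
    # Hypothetical stand-ins for the two configs named in the description.
    enforce_completion_marker_checks: bool = True
    enforce_finalize_write_check: bool = True
    # file_id -> name of the first data file that finished (its "marker").
    completed: dict = field(default_factory=dict)
    # Every data file name that landed on storage, stray ones included.
    on_storage: set = field(default_factory=set)
    finalized: bool = False

    def write(self, file_id: str, write_token: str) -> str:
        """Simulate one task attempt writing a data file for file_id."""
        if self.finalized and self.enforce_finalize_write_check:
            raise WriteAfterFinalizeError(file_id)
        if file_id in self.completed and self.enforce_completion_marker_checks:
            raise CompletedFileRetriedError(file_id)
        name = f"{file_id}_{write_token}.data"  # illustrative naming only
        self.on_storage.add(name)
        self.completed.setdefault(file_id, name)  # first finished attempt wins
        return name

    def finalize(self) -> set:
        """Mark reconciliation done; return files not backed by a marker."""
        self.finalized = True
        return self.on_storage - set(self.completed.values())


# With the marker check relaxed, a retry creates a second file with a
# different write token, and reconciliation reports it as an extra file.
tracker = WriteTracker(enforce_completion_marker_checks=False)
first = tracker.write("file-1", "attempt-0")
second = tracker.write("file-1", "attempt-1")
extra_files = tracker.finalize()
```

In this toy model, enabling the first flag turns the second `write` call into a job failure, and enabling the second flag rejects any write that arrives after `finalize()`, rather than letting a late file slip past reconciliation.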
### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [x] CI passed
