codope commented on PR #9035: URL: https://github.com/apache/hudi/pull/9035#issuecomment-3830100081
> ### Questions
> 1. Are there any blocking concerns or design decisions that prevented this from moving forward?

Apart from the few unresolved comment threads, we need to consider two main concerns:
- Overhead: I expect storage overhead to be minimal, since markers get cleaned up. Performance overhead is the bigger question: with large writes, a marker for every file adds I/O operations, and each marker write includes serialization + CRC calculation + file creation. It is probably worth doing this work to avoid the data duplication/corruption you pointed out, but it would be good to know how much overhead it incurs compared to the baseline.
- The TOCTOU issue raised in https://github.com/apache/hudi/pull/9035#pullrequestreview-2205683315 is a valid concern. I think `ignoreExisting` doesn't fully solve this. For cloud storage like S3, we need atomic conditional writes, or lease-based coordination through the timeline server.

> 2. Would it be acceptable for us to create a new PR based on this work with the necessary updates?

Yes, totally fine. In fact, a lot of code has changed since this PR, so it would be better to raise a new PR.

> 3. Is there any overlap or conflict with the RFC in [[HUDI-7967][RFC][WIP] Robust spark writes rfc #11593](https://github.com/apache/hudi/pull/11593) (Robust Spark Writes)?

@nsivabalan could add more. I think there is significant overlap, but one thing proposed in the RFC needs to be considered: file id generation in the driver (https://github.com/apache/hudi/pull/11593/changes#diff-0de0b9940a382f28dbf83c9007047ea84e0a61c553dfb6b55b279a327cc7f159R116). I also think we can extend the RFC with a formal spec and rationale. For example, the scenario (with diagram) mentioned in https://github.com/apache/hudi/pull/9035#issuecomment-1607857303 can be included in the RFC, followed by how the problem is being solved. We should also discuss the overhead and the TOCTOU issue in the RFC.
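To make the TOCTOU concern concrete for the RFC discussion, here is a minimal sketch on a local POSIX filesystem. It contrasts a check-then-create marker write with an atomic create-if-absent. Everything in it is an assumption for illustration: the function names, the JSON payload, and the "4-byte CRC32 + body" layout are hypothetical and are not Hudi's marker format or the code in this PR. Object stores such as S3 do not offer `O_EXCL`, so there the equivalent would have to be a conditional write or coordination through the timeline server.

```python
import json
import os
import tempfile
import zlib


def encode_marker(payload: dict) -> bytes:
    """Serialize the payload and prefix it with a CRC32 (hypothetical layout)."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return crc.to_bytes(4, "big") + body


def decode_marker(raw: bytes) -> dict:
    """Verify the CRC32 prefix and return the payload."""
    crc, body = int.from_bytes(raw[:4], "big"), raw[4:]
    if zlib.crc32(body) & 0xFFFFFFFF != crc:
        raise ValueError("marker CRC mismatch")
    return json.loads(body.decode("utf-8"))


def write_marker_check_then_create(path: str, payload: dict) -> bool:
    """Racy variant: another writer can create `path` between the
    existence check and the open, and both writers then 'win'."""
    if os.path.exists(path):          # time of check
        return False
    with open(path, "wb") as f:       # time of use
        f.write(encode_marker(payload))
    return True


def write_marker_atomic(path: str, payload: dict) -> bool:
    """Atomic variant: O_CREAT | O_EXCL makes 'create only if absent'
    a single operation, so exactly one writer succeeds."""
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "wb") as f:
        f.write(encode_marker(payload))
    return True


if __name__ == "__main__":
    marker = os.path.join(tempfile.mkdtemp(), "file-0001.marker")
    print(write_marker_atomic(marker, {"fileId": "f1", "attempt": 0}))
    print(write_marker_atomic(marker, {"fileId": "f1", "attempt": 1}))
```

Note that even the atomic variant only makes the creation atomic; a concurrent reader could still observe the file before its contents are written, which is one more reason the stricter guarantees belong in the RFC. The per-marker cost in this sketch (serialize, CRC, create) is also the part worth benchmarking against the baseline.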
> We have been running a similar implementation internally and can validate that the approach works well in production. Looking forward to helping get this merged.

That's great! In general, I am in favor of a design that trades theoretical race conditions for practical reliability improvements in the common case. But it's important that the RFC considers this trade-off and the alternatives that exist for stricter guarantees (we would probably have to move towards timeline-server-based coordination eventually).

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
