codope commented on PR #9035:
URL: https://github.com/apache/hudi/pull/9035#issuecomment-3830100081

   > ### Questions
   > 1. Are there any blocking concerns or design decisions that prevented this 
from moving forward?
   
   Apart from the few unresolved comment threads, we need to consider two main 
concerns:
   - Overhead: I expect storage overhead to be minimal since markers get 
cleaned up, but writing a marker for every file adds I/O operations, so large 
writes will see a performance cost. Each marker write includes serialization + 
CRC calculation + file creation. It is probably worth paying this cost to avoid 
the data duplication/corruption you pointed out, but it would be good to know 
how much overhead it incurs compared to the baseline.
   - The TOCTOU issue raised in 
https://github.com/apache/hudi/pull/9035#pullrequestreview-2205683315 is a 
valid concern. I think `ignoreExisting` doesn't fully solve this. For cloud 
storage like S3, we need atomic conditional writes or lease-based coordination 
through the timeline server.
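
   To illustrate the TOCTOU concern, here is a minimal sketch on a local 
filesystem, not Hudi code. The function names and the marker payload are 
hypothetical. The first function checks and then creates, which leaves a window 
in which two writers can both succeed; the second asks the filesystem to do 
create-if-absent as a single step. An object store would need an equivalent 
conditional-write primitive, which this sketch does not model.

```python
import os
import tempfile


def create_marker_check_then_act(path: str) -> bool:
    """Racy variant: another writer can create `path` between the check and the open."""
    if os.path.exists(path):           # time of check
        return False
    with open(path, "w") as f:         # time of use: may overwrite a concurrent marker
        f.write("marker")
    return True


def create_marker_atomic(path: str) -> bool:
    """Atomic variant: O_CREAT | O_EXCL fails if the file already exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False                   # someone else won the race
    with os.fdopen(fd, "w") as f:
        f.write("marker")
    return True


if __name__ == "__main__":
    marker = os.path.join(tempfile.mkdtemp(), "file-0001.marker")
    print(create_marker_atomic(marker))  # first writer creates the marker
    print(create_marker_atomic(marker))  # second attempt is rejected
```

   With the atomic variant exactly one caller gets `True` for a given path, 
which is the guarantee a check-then-create flow cannot give under concurrency.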
   
   > 2. Would it be acceptable for us to create a new PR based on this work 
with the necessary updates?
   
   Yes, totally fine. In fact, a lot of code has changed since this PR. It 
would be better to raise a new PR.
   
   > 3. Is there any overlap or conflict with the RFC in [[HUDI-7967][RFC][WIP] 
Robust spark writes rfc #11593](https://github.com/apache/hudi/pull/11593) 
(Robust Spark Writes)?
   
   @nsivabalan could add more. I think there is significant overlap, but one 
thing proposed in the RFC needs to be considered: file ID generation in the 
driver 
(https://github.com/apache/hudi/pull/11593/changes#diff-0de0b9940a382f28dbf83c9007047ea84e0a61c553dfb6b55b279a327cc7f159R116). 
I also think we can extend the RFC with a formal spec and rationale. For 
example, the scenario (with diagram) described in 
https://github.com/apache/hudi/pull/9035#issuecomment-1607857303 can be 
included in the RFC, followed by an explanation of how the problem is solved. 
We should also discuss the overhead and the TOCTOU issue in the RFC.
   
   > We have been running a similar implementation internally and can validate 
that the approach works well in production. Looking forward to helping get this 
merged.
   
   That's great! In general, I am in favor of a design that accepts theoretical 
race conditions in exchange for practical reliability improvements in the 
common case. But it's important that the RFC covers this trade-off and what 
alternatives exist for stricter guarantees (we would probably have to move 
towards timeline-server-based coordination eventually).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.