nsivabalan commented on issue #4778: URL: https://github.com/apache/hudi/issues/4778#issuecomment-1039519922
yes, unfortunately this behavior can't be fixed easily. For example:

1. insert rec1 at time t0
2. delete rec1 at time t100
3. re-insert rec1 at time t100000

Let's say there were hundreds of commits between (2) and (3). Hudi can't keep remembering every record it has ever seen. So after (2), on the next merge, Hudi removes rec1 from its storage, and if rec1 is ingested again later, Hudi considers it an insert record. The alternative would be for Hudi to track every record that was ever inserted and deleted, forever, which doesn't make sense for a large analytical storage system.

Regarding your statement "But the target table will contain a row with the values from the delete, whereas this row should not be inserted into the target table in any way (same as above).": the delete record will be ingested into Hudi, but with `_hoodie_is_deleted` set to true. During the next merge or compaction the record is removed, so it is part of storage only for a short duration.

Also, there are other ways to trigger deletes. Please check out the details [here](https://hudi.apache.org/blog/2020/01/15/delete-support-in-hudi/). Not all of them store the value with `_hoodie_is_deleted` set to true.
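To make the lifecycle above concrete, here is a minimal in-memory sketch of the soft-delete semantics described (the `Table` class and its methods are hypothetical illustrations, not Hudi's actual implementation; only the `_hoodie_is_deleted` field name comes from Hudi):

```python
# Simplified model: a delete is written as a tombstone payload with
# _hoodie_is_deleted=True; compaction physically removes it, after which
# the same key arriving again is treated as a fresh insert.

class Table:
    """Hypothetical sketch of an upsert table with soft deletes."""

    def __init__(self):
        self.records = {}  # record key -> payload dict

    def upsert(self, key, payload):
        # A delete payload still lands in storage; it just carries
        # the _hoodie_is_deleted flag until the next compaction.
        self.records[key] = payload

    def compact(self):
        # Merge/compaction drops flagged records, so the tombstone
        # lives in storage only for a short duration.
        self.records = {
            k: v for k, v in self.records.items()
            if not v.get("_hoodie_is_deleted", False)
        }

table = Table()
table.upsert("rec1", {"value": 1})                  # t0: insert
table.upsert("rec1", {"_hoodie_is_deleted": True})  # t100: delete (tombstone)
table.compact()                                     # rec1 removed from storage
table.upsert("rec1", {"value": 2})                  # t100000: re-insert,
                                                    # seen as a brand-new insert
```

After `compact()`, the table has no memory of rec1 ever existing, which is why the later upsert cannot be distinguished from a first-time insert.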
