rdblue commented on issue #351: Provide an API to modify records within files
URL: https://github.com/apache/incubator-iceberg/pull/351#issuecomment-518439477

My high-level feedback is that I'd prefer to reuse `OverwriteFiles` and update it to expose the behavior this change requires. If the required behavior is to fail when any file is added that matches the delete filter, then we can add a flag to enable that, like `validateAppendOnly` in `ReplacePartitions`. How about `validateNoConflictingAppends`?

I think we want to make both behaviors available, but we should also consider making failure on a detected conflict the default. `OverwriteFiles` currently implements an idempotent change: replace all data matching a filter with new data. The intent is for cases like overwriting an aggregation: you update the aggregation every hour and always produce a completely new copy, independent of the data in the table. But that use case is unlikely to run into a problem if `validateNoConflictingAppends` were the default. Instead of two concurrent runs both succeeding, one would fail.

If we want to fail on conflicting appends by default, then we could add `allowConflictingAppends` instead.

We would also want to decide whether the current overwrite behavior in Spark should allow conflicting appends. I think if the default is to not allow them, then we should go with that. We can add a write option to allow conflicting appends, like "is-idempotent".

@aokolnychyi, what do you think?
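
To make the two behaviors concrete, here is a rough, self-contained sketch of the idea. It is a toy in-memory model, not Iceberg code: the `Table`, `DataFile`, and `Overwrite` classes, the `hour=10` partition value, and the file names are all made up for illustration, and `validateNoConflictingAppends()` only mirrors the flag name proposed above. It assumes a conflict means "a file matching the delete filter was appended after this operation started".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class OverwriteSketch {

    /** A data file reduced to a path and the partition value the filter inspects. */
    static final class DataFile {
        final String path;
        final String partition;

        DataFile(String path, String partition) {
            this.path = path;
            this.partition = partition;
        }
    }

    /** Toy table; appendLog records every file ever added, in commit order. */
    static final class Table {
        final List<DataFile> live = new ArrayList<>();
        final List<DataFile> appendLog = new ArrayList<>();

        void append(DataFile file) {
            live.add(file);
            appendLog.add(file);
        }
    }

    /** Hypothetical stand-in for an overwrite operation; not the Iceberg class. */
    static final class Overwrite {
        private final Table table;
        private final Predicate<DataFile> deleteFilter;
        private final List<DataFile> added = new ArrayList<>();
        private final int baseAppendCount; // appends visible when the operation started
        private boolean validateNoConflictingAppends = false;

        Overwrite(Table table, Predicate<DataFile> deleteFilter) {
            this.table = table;
            this.deleteFilter = deleteFilter;
            this.baseAppendCount = table.appendLog.size();
        }

        Overwrite addFile(DataFile file) {
            added.add(file);
            return this;
        }

        Overwrite validateNoConflictingAppends() {
            this.validateNoConflictingAppends = true;
            return this;
        }

        void commit() {
            if (validateNoConflictingAppends) {
                // Only files appended after this operation started can conflict.
                List<DataFile> concurrent =
                    table.appendLog.subList(baseAppendCount, table.appendLog.size());
                for (DataFile file : concurrent) {
                    if (deleteFilter.test(file)) {
                        throw new IllegalStateException("Conflicting append: " + file.path);
                    }
                }
            }
            // Idempotent part: drop everything matching the filter, then add the new copy.
            table.live.removeIf(deleteFilter);
            for (DataFile file : added) {
                table.append(file);
            }
        }
    }

    /** Two overwrites of the same partition start from the same state; reports the second's fate. */
    static boolean secondCommitSucceeds(boolean validate) {
        Table table = new Table();
        table.append(new DataFile("old.parquet", "hour=10"));
        Predicate<DataFile> filter = f -> f.partition.equals("hour=10");

        Overwrite first = new Overwrite(table, filter)
            .addFile(new DataFile("run1.parquet", "hour=10"));
        Overwrite second = new Overwrite(table, filter)
            .addFile(new DataFile("run2.parquet", "hour=10"));
        if (validate) {
            second.validateNoConflictingAppends();
        }

        first.commit();
        try {
            second.commit();
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("without validation, second commit succeeds: "
            + secondCommitSucceeds(false));
        System.out.println("with validation, second commit succeeds: "
            + secondCommitSucceeds(true));
    }
}
```

Without the flag, both runs commit and the last writer silently replaces the first run's output, which is the current idempotent behavior. With the flag, the second run fails because the first run added a file matching its delete filter. Flipping the default would amount to initializing the boolean to `true` and exposing an `allowConflictingAppends()` method to turn it off.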
