aokolnychyi commented on issue #351: Provide an API to modify records within files
URL: https://github.com/apache/incubator-iceberg/pull/351#issuecomment-519237696

> Do we need to add this to the constructor? I think it should be possible to start an overwrite, then scan the same table state, where both operations use the current snapshot ID.

Could you elaborate a bit? I am not sure I got the use case. In general, knowing the snapshot that was read is essential to guarantee serializable isolation in case of updates/deletes. It also has to stay the same during retries as we have to analyze every operation that happened after we read the data. The main motivation to accept it in the constructor is to avoid asking the user to set one more parameter right. Without the correct base snapshot, `validateNoConflictingAppends` is useless. I have a test for such a use case [here](https://github.com/apache/incubator-iceberg/pull/351/files#diff-96ff93512d7be21e69f35e2bc96f03e9R525). For existing use cases with `overwriteByRowFilter`, it should not have any impact and is also not visible to the user.

> In these cases, the validation must be that the current snapshot ID has not changed, right?

Not necessarily, I guess. If there is a delete that removed other files (i.e. not the ones that we are rewriting), it should be OK to commit the update. So, I would probably rely on a more generic row filter being set (we do this internally now).

Based on the discussion, the API can look like:

```
OverwriteFiles overwriteByRowFilter(Expression expr);
OverwriteFiles addFile(DataFile file);
OverwriteFiles deleteFile(DataFile file);
OverwriteFiles validateAddedFilesMatchRowFilter();
OverwriteFiles validateNoConflictingAppends(Expression expr);
```

By default, the expr for `validateNoConflictingAppends` can be set to `false` to match the current behaviour.
As of now, if we are rewriting via `overwriteByRowFilter`, `OverwriteFiles` will successfully retry and delete matching files added in a concurrent transaction.
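
To make the validation idea concrete, here is a minimal, self-contained sketch of the check described above: walk every snapshot committed after the one that was read and see whether any added file matches the conflict filter. All types and names here (`DataFile`, `Snapshot`, `hasConflictingAppends`, `sampleHistory`) are hypothetical stand-ins for illustration, not Iceberg's classes or the implementation in this PR, and a plain `Predicate` stands in for `Expression`. It assumes Java 16+ for records.

```java
import java.util.List;
import java.util.function.Predicate;

public class ConflictCheckSketch {
  // Hypothetical stand-ins for illustration only; these are not Iceberg classes.
  record DataFile(String path, String partition) {}
  record Snapshot(long id, List<DataFile> addedFiles) {}

  /**
   * Returns true if any snapshot committed after the base snapshot added a file
   * that matches the conflict filter. The history is ordered oldest to newest.
   */
  static boolean hasConflictingAppends(
      List<Snapshot> history, long baseSnapshotId, Predicate<DataFile> conflictFilter) {
    boolean afterBase = false;
    for (Snapshot snapshot : history) {
      // Only snapshots strictly after the base snapshot can conflict with our read.
      if (afterBase && snapshot.addedFiles().stream().anyMatch(conflictFilter)) {
        return true;
      }
      if (snapshot.id() == baseSnapshotId) {
        afterBase = true;
      }
    }
    if (!afterBase) {
      throw new IllegalArgumentException("Unknown base snapshot: " + baseSnapshotId);
    }
    return false;
  }

  /** A tiny made-up history: the snapshot we read, then a delete, then an append. */
  static List<Snapshot> sampleHistory() {
    return List.of(
        new Snapshot(1L, List.of(new DataFile("a.parquet", "day=1"))),
        new Snapshot(2L, List.of()), // concurrent delete: adds no files
        new Snapshot(3L, List.of(new DataFile("b.parquet", "day=2"))));
  }

  public static void main(String[] args) {
    List<Snapshot> history = sampleHistory();
    Predicate<DataFile> day1 = f -> f.partition().equals("day=1");
    Predicate<DataFile> day2 = f -> f.partition().equals("day=2");

    // A filter of "false" never reports a conflict, matching the current behaviour.
    System.out.println(hasConflictingAppends(history, 1L, f -> false));
    // The concurrent append landed in a different partition than the one rewritten.
    System.out.println(hasConflictingAppends(history, 1L, day1));
    // The concurrent append matches the filter, so the commit should fail.
    System.out.println(hasConflictingAppends(history, 1L, day2));
  }
}
```

In this sketch, the intervening delete (snapshot 2) adds no files and therefore never blocks the commit, which matches the point that a changed current snapshot ID alone should not fail validation. The base snapshot ID must stay fixed across retries so that every later commit is re-examined each time.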
