rdblue commented on issue #351: Provide an API to modify records within files
URL: https://github.com/apache/incubator-iceberg/pull/351#issuecomment-519167629

> we need to track the base snapshot id when the data was read and it should not change during retries

Do we need to add this to the constructor? I think it should be possible to start an overwrite, then scan the same table state, where both operations use the current snapshot ID. If we want to be explicit about the snapshot that is read, we could alternatively add a method to set it.

> we will need some sort of overwriteFiles(deletedFiles, addedFiles) or deleteFile(file)

I think it makes sense to add `deleteFile`. This is already supported in the implementation, `MergingSnapshotProducer` that is used for almost all operations.

> in certain cases, we cannot convert all filters in query engines into equivalent Iceberg filters

In these cases, the validation must be that the current snapshot ID has not changed, right? In that case, we just need to specify that's the requirement with something like `validateNoConflictingAppends()` -- if there is no filter then it means any write in the entire table would conflict.

An alternative is to use a more generic filter. We can safely drop `and` predicates if they can't be converted. For example, if the filter is `date(ts) = '2019-06-01' AND hour(ts) = 10`, the second predicate can't be converted (there is no hour-of-day transform). But, we can make sure that none of the data for the entire day has changed instead of just hour 10.

> we cannot use the expression passed to overwriteByRowFilter for conflict resolution . . . validateNoConflictingAppends will probably need to accept a row filter as well.

Agreed. Passing a filter here would work well.

> validateAddedFiles that ensures that all files added by OverwriteFiles match the predicate passed in overwriteByRowFilter

We can update docs and the name as well. Maybe it should be `validateAddedFilesMatchRowFilter`.
In addition, I think we might want to add preconditions that check to make sure this method isn't used with other configuration methods that may conflict.
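
To illustrate the point above about dropping `and` predicates that can't be converted, here is a minimal sketch. It does not use Iceberg's expression classes: `Expr`, `Pred`, `And`, `relax`, and the `convertible` flag are all hypothetical names made up for this example. It only shows why the relaxed filter is safe for conflict validation: removing a conjunct can only widen the set of rows that are checked.

```java
import java.util.Optional;

// Hypothetical stand-ins for a query engine's filter tree; these are NOT Iceberg classes.
class FilterRelaxation {
    public interface Expr {}

    // A leaf predicate. `convertible` is an assumed flag that marks whether the
    // predicate has an equivalent table-level filter.
    public static final class Pred implements Expr {
        final String text;
        final boolean convertible;

        public Pred(String text, boolean convertible) {
            this.text = text;
            this.convertible = convertible;
        }

        @Override
        public String toString() {
            return text;
        }
    }

    public static final class And implements Expr {
        final Expr left;
        final Expr right;

        public And(Expr left, Expr right) {
            this.left = left;
            this.right = right;
        }

        @Override
        public String toString() {
            return left + " AND " + right;
        }
    }

    // Returns a filter that matches a superset of the rows matched by `expr`.
    // An empty result means nothing could be converted, so the caller has to
    // treat any write to the table as a potential conflict.
    public static Optional<Expr> relax(Expr expr) {
        if (expr instanceof And) {
            And and = (And) expr;
            Optional<Expr> left = relax(and.left);
            Optional<Expr> right = relax(and.right);
            if (left.isPresent() && right.isPresent()) {
                return Optional.of(new And(left.get(), right.get()));
            }
            // Dropping one side of an AND only widens the match, so it is safe.
            return left.isPresent() ? left : right;
        }
        Pred pred = (Pred) expr;
        return pred.convertible ? Optional.of(expr) : Optional.empty();
    }

    public static void main(String[] args) {
        Expr day = new Pred("date(ts) = '2019-06-01'", true);
        Expr hour = new Pred("hour(ts) = 10", false); // no hour-of-day transform
        System.out.println(relax(new And(day, hour)).orElse(null)); // prints date(ts) = '2019-06-01'
    }
}
```

The sketch handles only `and` and leaf predicates on purpose. The same shortcut would not be safe under `or` or `not`, where dropping a branch can narrow the match instead of widening it.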
