rdblue commented on issue #351: Provide an API to modify records within files
URL: https://github.com/apache/incubator-iceberg/pull/351#issuecomment-519167629
 
 
   > we need to track the base snapshot id when the data was read and it should 
not change during retries
   
   Do we need to add this to the constructor? I think it should be possible to 
start an overwrite, then scan the same table state, where both operations use 
the current snapshot ID. If we want to be explicit about the snapshot that is 
read, we could alternatively add a method to set it.
   
   > we will need some sort of overwriteFiles(deletedFiles, addedFiles) or 
deleteFile(file)
   
   I think it makes sense to add `deleteFile`. This is already supported in the 
implementation, `MergingSnapshotProducer`, which is used for almost all 
operations.
   
   > in certain cases, we cannot convert all filters in query engines into 
equivalent Iceberg filters
   
   In these cases, the validation must be that the current snapshot ID has not 
changed, right? If so, we just need a way to state that requirement, with 
something like `validateNoConflictingAppends()` -- if there is no filter, then 
any write to the table would conflict.
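
   A minimal sketch of the "snapshot ID has not changed" check, for 
illustration only. `SnapshotCheck`, `TableState`, and `validateNoChanges` are 
hypothetical names and are not part of Iceberg's API; the only idea taken from 
the discussion is comparing the snapshot ID that was read against the current 
one at commit time.

```java
import java.util.Objects;

final class SnapshotCheck {
  /** Hypothetical stand-in for table metadata; not an Iceberg class. */
  static final class TableState {
    private Long currentSnapshotId; // null means the table has no snapshot yet

    TableState(Long currentSnapshotId) {
      this.currentSnapshotId = currentSnapshotId;
    }

    Long currentSnapshotId() {
      return currentSnapshotId;
    }

    void commit(long newSnapshotId) {
      this.currentSnapshotId = newSnapshotId;
    }
  }

  /**
   * With no filter available, any concurrent write is treated as a conflict:
   * the commit is only valid if the table is still at the snapshot that was read.
   */
  static void validateNoChanges(Long baseSnapshotId, TableState table) {
    if (!Objects.equals(baseSnapshotId, table.currentSnapshotId())) {
      throw new IllegalStateException(
          "Table changed since snapshot " + baseSnapshotId
              + "; current snapshot is " + table.currentSnapshotId());
    }
  }
}
```

   The base snapshot ID is captured once, before the read, and passed in 
unchanged on every retry, so a retry cannot silently pick up a newer table 
state.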
   
   An alternative is to use a more general filter. We can safely drop any 
predicate joined by `and` if it can't be converted, because removing a conjunct 
only widens the filter. For example, if the filter is `date(ts) = '2019-06-01' 
AND hour(ts) = 10`, the second predicate can't be converted (there is no 
hour-of-day transform). But we can make sure that none of the data for the 
entire day has changed instead of just hour 10.
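
   The conjunct-dropping idea can be sketched as follows. This is an 
illustration under stated assumptions, not Iceberg code: predicates are plain 
strings, `ConservativeFilter`, `toTableFilter`, and `relax` are hypothetical 
names, and "convertible" is faked by a prefix check. It applies only to 
top-level conjuncts of an AND, not to predicates nested under a negation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class ConservativeFilter {
  /**
   * Hypothetical converter: pretends that only date(...) predicates have an
   * equivalent table-level filter. A real converter would inspect the expression.
   */
  static Optional<String> toTableFilter(String enginePredicate) {
    return enginePredicate.startsWith("date(")
        ? Optional.of(enginePredicate)
        : Optional.empty();
  }

  /**
   * Keeps the top-level AND conjuncts that convert and drops the rest.
   * Dropping a conjunct can only match more rows, so the result is a superset
   * of the original filter. An empty result means "no filter", i.e. the whole table.
   */
  static List<String> relax(List<String> conjuncts) {
    List<String> kept = new ArrayList<>();
    for (String conjunct : conjuncts) {
      toTableFilter(conjunct).ifPresent(kept::add);
    }
    return kept;
  }
}
```

   Because the relaxed filter covers more rows than the original, a validation 
that uses it may reject some commits that did not truly conflict, but it will 
not miss a real conflict.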
   
   > we cannot use the expression passed to overwriteByRowFilter for conflict 
resolution . . . validateNoConflictingAppends will probably need to accept a 
row filter as well.
   
   Agreed. Passing a filter here would work well.
   
   > validateAddedFiles that ensures that all files added by OverwriteFiles 
match the predicate passed in overwriteByRowFilter
   
   We can update docs and the name as well. Maybe it should be 
`validateAddedFilesMatchRowFilter`. In addition, I think we might want to add 
preconditions that check to make sure this method isn't used with other 
configuration methods that may conflict.
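
   One possible shape for such a precondition, as a sketch only. The class 
`OverwriteSketch` and the `checkConfiguration` method are hypothetical, and the 
specific rule shown (the added-files validation requires a row filter) is an 
assumption chosen for illustration; the discussion does not say which 
combinations should be rejected.

```java
final class OverwriteSketch {
  private String rowFilter;           // set by overwriteByRowFilter, null if unused
  private boolean validateAddedFiles; // set by validateAddedFilesMatchRowFilter

  OverwriteSketch overwriteByRowFilter(String filter) {
    this.rowFilter = filter;
    return this;
  }

  OverwriteSketch validateAddedFilesMatchRowFilter() {
    this.validateAddedFiles = true;
    return this;
  }

  /**
   * Checked once before committing, so the order of the configuration calls
   * does not matter. Assumed rule: validating added files against the row
   * filter makes no sense when no row filter was configured.
   */
  void checkConfiguration() {
    if (validateAddedFiles && rowFilter == null) {
      throw new IllegalStateException(
          "validateAddedFilesMatchRowFilter requires overwriteByRowFilter");
    }
  }
}
```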
