aokolnychyi commented on issue #351: Provide an API to modify records within 
files
URL: https://github.com/apache/incubator-iceberg/pull/351#issuecomment-519237696
 
 
   > Do we need to add this to the constructor? I think it should be possible 
to start an overwrite, then scan the same table state, where both operations 
use the current snapshot ID. 
   
   Could you elaborate a bit? I am not sure I got the use case. In general, 
knowing the snapshot that was read is essential to guarantee serializable 
isolation in case of updates/deletes. It also has to stay the same during 
retries as we have to analyze every operation that happened after we read the 
data. The main motivation for accepting it in the constructor is to spare the 
user from having to set one more parameter correctly. Without the correct base snapshot, 
`validateNoConflictingAppends` is useless. I have a test for such a use case 
[here](https://github.com/apache/incubator-iceberg/pull/351/files#diff-96ff93512d7be21e69f35e2bc96f03e9R525).
 For existing use cases with `overwriteByRowFilter`, it should not have any 
impact and is also not visible to the user.
   
   > In these cases, the validation must be that the current snapshot ID has 
not changed, right?
   
   Not necessarily, I guess. If there is a delete that removed other files 
(i.e. not the ones that we are rewriting), it should be OK to commit the 
update. So, I would probably rely on a more generic row filter being set (we do 
this internally now).
   
   Based on the discussion, the API can look like:
   
   ```java
   OverwriteFiles overwriteByRowFilter(Expression expr);
   OverwriteFiles addFile(DataFile file);
   OverwriteFiles deleteFile(DataFile file);
   OverwriteFiles validateAddedFilesMatchRowFilter();
   OverwriteFiles validateNoConflictingAppends(Expression expr);
   ```
   
   The expr for `validateNoConflictingAppends` can default to 
`false` to match the current behaviour. As of now, if we are rewriting via 
`overwriteByRowFilter`, `OverwriteFiles` will successfully retry and delete 
matching files added in a concurrent transaction.
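
To make the role of the base snapshot concrete, here is a minimal sketch of the kind of check `validateNoConflictingAppends` would need. It is a toy model, not Iceberg code: `ConflictCheckSketch`, `Snapshot`, `hasConflictingAppends`, and the use of partition strings in place of an `Expression` are all hypothetical names and simplifications made up for illustration. It walks every snapshot committed after the one that was read and reports a conflict if any appended file matches the filter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Toy model only: none of these types are Iceberg classes.
class ConflictCheckSketch {

    // Hypothetical stand-in for a committed snapshot: its ID plus the
    // partition values of the data files it appended.
    static final class Snapshot {
        final long id;
        final List<String> appendedPartitions;

        Snapshot(long id, String... appendedPartitions) {
            this.id = id;
            this.appendedPartitions = Arrays.asList(appendedPartitions);
        }
    }

    // Returns true if any snapshot committed after baseSnapshotId appended a
    // file matching conflictFilter. history is ordered oldest to newest.
    static boolean hasConflictingAppends(
            List<Snapshot> history, long baseSnapshotId, Predicate<String> conflictFilter) {
        int baseIndex = -1;
        for (int i = 0; i < history.size(); i++) {
            if (history.get(i).id == baseSnapshotId) {
                baseIndex = i;
                break;
            }
        }
        // Without a known base snapshot the check cannot say anything useful.
        if (baseIndex < 0) {
            throw new IllegalArgumentException("Unknown base snapshot: " + baseSnapshotId);
        }
        for (Snapshot s : history.subList(baseIndex + 1, history.size())) {
            if (s.appendedPartitions.stream().anyMatch(conflictFilter)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Snapshot> history = Arrays.asList(
            new Snapshot(1L, "day=1"),
            new Snapshot(2L, "day=2"),   // appended after snapshot 1 was read
            new Snapshot(3L, "day=3"));

        // Read at snapshot 1, rewriting day=2: the later append conflicts.
        System.out.println(hasConflictingAppends(history, 1L, p -> p.equals("day=2")));
        // Read at snapshot 2: the day=2 append is already part of what was read.
        System.out.println(hasConflictingAppends(history, 2L, p -> p.equals("day=2")));
        // A filter that is always false never reports a conflict.
        System.out.println(hasConflictingAppends(history, 1L, p -> false));
    }
}
```

In this model the same filter gives different answers depending on the base snapshot, which is why the base has to stay fixed across retries. The always-false filter never fails the commit, which is how I read the proposed default.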

