aokolnychyi edited a comment on issue #351: Provide an API to modify records 
within files
URL: https://github.com/apache/incubator-iceberg/pull/351#issuecomment-519054168
 
 
   I see the benefit of reusing `OverwriteFiles`, so let's think it through.
   
   We have the following methods in `OverwriteFiles`:
   ```
   OverwriteFiles overwriteByRowFilter(Expression expr);
   OverwriteFiles addFile(DataFile file);
   OverwriteFiles validateAddedFiles();
   ```
   
   First of all, we need to track the base snapshot id when the data was read 
and it should not change during retries. That's the reason why we accept the 
base snapshot id in the constructor. We collect all added files since that 
snapshot. So, we will probably need to do the same for `OverwriteFiles`.
   
   Second, we cannot mark deleted files using `overwriteByRowFilter` as that 
would require all rows to match the predicate and it would also require proper 
column stats when we have predicates on non-partition columns. So, it seems we 
will need some sort of `overwriteFiles(deletedFiles, addedFiles)` or 
`deleteFile`.
   
   Third, we cannot use the expression passed to `overwriteByRowFilter` for 
conflict resolution because of the aforementioned differences. Also, in certain 
cases, we cannot convert all filters in query engines into equivalent Iceberg 
filters. Internally, we prohibit concurrent appends in such cases and pass a 
set of files to delete. It seems we need a separate expression for validating 
concurrent appends. So, `allowConflictingAppends` or 
`validateNoConflictingAppends` will probably need to accept a row filter as 
well. By default, we can set that expression to either true or false depending 
on what behaviour we want.
   
   Fourth, we already have `validateAddedFiles` that ensures that all files 
added by `OverwriteFiles` match the predicate passed in `overwriteByRowFilter`. 
Calling `validateAddedFiles` will only make sense if we use 
`overwriteByRowFilter`, so we need to document that properly to avoid misuse of 
the API. Then `validateAddedFiles` will operate on files passed to 
`OverwriteFiles` and `allowConflictingAppends` or 
`validateNoConflictingAppends` will operate on all files added since the base 
snapshot.
   
   @rdblue, how do you see the merged API?
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to