dramaticlly commented on PR #5373: URL: https://github.com/apache/iceberg/pull/5373#issuecomment-1201876458
> In the updated logic for the procedure, we will always use the delete func if it's specified. > > If a delete function is not specified and the file io does not support bulk operations we would use the default delete function. Otherwise the bulk delete is used by default. > > What this means is bulk operations will always be used (if the file io supports it) without considering user input. I was discussing with @dramaticlly perhaps we want more control over this and it should be specified in the action (something like useBulkDelete())? > Thanks @amogh-jahagirdar , I was mostly thinking about this along the same line for #5412. In ideal case, we shall try to use Bulk operation if available to us (by inspecting the fileIO of given table, S3FileIO is the only supported one as of now ). On the other side, I think we need to maintain the backward compatibility, if some iceberg user already provide custom `deleteFunc` to the SparkAction in the past, everything shall work as expected. In my change, I added a new public method called `bulkDeleteFunc` to allow overrides if needed, this provides a way for customization even for tables on S3fileIO, or it can be used to disable the Bulk deletion for some troubleshooting and debugging needs. > Also now we are delegating task management to the file IO, which I think makes sense but there's another argument that each procedure should control this since failure handling or retries would depend on the desired behavior for the procedure. What are peoples thoughts here? Personally I think delegate to fileIO for batching make sense to me, add additional parameter like batchSize in procedure call might cause confusion and hard to make it right. The downside I saw is about test hardness, wish for more inputs around it for test between classess. Current test class are all using Hadooptables and pick HiveTable with fileIO support SupportsBulkOperations seem to be overkill for unit tests -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
