RussellSpitzer opened a new issue, #5140:
URL: https://github.com/apache/iceberg/issues/5140
Currently we can only control the concurrency of a RewriteDataFiles action
by setting a static option:
> By default the actions are executed serially, but can be run concurrently
by increasing the value of **max-concurrent-file-group-rewrites**. This
parameter controls the number of actions which will be run simultaneously.
This makes it difficult to set the parameter properly when some partitions
require multiple tasks to complete and others require only a single task. For
example, a file group which ends up writing 10 files will require 10 tasks and
10 Spark cores, while another file group may have only a single output file,
requiring only a single task.
Instead of using a static option, I think we should provide an "auto" option
that allows the action to determine the number of open cores and schedule new
jobs accordingly. This option would basically implement the following logic:
```
while (fileGroupsLeft is not empty) {
  for (group in fileGroupsLeft) {
    if (group.tasks <= coresFree) {
      schedule(group)
    }
  }
  if (unable to schedule any group because all (group.tasks > totalCores)) {
    schedule(largestGroup)
  }
}
```
Basically, we always schedule jobs whenever we have open cores; if all
remaining jobs are too large to fit, we just schedule the largest one.
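To make the intent concrete, here is a minimal Java sketch of that loop. It is purely illustrative: `FileGroup`, `schedule`, and the fixed `coresFree` counter are hypothetical stand-ins, not Iceberg or Spark APIs, and a real implementation would also return cores to the pool as rewrite jobs finish.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the proposed "auto" scheduling loop.
public class AutoScheduleSketch {

  static class FileGroup {
    final String name;
    final int tasks; // Spark tasks (~cores) this rewrite group needs

    FileGroup(String name, int tasks) {
      this.name = name;
      this.tasks = tasks;
    }
  }

  static List<String> scheduled = new ArrayList<>();
  static int coresFree = 8; // assume we can query open executor cores

  static void schedule(FileGroup group) {
    scheduled.add(group.name);
    coresFree -= Math.min(group.tasks, coresFree);
  }

  // Keep scheduling groups that fit in the free cores; if nothing fits,
  // fall back to scheduling the largest remaining group to make progress.
  static void autoSchedule(List<FileGroup> fileGroupsLeft) {
    while (!fileGroupsLeft.isEmpty()) {
      boolean scheduledAny = false;
      for (FileGroup group : new ArrayList<>(fileGroupsLeft)) {
        if (group.tasks <= coresFree) {
          schedule(group);
          fileGroupsLeft.remove(group);
          scheduledAny = true;
        }
      }
      if (!scheduledAny) {
        // Every remaining group exceeds the free cores: pick the largest
        // so oversized groups do not starve.
        FileGroup largest = fileGroupsLeft.stream()
            .max(Comparator.comparingInt(g -> g.tasks))
            .get();
        schedule(largest);
        fileGroupsLeft.remove(largest);
      }
    }
  }
}
```

With 8 free cores and groups needing 10, 3, and 4 tasks, the two small groups are scheduled first and the oversized 10-task group is scheduled via the fallback branch.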
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]