RussellSpitzer opened a new issue, #5140:
URL: https://github.com/apache/iceberg/issues/5140

   Currently we can only control the concurrency of a RewriteDataFiles Action by setting a static option:
   
   > By default the actions are executed serially, but can be run concurrently 
by increasing the value of **max-concurrent-file-group-rewrites**. This 
parameter controls the number of actions which will be run simultaneously.
   
   This makes it difficult to set the parameter properly when some partitions require multiple tasks to complete while others require only a single task. For example, a file group that ends up writing 10 files will require 10 tasks and 10 Spark cores, while another file group may have only a single output file, requiring just one task.
   
   Instead of using a static option, I think we should provide an "auto" option that lets the Action attempt to determine the number of free cores and schedule new jobs accordingly. This option would basically implement the following logic:
   
   ```
   while (fileGroupsLeft is not empty) {
     for (group in fileGroupsLeft) {
       if (group.tasks <= coresFree) {
         schedule(group)
       }
     }
     if (unable to schedule any group because every (group.tasks > totalCores)) {
       schedule(largestGroup)
     }
   }
   ```
   Basically, we always schedule jobs when we have open cores; if all remaining jobs are too large for the cluster, we just schedule the largest one.
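   As a rough illustration of the proposed "auto" mode, here is a minimal Python sketch (names and the round-based simulation are hypothetical, not Iceberg APIs): each round it greedily starts every file group whose task count fits in the free cores, and when every remaining group needs more cores than the cluster has, it runs the largest group by itself.

```python
def plan_rounds(file_groups, total_cores):
    """Greedy scheduling sketch for the proposed "auto" option.

    file_groups: list of per-group task counts (one task == one core).
    Returns the groups started in each scheduling round, for illustration.
    """
    pending = list(file_groups)
    rounds = []
    while pending:
        cores_free = total_cores
        batch = []
        # Try largest groups first; start any group that fits in the free cores.
        for tasks in sorted(pending, reverse=True):
            if tasks <= cores_free:
                batch.append(tasks)
                cores_free -= tasks
        if not batch:
            # Every remaining group exceeds the total core count:
            # fall back to running the largest group alone.
            batch = [max(pending)]
        for tasks in batch:
            pending.remove(tasks)
        rounds.append(batch)
    return rounds


# A 10-task group and two 1-task groups fit in 12 cores together;
# the 4-task group waits for the next round.
print(plan_rounds([10, 1, 1, 4], total_cores=12))
```

   This is only meant to show the shape of the decision logic; a real implementation would react to groups finishing asynchronously rather than in fixed rounds.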

