tanmayrauth commented on issue #16752:
URL: https://github.com/apache/iceberg/issues/16752#issuecomment-4667698726

   Thanks for the detailed write-up, I deep dived into the issue, the core gap 
is real: executeWith(ExecutorService) is unreachable from SQL/PySpark, so 
manifest-list and version-file rewriting is forced sequential.
     
    Before implementing, I'd like to float an alternative mechanism that may 
fit existing conventions better, and get maintainer input on which direction is 
preferred.
   
     The proposed options => map('thread-pool-size', ...) approach would add a 
new string-keyed options() surface (plus two public constants in 
api/RewriteTablePath) to an action that's currently configured purely via typed 
builder setters (stagingLocation, startVersion, executeWith, …). The 
options-map style is really RewriteDataFiles' pattern, which it has because of 
pluggable strategies with many knobs —  whereas this is a single parallelism 
knob.
     
     There's already an established convention for exactly this: 
ExpireSnapshotsProcedure and RemoveOrphanFilesProcedure expose a scalar 
max_concurrent_deletes (IntegerType) and build the pool through  
BaseProcedure.executorService(int, String), which also handles shutdown 
(getExitingExecutorService + closeService()). Following that, the change could 
be procedure-only:
   
   ```
     optionalInParameter("max_concurrent_metadata_rewrites", 
DataTypes.IntegerType);
     ...
     if (maxConcurrent != null) {
       Preconditions.checkArgument(maxConcurrent > 0, 
"max_concurrent_metadata_rewrites should be > 0");
       action.executeWith(executorService(maxConcurrent, "rewrite-table-path"));
     }
   ```
   
   This needs no change to api/, no change to the action, and no manual 
executor shutdown — it just routes a scalar param into the existing executeWith.
   
     Two things I'd want a maintainer call on:
     1. Param name / shape — scalar max_concurrent_metadata_rewrites vs. the 
proposed options map.
     2. Default — the existing procedures default to sequential unless the 
param is set, rather than a hardcoded default like 4. I'd lean toward not 
silently enabling parallelism, but happy to match whatever's preferred. 
     


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to