tanmayrauth commented on issue #16752:
URL: https://github.com/apache/iceberg/issues/16752#issuecomment-4667698726
Thanks for the detailed write-up, I deep dived into the issue, the core gap
is real: executeWith(ExecutorService) is unreachable from SQL/PySpark, so
manifest-list and version-file rewriting is forced sequential.
Before implementing, I'd like to float an alternative mechanism that may
fit existing conventions better, and get maintainer input on which direction is
preferred.
The proposed options => map('thread-pool-size', ...) approach would add a
new string-keyed options() surface (plus two public constants in
api/RewriteTablePath) to an action that's currently configured purely via typed
builder setters (stagingLocation, startVersion, executeWith, …). The
options-map style is really RewriteDataFiles' pattern, which it has because of
pluggable strategies with many knobs — whereas this is a single parallelism
knob.
There's already an established convention for exactly this:
ExpireSnapshotsProcedure and RemoveOrphanFilesProcedure expose a scalar
max_concurrent_deletes (IntegerType) and build the pool through
BaseProcedure.executorService(int, String), which also handles shutdown
(getExitingExecutorService + closeService()). Following that, the change could
be procedure-only:
```
optionalInParameter("max_concurrent_metadata_rewrites",
DataTypes.IntegerType);
...
if (maxConcurrent != null) {
Preconditions.checkArgument(maxConcurrent > 0,
"max_concurrent_metadata_rewrites should be > 0");
action.executeWith(executorService(maxConcurrent, "rewrite-table-path"));
}
```
This needs no change to api/, no change to the action, and no manual
executor shutdown — it just routes a scalar param into the existing executeWith.
Two things I'd want a maintainer call on:
1. Param name / shape — scalar max_concurrent_metadata_rewrites vs. the
proposed options map.
2. Default — the existing procedures default to sequential unless the
param is set, rather than a hardcoded default like 4. I'd lean toward not
silently enabling parallelism, but happy to match whatever's preferred.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]