Hi everyone,

I've been looking at the Iceberg Actions [1] and noticed many of them don't
fundamentally require a distributed engine.

Apart from RewriteDataFiles, most of the maintenance tasks are
computationally lightweight. Some of them could probably run faster and
with fewer resources locally, backed by a thread pool.

I wonder whether Iceberg could benefit from a local implementation of
ActionsProvider [2]. Many of the building blocks for this are already
available in core.

Granted, there are scalability limitations for large tables. Also, it's
often more convenient to use existing (distributed) compute infrastructure.
Yet, there are use cases where distributed computing isn't strictly
required. For example:

  - CLI tooling
  - CI/CD pipelines and automation scripts
  - REST catalog backends which want to run maintenance internally
  - Small tables in general
  - Environments where Flink/Spark are not available

I'm curious to hear your thoughts.

Cheers,
Max

[1]
https://github.com/apache/iceberg/tree/501824f0c0032b3225b0fe52b904756f0fe5c589/api/src/main/java/org/apache/iceberg/actions
[2]
https://github.com/apache/iceberg/blob/501824f0c0032b3225b0fe52b904756f0fe5c589/api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java#L24
