> My question is: can't we use a Spark (https://hub.docker.com/r/apache/spark) container and pass in Iceberg using `--packages`?
Great question! I think it's worth looking at what a prebuilt image provides over the vanilla Spark image. Take the spark-iceberg image that pyiceberg builds locally [1] as an example. It starts from the base Spark image, downloads the relevant JARs (iceberg-spark-runtime and iceberg-aws-bundle), and copies in a spark-defaults.conf that configures the catalog and object storage. It expects the REST catalog (IRC) to already be running on port 8181 and object storage on port 9000. Once running, the image provides an environment capable of executing Spark SQL commands.

Note that the image alone does not constitute a "quickstart" environment. A quickstart requires a Spark-Iceberg environment (this image), an IRC (iceberg-rest-fixture), and an object store, all working together. This is what the Spark quickstart spins up today [2].

Given all of the above, I propose we publish an image that comes pre-installed with the runtime JARs and is configurable via environment variables for both the IRC and object store settings. This would let users stand up a Spark-Iceberg runtime and connect it to their own catalog and object storage. For subprojects, we can use this image in a Docker Compose definition alongside iceberg-rest-fixture and MinIO; it directly replaces the locally built image [3]. That would let us stop maintaining separate Dockerfiles in pyiceberg and iceberg-rust [4][5] and centralize Docker image management and improvements. (And I can finally stop copying Dockerfile improvements from one repo to the other.)

We can do this for both Spark and Flink, the two engines whose integrations live in the Iceberg repo. For other engines, such as Trino and Hive, we can guide their respective communities to create similar images if they choose to.

To make this concrete, here are two rough sketches.
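First, roughly what Fokko's suggestion (quoted above) of the vanilla apache/spark image plus `--packages` looks like when pointed at a REST catalog and MinIO. Treat this as an untested sketch: the image tag, Iceberg version, network name, catalog name (`demo`), hostnames (`rest`, `minio`), and credentials are all illustrative.

    # Assumes a REST catalog on rest:8181 and MinIO on minio:9000, both reachable on iceberg_net.
    docker run --rm -it --network iceberg_net \
      -e AWS_ACCESS_KEY_ID=admin -e AWS_SECRET_ACCESS_KEY=password -e AWS_REGION=us-east-1 \
      apache/spark:3.5.1 \
      /opt/spark/bin/spark-sql \
      --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.1,org.apache.iceberg:iceberg-aws-bundle:1.9.1 \
      --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
      --conf spark.sql.catalog.demo=org.apache.iceberg.spark.SparkCatalog \
      --conf spark.sql.catalog.demo.type=rest \
      --conf spark.sql.catalog.demo.uri=http://rest:8181 \
      --conf spark.sql.catalog.demo.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
      --conf spark.sql.catalog.demo.warehouse=s3://warehouse/ \
      --conf spark.sql.catalog.demo.s3.endpoint=http://minio:9000 \
      --conf spark.sql.catalog.demo.s3.path-style-access=true

This can work, but every container start re-resolves the JARs from Maven Central, and all of the catalog and object-store wiring has to be passed by hand (or mounted as a spark-defaults.conf); that wiring is exactly what the prebuilt image bakes in.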
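Second, how the proposed image could slot into a compose file for subproject integration tests. To be clear, the apache/iceberg-spark name, its tag, and the environment variables on the spark-iceberg service are hypothetical; settling that configuration surface would be part of this proposal. Only apache/iceberg-rest-fixture and minio/minio exist today, and their own configuration is trimmed for brevity.

    services:
      spark-iceberg:
        image: apache/iceberg-spark:1.10.0-spark-3.5   # hypothetical image name and tag
        environment:
          # hypothetical variables the image could template into spark-defaults.conf
          - CATALOG_URI=http://rest:8181
          - CATALOG_WAREHOUSE=s3://warehouse/
          - S3_ENDPOINT=http://minio:9000
          - AWS_ACCESS_KEY_ID=admin
          - AWS_SECRET_ACCESS_KEY=password
          - AWS_REGION=us-east-1
        depends_on:
          - rest
          - minio
      rest:
        image: apache/iceberg-rest-fixture   # already published; warehouse/MinIO settings omitted here
        ports:
          - 8181:8181
      minio:
        image: minio/minio
        environment:
          - MINIO_ROOT_USER=admin
          - MINIO_ROOT_PASSWORD=password
        command: server /data --console-address ":9001"
        ports:
          - 9000:9000

This is roughly the shape of what the subprojects spin up today [3], with the locally built Dockerfile swapped for a published image.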
Best,
Kevin Liu

[1] https://github.com/apache/iceberg-python/blob/main/dev/spark/Dockerfile
[2] https://iceberg.apache.org/spark-quickstart/#docker-compose
[3] https://github.com/apache/iceberg-python/blob/95f6273b23524c6238aafb57fa06e693ef83d6ef/dev/docker-compose-integration.yml#L19-L35
[4] https://github.com/apache/iceberg-python/blob/main/dev/spark/Dockerfile
[5] https://github.com/apache/iceberg-rust/blob/main/dev/spark/Dockerfile

On Fri, Feb 20, 2026 at 5:17 PM Matt Topol <[email protected]> wrote:

> I just wanted to chime in that I would love published images to use for integration testing, instead of all the individual configs and scripts that exist in iceberg-go. The configuration we have currently attempts to replicate the config in iceberg-python for integration tests with Spark, but keeping it in sync is difficult.
>
> Having shared Docker images for integration testing would be a great first step towards a more formal integration testing suite.
>
> --Matt
>
> On Fri, Feb 20, 2026, 2:28 PM Fokko Driesprong <[email protected]> wrote:
>
>> Thanks for bringing this up Kevin, and I share Sung's perspective.
>>
>> I've been maintaining the tabulario/spark-iceberg image for quite a while, and it is very time-consuming. I'm not saying that we should not do it, but when we commit to it, it should work well, since these quickstarts are the gateway for new users.
>>
>> For example, we had Papermill (https://papermill.readthedocs.io/en/latest/usage-execute.html) running in the CI to ensure that the notebooks are working.
>>
>> > Since we already publish a Spark docker image for PyIceberg project, it makes more sense to publish it from the Iceberg repository instead of the PyIceberg repository.
>>
>> We actually don't publish that Docker image; we rely on a locally built image that's part of the repository. We only have the REST fixtures out there: https://hub.docker.com/u/apache?page=1&search=iceberg, which I really like because of the nightly build. This way, we can start implementing remote scan planning on the PyIceberg side.
>>
>> > (1) giving users a quick, out-of-the-box way to get started with Iceberg on a given engine
>>
>> My question is: can't we use a Spark (https://hub.docker.com/r/apache/spark) container and pass in Iceberg using `--packages`?
>>
>> > (2) providing subprojects (like iceberg-python and iceberg-rust) a shared, canonical image to depend on for integration testing; eliminating the duplicated maintenance we have today.
>>
>> I fully agree here! There is a LOT of duplication across the project, and it would be nice to consolidate it into one place. Having an image with all the common examples would help: a table with positional deletes, a table with multiple positional deletes, a table with a DV, etc.
>>
>> Kind regards,
>> Fokko
>>
>> On 2026/02/20 17:13:13 Steven Wu wrote:
>> > I had the same thought as Peter. For the connectors living in the Iceberg repo (Flink, Kafka Connect, Spark), Iceberg should publish the Docker images if we agree that the benefit outweighs the overhead.
>> >
>> > Since we already publish a Spark docker image for PyIceberg project, it makes more sense to publish it from the Iceberg repository instead of the PyIceberg repository.
>> >
>> > On Fri, Feb 20, 2026 at 7:19 AM Kevin Liu <[email protected]> wrote:
>> >
>> > > > Given this, my suggestion is that Iceberg should publish the quickstart Docker images for integrations we own, like Spark and Flink. For integrations where we don't own the code, such as Trino and Hive, the respective projects should continue to publish their own images.
>> > >
>> > > +1, this pretty much summarizes my thoughts. And I think it also aligns with what Sung mentioned above. Publishing Iceberg Spark and Flink Docker images is a great outcome IMO. And of course, we have to ensure compliance with ASF policies. :)
>> > >
>> > > The two main use cases I see for expanding our published images are:
>> > > (1) giving users a quick, out-of-the-box way to get started with Iceberg on a given engine, and
>> > > (2) providing subprojects (like iceberg-python and iceberg-rust) a shared, canonical image to depend on for integration testing, eliminating the duplicated maintenance we have today.
>> > >
>> > > Would love to hear what others think.
>> > >
>> > > Best,
>> > > Kevin Liu
>> > >
>> > > On Fri, Feb 20, 2026 at 1:04 AM Péter Váry <[email protected]> wrote:
>> > >
>> > >> One important aspect to consider is where the integration code actually lives. Both the Spark and Flink integrations are maintained directly in the Iceberg repository, which means the Iceberg community is responsible for keeping these connectors working. If we moved the Docker image creation into the Spark or Flink projects, we would introduce a circular dependency that would make release coordination much more complicated.
>> > >>
>> > >> For example, imagine Spark releases version 4.2. At that point, no Iceberg integration exists yet. Once we update Iceberg, the support for Spark 4.2 would land in an Iceberg release; let's say Iceberg 1.12.0. At that point, we can publish the iceberg-1.12.0-spark-4.2-quickstart image, aligned with our release cycle. But if the Spark project were responsible for publishing the image, they would need a separate, additional release cycle just for the Docker image, which doesn't fit naturally into their workflow.
>> > >>
>> > >> Given this, my suggestion is that Iceberg should publish the quickstart Docker images for integrations we own, like Spark and Flink. For integrations where we don't own the code, such as Trino and Hive, the respective projects should continue to publish their own images.
>> > >>
>> > >> Sung Yun <[email protected]> wrote (Fri, Feb 20, 2026, 3:29):
>> > >>
>> > >>> Hi Kevin, thanks for raising this.
>> > >>>
>> > >>> I agree this discussion is warranted. In the previous thread [1] we largely deferred making a decision on whether the Iceberg community should publish Docker images beyond the REST TCK image, so I think it makes sense to revisit it now.
>> > >>>
>> > >>> While it's tempting to help out the community in every possible way, I think it's important to stay focused on what the project/subprojects are best positioned to own. In a way, I'm concerned that publishing engine-specific Iceberg images as supported artifacts could create a long-term maintenance burden, since we don't maintain those engines ourselves.
>> > >>>
>> > >>> From my perspective, the key question is what criteria we should use when deciding whether to publish a Docker image, and I think the clearest line is whether it supports Iceberg subprojects (or other OSS projects) in testing their integration with Iceberg, where we can reasonably expect to support it to a high standard.
>> > >>>
>> > >>> I'm curious to hear others' thoughts on this topic.
>> > >>>
>> > >>> Cheers,
>> > >>> Sung
>> > >>>
>> > >>> [1] https://lists.apache.org/thread/xl1cwq7vmnh6zgfd2vck2nq7dfd33ncq
>> > >>>
>> > >>> On 2026/02/19 21:06:56 Kevin Liu wrote:
>> > >>> > Hi everyone,
>> > >>> >
>> > >>> > I want to continue the discussion on which Docker images the Iceberg project should publish. This has come up several times [1][2][3][4], and I'd like to continue the discussion here.
>> > >>> >
>> > >>> > So far, the main outcome has been the publication of apache/iceberg-rest-fixture [5] (100K+ downloads), following a consensus [2] to limit community-maintained images to the REST fixture and rely on upstream engine projects for quickstarts. A separate thread and issue [3][6] proposed replacing the tabulario/spark-iceberg quickstart image with the official apache/spark image. Most recently, a proposal to add a Flink quickstart image [4] has reopened the broader question.
>> > >>> >
>> > >>> > One concrete case for expanding scope: both iceberg-python and iceberg-rust currently maintain their own Spark+Iceberg Docker images for integration testing, and we already try to keep them in sync manually [7][8]. This is exactly the kind of duplication that centralizing under the main iceberg repo would solve, just as we did with apache/iceberg-rest-fixture. Publishing a shared apache/iceberg-spark image would give all subprojects a single, well-maintained image to depend on, and reduce the maintenance burden across the ecosystem. We can do the same for the Flink+Iceberg setup.
>> > >>> >
>> > >>> > Given the traction the REST fixture image has seen, I think it's worth revisiting the scope of what we publish. I'd love to hear updated views from the community.
>> > >>> >
>> > >>> > Thanks,
>> > >>> > Kevin Liu
>> > >>> >
>> > >>> > [1] https://lists.apache.org/thread/dr6nsvd8jm2gr2nn5vf7nkpr0pc5d03h
>> > >>> > [2] https://lists.apache.org/thread/xl1cwq7vmnh6zgfd2vck2nq7dfd33ncq
>> > >>> > [3] https://lists.apache.org/thread/4kknk8mvnffbmhdt63z8t4ps0mt1jbf4
>> > >>> > [4] https://lists.apache.org/thread/grlgvl9fslcxrsnxyb7qqh7vjd4kkqo3
>> > >>> > [5] https://hub.docker.com/r/apache/iceberg-rest-fixture
>> > >>> > [6] https://github.com/apache/iceberg/issues/13519
>> > >>> > [7] https://github.com/apache/iceberg-python/tree/main/dev/spark
>> > >>> > [8] https://github.com/apache/iceberg-rust/tree/main/dev/spark
