I just wanted to chime in that I would love published images to use for
integration testing instead of all the individual configs and scripts that
exist in iceberg-go. The configuration we have today attempts to replicate
the config iceberg-python uses for its Spark integration tests, but keeping
the two in sync is difficult.

Having shared docker images for integration testing would be a great first
step towards a more formal integration testing suite.

--Matt

On Fri, Feb 20, 2026, 2:28 PM Fokko Driesprong <[email protected]> wrote:

> Thanks for bringing this up, Kevin, and I share Sung's perspective.
>
> I've been maintaining the tabulario/spark-iceberg image for quite a while,
> and it is very time-consuming. I'm not saying that we shouldn't do it, but
> if we commit to it, it should work well, since these quickstarts are the
> gateway for new users.
>
> For example, we had Papermill
> (https://papermill.readthedocs.io/en/latest/usage-execute.html) running in
> CI to ensure that the notebooks keep working.
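>
> The check is essentially a headless run of each notebook, something like
> (the notebook path here is illustrative):
>
>     papermill notebooks/Iceberg-quickstart.ipynb /tmp/out.ipynb
>
> which fails the CI job if any cell raises an error.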
>
> > Since we already publish a Spark docker image for the PyIceberg project,
> > it makes more sense to publish it from the Iceberg repository instead of
> > the PyIceberg repository.
>
> We actually don't push the Docker image; we rely on a locally built image
> that's part of the repository. We only have the REST fixture out there:
> https://hub.docker.com/u/apache?page=1&search=iceberg, which I really
> like because of the nightly builds. This way, we can start implementing
> remote scan planning on the PyIceberg side.
>
> > (1) giving users a quick, out-of-the-box way to get started with Iceberg
> > on a given engine
>
> My question is: can't we use a Spark container
> (https://hub.docker.com/r/apache/spark) and pass in Iceberg using
> `--packages`?
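>
> Something like the following seems plausible (an untested sketch; the
> image tag and Iceberg runtime version are placeholders):
>
>     # stock Spark image + Iceberg pulled from Maven at startup
>     docker run -it apache/spark:4.0.0 /opt/spark/bin/spark-sql \
>       --packages org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0 \
>       --conf spark.jars.ivy=/tmp/.ivy2 \
>       --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
>       --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
>       --conf spark.sql.catalog.local.type=hadoop \
>       --conf spark.sql.catalog.local.warehouse=/tmp/warehouse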
>
> > (2) providing subprojects (like iceberg-python and iceberg-rust) a
> > shared, canonical image to depend on for integration testing, eliminating
> > the duplicated maintenance we have today.
>
> I fully agree here! There is a LOT of duplication across the project, and
> it would be nice to consolidate it into one place. An image with all the
> common examples would go a long way: a table with positional deletes, a
> table with multiple positional delete files, a table with a DV, etc.
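>
> As a rough sketch of what generating those fixtures could look like
> (table names are illustrative, and this assumes merge-on-read deletes so
> that DELETE produces delete files instead of rewriting data):
>
>     spark-sql -e "
>       CREATE TABLE local.db.pos_deletes (id INT, data STRING) USING iceberg
>       TBLPROPERTIES ('format-version'='2', 'write.delete.mode'='merge-on-read');
>       INSERT INTO local.db.pos_deletes VALUES (1, 'a'), (2, 'b'), (3, 'c');
>       DELETE FROM local.db.pos_deletes WHERE id = 2;  -- positional delete file
>
>       CREATE TABLE local.db.dv_table (id INT, data STRING) USING iceberg
>       TBLPROPERTIES ('format-version'='3', 'write.delete.mode'='merge-on-read');
>       INSERT INTO local.db.dv_table VALUES (1, 'a'), (2, 'b');
>       DELETE FROM local.db.dv_table WHERE id = 1;     -- deletion vector (DV)
>     "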
>
> Kind regards,
> Fokko
>
> On 2026/02/20 17:13:13 Steven Wu wrote:
> > I had the same thought as Peter. For the connectors living in the
> > Iceberg repo (Flink, Kafka Connect, Spark), Iceberg should publish the
> > Docker images if we agree that the benefit outweighs the overhead.
> >
> > Since we already publish a Spark docker image for the PyIceberg project,
> > it makes more sense to publish it from the Iceberg repository instead of
> > the PyIceberg repository.
> >
> > On Fri, Feb 20, 2026 at 7:19 AM Kevin Liu <[email protected]> wrote:
> >
> > > > Given this, my suggestion is that Iceberg should publish the
> > > > quickstart Docker images for integrations we own, like Spark and
> > > > Flink. For integrations where we don’t own the code, such as Trino
> > > > and Hive, the respective projects should continue to publish their
> > > > own images.
> > >
> > > +1, this pretty much summarizes my thoughts. And I think it also aligns
> > > with what Sung mentioned above.
> > > Publishing Iceberg Spark and Flink docker images is a great outcome IMO.
> > > And of course, we have to ensure compliance with ASF policies. :)
> > >
> > > The two main use cases I see for expanding our published images are:
> > > (1) giving users a quick, out-of-the-box way to get started with
> > > Iceberg on a given engine, and
> > > (2) providing subprojects (like iceberg-python and iceberg-rust) a
> > > shared, canonical image to depend on for integration testing,
> > > eliminating the duplicated maintenance we have today.
> > >
> > > Would love to hear what others think.
> > >
> > > Best,
> > > Kevin Liu
> > >
> > >
> > > On Fri, Feb 20, 2026 at 1:04 AM Péter Váry
> > > <[email protected]> wrote:
> > >
> > >> One important aspect to consider is where the integration code
> > >> actually lives. Both the Spark and Flink integrations are maintained
> > >> directly in the Iceberg repository, which means the Iceberg community
> > >> is responsible for keeping these connectors working. If we moved the
> > >> Docker image creation into the Spark or Flink projects, we would
> > >> introduce a circular dependency that would make release coordination
> > >> much more complicated.
> > >>
> > >> For example, imagine Spark releases version 4.2. At that point, no
> > >> Iceberg integration exists yet. Once we update Iceberg, the support
> > >> for Spark 4.2 would land in an Iceberg release, let's say Iceberg
> > >> 1.12.0. At that point, we can publish the
> > >> iceberg-1.12.0-spark-4.2-quickstart image, aligned with our release
> > >> cycle. But if the Spark project were responsible for publishing the
> > >> image, they would need a separate, additional release cycle just for
> > >> the Docker image, which doesn't fit naturally into their workflow.
> > >>
> > >> Given this, my suggestion is that Iceberg should publish the
> > >> quickstart Docker images for integrations we own, like Spark and
> > >> Flink. For integrations where we don’t own the code, such as Trino
> > >> and Hive, the respective projects should continue to publish their
> > >> own images.
> > >>
> > >> On Fri, Feb 20, 2026 at 3:29 AM Sung Yun <[email protected]> wrote:
> > >>
> > >>> Hi Kevin, thanks for raising this.
> > >>>
> > >>> I agree this discussion is warranted. In the previous thread [1] we
> > >>> largely deferred making a decision on whether the Iceberg community
> > >>> should publish Docker images beyond the REST TCK image, so I think it
> > >>> makes sense to revisit it now.
> > >>>
> > >>> While it's tempting to help out the community in every possible way,
> > >>> I think it's important to stay focused on what the project and its
> > >>> subprojects are best positioned to own. In a way, I'm concerned that
> > >>> publishing engine-specific Iceberg images as supported artifacts
> > >>> could create a long-term maintenance burden, since we don't maintain
> > >>> those engines ourselves.
> > >>>
> > >>> From my perspective, the key question is what criteria we should use
> > >>> when deciding whether to publish a Docker image, and I think the
> > >>> clearest line is whether the image supports Iceberg subprojects (or
> > >>> other OSS projects) in testing their integration with Iceberg, where
> > >>> we can reasonably expect to support it to a high standard.
> > >>>
> > >>> I'm curious to hear others' thoughts on this topic.
> > >>>
> > >>> Cheers,
> > >>> Sung
> > >>>
> > >>> [1] https://lists.apache.org/thread/xl1cwq7vmnh6zgfd2vck2nq7dfd33ncq
> > >>>
> > >>> On 2026/02/19 21:06:56 Kevin Liu wrote:
> > >>> > Hi everyone,
> > >>> >
> > >>> > I want to continue the discussion on which Docker images the
> > >>> > Iceberg project should publish. This has come up several times
> > >>> > [1][2][3][4], and I'd like to pick it back up here.
> > >>> >
> > >>> > So far, the main outcome has been the publication of
> > >>> > apache/iceberg-rest-fixture [5] (100K+ downloads), following a
> > >>> > consensus [2] to limit community-maintained images to the REST
> > >>> > fixture and rely on upstream engine projects for quickstarts. A
> > >>> > separate thread and issue [3][6] proposed replacing the
> > >>> > tabulario/spark-iceberg quickstart image with the official
> > >>> > apache/spark image. Most recently, a proposal to add a Flink
> > >>> > quickstart image [4] has reopened the broader question.
> > >>> >
> > >>> > One concrete case for expanding scope: both iceberg-python and
> > >>> > iceberg-rust currently maintain their own Spark+Iceberg Docker
> > >>> > images for integration testing, and we already try to keep them in
> > >>> > sync manually [7][8]. This is exactly the kind of duplication that
> > >>> > centralizing under the main iceberg repo would solve, just as we
> > >>> > did with apache/iceberg-rest-fixture. Publishing a shared
> > >>> > apache/iceberg-spark image would give all subprojects a single,
> > >>> > well-maintained image to depend on, and reduce the maintenance
> > >>> > burden across the ecosystem. We can do the same for the
> > >>> > Flink+Iceberg setup.
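> > >>> >
> > >>> > To make that concrete, a subproject's integration test setup could
> > >>> > then shrink to pulling a single canonical image, for example
> > >>> > (hypothetical name and tag; no such image is published today):
> > >>> >
> > >>> >     # e.g. exposing Spark Connect for the test suite to talk to
> > >>> >     docker run -d --name spark-iceberg -p 15002:15002 \
> > >>> >       apache/iceberg-spark:1.12.0-spark-4.0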
> > >>> >
> > >>> > Given the traction the REST fixture image has seen, I think it's
> > >>> > worth revisiting the scope of what we publish. I'd love to hear
> > >>> > updated views from the community.
> > >>> >
> > >>> > Thanks,
> > >>> > Kevin Liu
> > >>> >
> > >>> > [1] https://lists.apache.org/thread/dr6nsvd8jm2gr2nn5vf7nkpr0pc5d03h
> > >>> > [2] https://lists.apache.org/thread/xl1cwq7vmnh6zgfd2vck2nq7dfd33ncq
> > >>> > [3] https://lists.apache.org/thread/4kknk8mvnffbmhdt63z8t4ps0mt1jbf4
> > >>> > [4] https://lists.apache.org/thread/grlgvl9fslcxrsnxyb7qqh7vjd4kkqo3
> > >>> > [5] https://hub.docker.com/r/apache/iceberg-rest-fixture
> > >>> > [6] https://github.com/apache/iceberg/issues/13519
> > >>> > [7] https://github.com/apache/iceberg-python/tree/main/dev/spark
> > >>> > [8] https://github.com/apache/iceberg-rust/tree/main/dev/spark
> > >>> >
> > >>>
> > >>
> >
>
