I just wanted to chime in that I would love to have published images to use for integration testing instead of all the individual configs and scripts that currently exist in iceberg-go. Our current configuration attempts to replicate the Spark integration-test setup in iceberg-python, but keeping the two in sync is difficult.
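To make that concrete, the kind of test I'd love each subproject to be able to write against a shared, published image is roughly the following. This is only a sketch (shown with PyIceberg for brevity; the Go side would be the same idea), and it assumes an image exposing an Iceberg REST catalog on localhost:8181, similar to apache/iceberg-rest-fixture, is already running; the catalog, namespace, and table names are illustrative.

    # Sketch: assumes a shared image with a REST catalog is already running,
    # e.g. `docker run -p 8181:8181 <shared-image>`.
    from pyiceberg.catalog import load_catalog
    from pyiceberg.schema import Schema
    from pyiceberg.types import LongType, NestedField, StringType

    catalog = load_catalog("integration", **{"type": "rest", "uri": "http://localhost:8181"})

    # The test body is the only project-specific piece left.
    catalog.create_namespace("ci")
    catalog.create_table(
        "ci.events",
        schema=Schema(
            NestedField(1, "id", LongType(), required=True),
            NestedField(2, "payload", StringType(), required=False),
        ),
    )
    assert ("ci", "events") in catalog.list_tables("ci")

With a published image, that one docker run line would be the only setup each subproject needs to duplicate.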
Having shared docker images for integration testing would be a great first step towards a more formal integration testing suite.

--Matt

On Fri, Feb 20, 2026, 2:28 PM Fokko Driesprong <[email protected]> wrote:

> Thanks for bringing this up Kevin; I share Sung's perspective.

> I've been maintaining the tabulario/spark-iceberg image for quite a while, and it is very time consuming. I'm not saying that we should not do it, but when we commit to it, it should work well, since these quickstarts are the gateway for new users.

> For example, we had Papermill (https://papermill.readthedocs.io/en/latest/usage-execute.html) running in CI to ensure that the notebooks keep working.

> > Since we already publish a Spark docker image for the PyIceberg project, it makes more sense to publish it from the Iceberg repository instead of the PyIceberg repository.

> We actually don't push that Docker image; we rely on a locally built image that's part of the repository. The only images we have out there are the REST fixtures: https://hub.docker.com/u/apache?page=1&search=iceberg. I really like those because of the nightly build; it lets us start implementing remote scan planning on the PyIceberg side.

> > (1) giving users a quick, out-of-the-box way to get started with Iceberg on a given engine

> My question is: can't we use a Spark container (https://hub.docker.com/r/apache/spark) and pass in Iceberg using `--packages`?
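> Something along these lines is what I have in mind. This is just a sketch: the runtime coordinates, version, and catalog settings below are illustrative placeholders, and I'm showing PySpark here, but `spark-sql --packages ...` inside the apache/spark container is the same idea.

>     # Sketch only: package version and catalog settings are illustrative.
>     from pyspark.sql import SparkSession
>
>     spark = (
>         SparkSession.builder.appName("iceberg-quickstart")
>         # Equivalent of `--packages` on spark-sql / spark-submit:
>         .config("spark.jars.packages",
>                 "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.0")
>         .config("spark.sql.extensions",
>                 "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
>         .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
>         .config("spark.sql.catalog.local.type", "hadoop")
>         .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
>         .getOrCreate()
>     )
>
>     spark.sql("CREATE TABLE local.db.demo (id BIGINT, data STRING) USING iceberg")
>     spark.sql("INSERT INTO local.db.demo VALUES (1, 'a'), (2, 'b')")
>     spark.sql("SELECT * FROM local.db.demo").show()

> If that is good enough as a quickstart, we may not need a dedicated Spark image at all.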
> > (2) providing subprojects (like iceberg-python and iceberg-rust) a shared, canonical image to depend on for integration testing; eliminating the duplicated maintenance we have today.

> I fully agree here! There is a LOT of duplication across the project, and it would be nice to consolidate it into one place. An image that ships all the common examples would help a lot: a table with positional deletes, a table with multiple positional deletes, a table with a DV, and so on.
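> The positional-delete case, for example, could be produced with something like this (again only a sketch; it assumes a `spark` session wired to an Iceberg catalog named `local` as in the snippet above, and the table name is illustrative):

>     # Merge-on-read deletes make the DELETE below write a positional
>     # delete file instead of rewriting the data file.
>     spark.sql("""
>         CREATE TABLE local.db.fixture_pos_deletes (id BIGINT, data STRING)
>         USING iceberg
>         TBLPROPERTIES (
>             'format-version' = '2',
>             'write.delete.mode' = 'merge-on-read'
>         )
>     """)
>     spark.sql("INSERT INTO local.db.fixture_pos_deletes VALUES (1, 'a'), (2, 'b'), (3, 'c')")
>     spark.sql("DELETE FROM local.db.fixture_pos_deletes WHERE id = 2")

> Baking a handful of fixtures like this into one published image would mean every subproject reads exactly the same tables in its integration tests.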
> Kind regards,
> Fokko

> On 2026/02/20 17:13:13 Steven Wu wrote:

> > I had the same thought as Peter. For the connectors living in the Iceberg repo (Flink, Kafka Connect, Spark), Iceberg should publish the Docker images if we agree that the benefit outweighs the overhead.

> > Since we already publish a Spark docker image for the PyIceberg project, it makes more sense to publish it from the Iceberg repository instead of the PyIceberg repository.

> > On Fri, Feb 20, 2026 at 7:19 AM Kevin Liu <[email protected]> wrote:

> > > > Given this, my suggestion is that Iceberg should publish the quickstart Docker images for integrations we own, like Spark and Flink. For integrations where we don't own the code, such as Trino and Hive, the respective projects should continue to publish their own images.

> > > +1, this pretty much summarizes my thoughts, and I think it also aligns with what Sung mentioned above. Publishing Iceberg Spark and Flink docker images is a great outcome IMO. And of course, we have to ensure compliance with ASF policies. :)

> > > The two main use cases I see for expanding our published images are: (1) giving users a quick, out-of-the-box way to get started with Iceberg on a given engine, and (2) providing subprojects (like iceberg-python and iceberg-rust) a shared, canonical image to depend on for integration testing, eliminating the duplicated maintenance we have today.

> > > Would love to hear what others think.

> > > Best,
> > > Kevin Liu

> > > On Fri, Feb 20, 2026 at 1:04 AM Péter Váry <[email protected]> wrote:

> > > > One important aspect to consider is where the integration code actually lives. Both the Spark and Flink integrations are maintained directly in the Iceberg repository, which means the Iceberg community is responsible for keeping these connectors working. If we moved the Docker image creation into the Spark or Flink projects, we would introduce a circular dependency that would make release coordination much more complicated.

> > > > For example, imagine Spark releases version 4.2. At that point, no Iceberg integration exists yet. Once we update Iceberg, support for Spark 4.2 would land in an Iceberg release, say Iceberg 1.12.0. At that point, we can publish the iceberg-1.12.0-spark-4.2-quickstart image, aligned with our release cycle. But if the Spark project were responsible for publishing the image, they would need a separate, additional release cycle just for the Docker image, which doesn't fit naturally into their workflow.

> > > > Given this, my suggestion is that Iceberg should publish the quickstart Docker images for integrations we own, like Spark and Flink. For integrations where we don't own the code, such as Trino and Hive, the respective projects should continue to publish their own images.

> > > > On Fri, Feb 20, 2026 at 3:29, Sung Yun <[email protected]> wrote:

> > > > > Hi Kevin, thanks for raising this.

> > > > > I agree this discussion is warranted. In the previous thread [1] we largely deferred a decision on whether the Iceberg community should publish Docker images beyond the REST TCK image, so I think it makes sense to revisit it now.

> > > > > While it's tempting to help out the community in every possible way, I think it's important to stay focused on what the project and its subprojects are best positioned to own. In a way, I'm concerned that publishing engine-specific Iceberg images as supported artifacts could create a long-term maintenance burden, since we don't maintain those engines ourselves.

> > > > > From my perspective, the key question is what criteria we should use when deciding whether to publish a Docker image. I think the clearest line is whether an image supports Iceberg subprojects (or other OSS projects) in testing their integration with Iceberg, and whether we can reasonably expect to support it to a high standard.

> > > > > I'm curious to hear others' thoughts on this topic.

> > > > > Cheers,
> > > > > Sung

> > > > > [1] https://lists.apache.org/thread/xl1cwq7vmnh6zgfd2vck2nq7dfd33ncq

> > > > > On 2026/02/19 21:06:56 Kevin Liu wrote:

> > > > > > Hi everyone,

> > > > > > I want to continue the discussion on which Docker images the Iceberg project should publish. This has come up several times [1][2][3][4], and I'd like to pick it up again here.

> > > > > > So far, the main outcome has been the publication of apache/iceberg-rest-fixture [5] (100K+ downloads), following a consensus [2] to limit community-maintained images to the REST fixture and rely on upstream engine projects for quickstarts. A separate thread and issue [3][6] proposed replacing the tabulario/spark-iceberg quickstart image with the official apache/spark image. Most recently, a proposal to add a Flink quickstart image [4] has reopened the broader question.

> > > > > > One concrete case for expanding scope: both iceberg-python and iceberg-rust currently maintain their own Spark+Iceberg Docker images for integration testing, and we already try to keep them in sync manually [7][8]. This is exactly the kind of duplication that centralizing under the main iceberg repo would solve, just as we did with apache/iceberg-rest-fixture. Publishing a shared apache/iceberg-spark image would give all subprojects a single, well-maintained image to depend on, and reduce the maintenance burden across the ecosystem. We can do the same for the Flink+Iceberg setup.

> > > > > > Given the traction the REST fixture image has seen, I think it's worth revisiting the scope of what we publish. I'd love to hear updated views from the community.

> > > > > > Thanks,
> > > > > > Kevin Liu

> > > > > > [1] https://lists.apache.org/thread/dr6nsvd8jm2gr2nn5vf7nkpr0pc5d03h
> > > > > > [2] https://lists.apache.org/thread/xl1cwq7vmnh6zgfd2vck2nq7dfd33ncq
> > > > > > [3] https://lists.apache.org/thread/4kknk8mvnffbmhdt63z8t4ps0mt1jbf4
> > > > > > [4] https://lists.apache.org/thread/grlgvl9fslcxrsnxyb7qqh7vjd4kkqo3
> > > > > > [5] https://hub.docker.com/r/apache/iceberg-rest-fixture
> > > > > > [6] https://github.com/apache/iceberg/issues/13519
> > > > > > [7] https://github.com/apache/iceberg-python/tree/main/dev/spark
> > > > > > [8] https://github.com/apache/iceberg-rust/tree/main/dev/spark
