Thanks, folks, for the pointers on performance testing. Let me discuss this internally and come back with something more concrete. One idea that comes to mind: we are currently using LFS in our Docker images; we could potentially use Apache Ozone there instead. They also publish Docker images, so we might be able to leverage those.

-Ayush

On Tue, 10 Mar 2026 at 12:31, László Bodor <[email protected]> wrote:
>
> Regarding performance benchmarking, we should have a way to test the actual
> upstream code. While many - or all - Hive distributors have their own ways of
> doing this, we as an open-source community don't. The main limitation is the
> testing setup, because our current single-image (HS2) or HS2+HMS Docker setup
> is not suitable for this purpose, even though it works wonderfully for quick
> local testing.
> That's what's currently being addressed in the scope of
> https://issues.apache.org/jira/browse/HIVE-29492.
>
> Regards,
> Laszlo Bodor
>
>
> On Tue, 10 Mar 2026 at 07:38, kokila narayanan <[email protected]>
> wrote:
>>
>>
>> Regarding the performance tracking initiative and Hive-Iceberg workloads,
>> one possible starting point could be leveraging the 1 Trillion Row Challenge
>> (1TRC) style benchmarks.
>>
>> The Impala community has already experimented with something along these
>> lines and they have even extended it to work with Iceberg tables as well:
>> https://github.com/boroknagyz/impala-1trc
>>
>> The main query is relatively simple aggregation query:
>>
>> SELECT station, min(measure), max(measure), avg(measure)
>> FROM measurements_1trc
>> GROUP BY station
>> ORDER BY station;
>>
>> While this benchmark is quite simple and only tests a single type of query,
>> it could still be a good starting point. It does not cover the wider variety
>> of queries we usually see in Hive workloads (like joins, filters, or more
>> complex aggregations), but it is easy to reproduce and run.
>>
>> With this setup, it could help us get an initial idea of how Hive performs
>> on very large Iceberg tables for large-scale scan and aggregation workloads.
>>
>> I have experimented with this dataset for another feature so I can also try
>> running 1BRC/1TRC on Hive and share some initial numbers if that would be
>> useful for the release planning.
>>
>> Thanks,
>>
>> Kokila
>>
>>
>> On Tue, Mar 10, 2026 at 11:43 AM Ayush Saxena <[email protected]> wrote:
>>>
>>> Hadoop 3.5.0 is currently in the RC stage (RC0 is already available). I
>>> think we can reasonably wait for the final 3.5.0 release, and if time and
>>> luck favor us, we could even try giving JDK 25 a shot as well. From a
>>> timeline perspective, I don’t think we are too late yet.
>>>
>>> More broadly, my expectation, or perhaps wish, for the upcoming release would
>>> be to include Hadoop 3.5 + Iceberg V3 + JDK 25 + REST Catalog related
>>> changes. Having these in the release would make it more compelling for
>>> users to upgrade, rather than it feeling like just another bug-fix release
>>> that gives the impression we are in KTLO mode. :-)
>>>
>>> As Attila also mentioned above regarding performance tracking, I would
>>> definitely like to push that initiative as part of this release. We may not
>>> have something perfect right away, but at least we should have a starting
>>> point. At the moment, we essentially have nothing in this area. We can
>>> always refine the strategy and improve the benchmarks in future releases,
>>> but it would be good to have something tangible that we can showcase.
>>> Personally, I am inclined towards experimenting around Hive–Iceberg
>>> workloads, gathering numbers for specific use cases or queries, and drawing
>>> some comparisons.
>>>
>>> If anyone has already worked on something similar, or has ideas or
>>> proposals for how we could approach this, please do share.
>>>
>>> -Ayush
>>>
>>> On Mon, 9 Feb 2026 at 14:13, Shohei Okumiya <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm curious about the remaining blockers. From my perspective,
>>>> HIVE-29445 and HIVE-29415 might be needed if we include Iceberg v3. I
>>>> think it's possible to put it off until 4.4. HIVE-29415 requires
>>>> Iceberg 1.10.2 or 1.11.0 if I understand correctly.
>>>>
>>>> Hadoop 3.5 is nice, but it hasn't been released yet. Most likely, we
>>>> need to keep using 3.4 for a while.
>>>>
>>>> If we release 4.3 now, I think we should upgrade the Iceberg library
>>>> from 1.10.0 to 1.10.1, which has some bug fixes and is not a big
>>>> effort.
>>>>
>>>> Regards,
>>>> Okumin
>>>>
>>>> On Thu, Jan 22, 2026 at 7:44 PM László Bodor <[email protected]>
>>>> wrote:
>>>> >
>>>> > As to:
>>>> >
>>>> > #4 Hadoop 3.5 support would be great. Do we plan to include a newer Tez
>>>> > version in 4.5? From what I can see, a significant number of changes
>>>> > have recently landed in the repository.
>>>> >
>>>> > I don’t think Tez will reach 1.0.0 before Hive 4.5. Given the major
>>>> > version milestone, we’re aiming to push more changes and are less afraid
>>>> > of breaking things. So unless there’s something blocking, I believe Hive
>>>> > 4.5 can continue to use Tez 0.10.5. My personal expectation for Tez
>>>> > 1.0.0 is "sometime later this year".
>>>> >
>>>> >
>>>> > On Tue, 20 Jan 2026 at 15:45, Ayush Saxena <[email protected]> wrote:
>>>> >>
>>>> >> Hi Attila,
>>>> >> Regarding:
>>>> >>
>>>> >>> As you mentioned, Iceberg v3 is a major part of this release. I fully
>>>> >>> agree, and I think we should clearly highlight that Hive is one of the
>>>> >>> core engines supporting Iceberg v3. Potentially even earlier than
>>>> >>> Trino or other competitors. One thing I would like to put attention to
>>>> >>> (coming from discussions with the Apache Impala team) is that the
>>>> >>> Vector Delete spec seems to have changed, with row-lineage becoming a
>>>> >>> prerequisite. As far as I remember, this is not yet implemented in
>>>> >>> Hive. If we want Hive to officially support Iceberg v3 with vector
>>>> >>> deletes, we should verify and address this gap.
>>>> >>> https://iceberg.apache.org/spec/#row-lineage
>>>> >>
>>>> >>
>>>> >> -----
>>>> >> I’m not entirely sure what the issue is on the Impala side. Iceberg V3
>>>> >> writes and Deletion Vectors are working correctly in Hive, even with
>>>> >> the latest Iceberg version. As far as I know, Iceberg V3 does not allow
>>>> >> committing a snapshot unless row IDs are populated. We also have tests
>>>> >> in place that cover writes and deletes for Iceberg V3.
>>>> >>
>>>> >> We don’t have anything explicit for row lineage because Hive relies on
>>>> >> Iceberg writers; we haven’t implemented custom writers. As a result,
>>>> >> the Iceberg layer is responsible for populating the row IDs and the
>>>> >> next row ID, and that seems to be working as expected.
>>>> >>
>>>> >> I tested this locally and verified the metadata files, which clearly
>>>> >> contain the row IDs. I’m attaching screenshots of the metadata for
>>>> >> reference.
>>>> >>
>>>> >> If Impala is observing unexpected behavior and there turns out to be an
>>>> >> issue with our implementation, they can report it via a ticket.
>>>> >> However, from a fundamentals point of view, this looks correct on the
>>>> >> Hive/Iceberg side.
>>>> >>
>>>> >> -Ayush
>>>> >>
>>>> >>
>>>> >> On Tue, 20 Jan 2026 at 19:24, Denys Kuzmenko <[email protected]>
>>>> >> wrote:
>>>> >>>
>>>> >>> Hi everyone,
>>>> >>>
>>>> >>> +1 on collecting the performance numbers.
>>>> >>>
>>>> >>> I’d like to propose a few additional items to consider:
>>>> >>>
>>>> >>> #1 REST Catalog HA and vended credentials support
>>>> >>> - HIVE-29391,
>>>> >>> - HIVE-29228
>>>> >>>
>>>> >>> #2 Federated Catalog support
>>>> >>> - HIVE-28879
>>>> >>>
>>>> >>> #3 Kubernetes manifests / Helm chart for Apache Hive deployment
>>>> >>>
>>>> >>> #4 New V3 items (that I am aware of)
>>>> >>>
>>>> >>> 1. VARIANT shredding:
>>>> >>> - HIVE-29287,
>>>> >>> - HIVE-29354
>>>> >>>
>>>> >>> 2. Z-order support for Iceberg tables:
>>>> >>> - HIVE-29132
>>>> >>>
>>>> >>> Best regards,
>>>> >>> Denys
