Regarding performance benchmarking, we should have a way to test the actual upstream code. While many - or all - Hive distributors have their own ways of doing this, we as an open-source community don't. The main limitation is the testing setup, because our current single-image (HS2) or HS2+HMS Docker setup is not suitable for this purpose, even though it works wonderfully for quick local testing. That's what's currently being addressed in the scope of https://issues.apache.org/jira/browse/HIVE-29492.
Regards,
Laszlo Bodor

On Tue, 10 Mar 2026 at 07:38, kokila narayanan <[email protected]> wrote:

> Regarding the performance tracking initiative and Hive-Iceberg workloads,
> one possible starting point could be leveraging the *1 Trillion Row
> Challenge (1TRC)* style benchmarks.
>
> The Impala community has already experimented with something along these
> lines and they have even extended it to work with Iceberg tables as well:
> https://github.com/boroknagyz/impala-1trc
>
> The main query is relatively simple aggregation query:
>
> SELECT station, min(measure), max(measure), avg(measure)
> FROM measurements_1trc
> GROUP BY station
> ORDER BY station;
>
> While this benchmark is quite simple and only tests a single type of
> query, it could still be a good starting point. It does not cover the wider
> variety of queries we usually see in Hive workloads (like joins, filters,
> or more complex aggregations), but it is easy to reproduce and run.
>
> With this setup, it could help us get an initial idea of how Hive performs
> on very large Iceberg tables for large-scale scan and aggregation workloads.
>
> I have experimented with this dataset for another feature so I can also
> try running 1BRC/1TRC on Hive and share some initial numbers if that would
> be useful for the release planning.
>
> Thanks,
>
> Kokila
>
> On Tue, Mar 10, 2026 at 11:43 AM Ayush Saxena <[email protected]> wrote:
>
>> Hadoop 3.5.0 is currently in the RC stage (RC0 is already available). I
>> think we can reasonably wait for the final 3.5.0 release, and if time and
>> luck favor us, we could even try giving JDK 25 a shot as well. From a
>> timeline perspective, I don’t think we are too late yet.
>>
>> More broadly, my expectation—or perhaps wish—for the upcoming release
>> would be to include Hadoop 3.5 + Iceberg V3 + JDK 25 + REST Catalog related
>> changes. Having these in the release would make it more compelling for
>> users to upgrade, rather than it feeling like just another bug-fix release
>> that gives the impression we are in KTLO mode. :-)
>>
>> As Attila also mentioned above regarding performance tracking, I would
>> definitely like to push that initiative as part of this release. We may not
>> have something perfect right away, but at least we should have a starting
>> point. At the moment, we essentially have nothing in this area. We can
>> always refine the strategy and improve the benchmarks in future releases,
>> but it would be good to have something tangible that we can showcase.
>> Personally, I am inclined towards experimenting around Hive–Iceberg
>> workloads, gathering numbers for specific use cases or queries, and drawing
>> some comparisons.
>>
>> If anyone has already worked on something similar, or has ideas or
>> proposals for how we could approach this, please do share.
>>
>> -Ayush
>>
>> On Mon, 9 Feb 2026 at 14:13, Shohei Okumiya <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I'm curious about the remaining blockers. From my perspective,
>>> HIVE-29445 and HIVE-29415 might be needed if we include Iceberg v3. I
>>> think it's possible to put it off until 4.4. HIVE-29415 requires
>>> Iceberg 1.10.2 or 1.11.0 if I understand correctly.
>>>
>>> Hadoop 3.5 is nice, but it hasn't been released yet. Most likely, we
>>> need to keep using 3.4 for a while.
>>>
>>> If we release 4.3 now, I think we should upgrade the Iceberg library
>>> from 1.10.0 to 1.10.1, which has some bug fixes and is not a big
>>> effort.
>>>
>>> Regards,
>>> Okumin
>>>
>>> On Thu, Jan 22, 2026 at 7:44 PM László Bodor <[email protected]>
>>> wrote:
>>> >
>>> > As to:
>>> >
>>> > #4 Hadoop 3.5 support would be great. Do we plan to include a newer
>>> > Tez version in 4.5? From what I can see, a significant number of changes
>>> > have recently landed in the repository.
>>> >
>>> > I don’t think Tez will reach 1.0.0 before Hive 4.5. Given the major
>>> > version milestone, we’re aiming to push more changes and are less afraid of
>>> > breaking things. So unless there’s something blocking, I believe Hive 4.5
>>> > can continue to use Tez 0.10.5. My personal expectation for Tez 1.0.0 is
>>> > "sometime later this year".
>>> >
>>> >
>>> > On Tue, 20 Jan 2026 at 15:45, Ayush Saxena <[email protected]> wrote:
>>> >>
>>> >> Hi Attila,
>>> >> Regarding:
>>> >>
>>> >>> As you mentioned, Iceberg v3 is a major part of this release. I
>>> >>> fully agree, and I think we should clearly highlight that Hive is one of
>>> >>> the core engines supporting Iceberg v3. Potentially even earlier than Trino
>>> >>> or other competitors. One thing I would like to put attention to (coming
>>> >>> from discussions with the Apache Impala team) is that the Vector Delete
>>> >>> spec seems to have changed, with row-lineage becoming a prerequisite. As
>>> >>> far as I remember, this is not yet implemented in Hive. If we want Hive to
>>> >>> officially support Iceberg v3 with vector deletes, we should verify and
>>> >>> address this gap. https://iceberg.apache.org/spec/#row-lineage
>>> >>
>>> >>
>>> >> -----
>>> >> I’m not entirely sure what the issue is on the Impala side. Iceberg
>>> >> V3 writes and Deletion Vectors are working correctly in Hive, even with the
>>> >> latest Iceberg version. As far as I know, Iceberg V3 does not allow
>>> >> committing a snapshot unless row IDs are populated. We also have tests in
>>> >> place that cover writes and deletes for Iceberg V3.
>>> >>
>>> >> We don’t have anything explicit for row lineage because Hive relies
>>> >> on Iceberg writers; we haven’t implemented custom writers. As a result, the
>>> >> Iceberg layer is responsible for populating the row IDs and the next row
>>> >> ID, and that seems to be working as expected.
>>> >>
>>> >> I tested this locally and verified the metadata files, which clearly
>>> >> contain the row IDs. I’m attaching screenshots of the metadata for
>>> >> reference.
>>> >>
>>> >> If Impala is observing unexpected behavior and there turns out to be
>>> >> an issue with our implementation, they can report it via a ticket. However,
>>> >> from a fundamentals point of view, this looks correct on the Hive/Iceberg
>>> >> side.
>>> >>
>>> >> -Ayush
>>> >>
>>> >>
>>> >> On Tue, 20 Jan 2026 at 19:24, Denys Kuzmenko <[email protected]>
>>> >> wrote:
>>> >>>
>>> >>> Hi everyone,
>>> >>>
>>> >>> +1 on collecting the performance numbers.
>>> >>>
>>> >>> I’d like to propose a few additional items to consider:
>>> >>>
>>> >>> #1 REST Catalog HA and vended credentials support
>>> >>> - HIVE-29391,
>>> >>> - HIVE-29228
>>> >>>
>>> >>> #2 Federated Catalog support
>>> >>> - HIVE-28879
>>> >>>
>>> >>> #3 Kubernetes manifests / Helm chart for Apache Hive deployment
>>> >>>
>>> >>> #4 New V3 items (that I am aware of)
>>> >>>
>>> >>> 1. VARIANT shredding:
>>> >>> - HIVE-29287,
>>> >>> - HIVE-29354
>>> >>>
>>> >>> 2. Z-order support for Iceberg tables:
>>> >>> - HIVE-29132
>>> >>>
>>> >>> Best regards,
>>> >>> Denys
>>> >>
