Re: [DISCUSS] Next Release (4.3.0)

kokila narayanan Mon, 09 Mar 2026 23:38:48 -0700

Regarding the performance tracking initiative and Hive-Iceberg workloads,
one possible starting point could be leveraging the *1 Trillion Row
Challenge (1TRC)* style benchmarks.


The Impala community has already experimented along these lines and has
extended the benchmark to work with Iceberg tables:
https://github.com/boroknagyz/impala-1trc

The main query is a relatively simple aggregation query:

SELECT station, min(measure), max(measure), avg(measure)
FROM measurements_1trc
GROUP BY station
ORDER BY station;

While this benchmark is quite simple and only tests a single type of query,
it could still be a good starting point. It does not cover the wider
variety of queries we usually see in Hive workloads (like joins, filters,
or more complex aggregations), but it is easy to reproduce and run.
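
To illustrate the shape of the result before running anything at scale, here
is a minimal sketch that runs the same query over a few rows in an in-memory
SQLite table. This is only an assumption-laden stand-in, not Hive or Iceberg:
the sample rows, the column types, and the helper name run_1trc_query are all
hypothetical.

```python
import sqlite3


def run_1trc_query(rows):
    """Run the 1TRC-style aggregation over (station, measure) rows.

    Uses an in-memory SQLite table as a stand-in; the column types are an
    assumption made for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE measurements_1trc (station TEXT, measure REAL)"
        )
        conn.executemany(
            "INSERT INTO measurements_1trc VALUES (?, ?)", rows
        )
        return conn.execute(
            "SELECT station, min(measure), max(measure), avg(measure) "
            "FROM measurements_1trc "
            "GROUP BY station "
            "ORDER BY station"
        ).fetchall()
    finally:
        conn.close()


# Hypothetical sample data, not taken from the 1TRC dataset.
sample = [
    ("Oslo", 2.0),
    ("Oslo", 4.0),
    ("Cairo", 30.0),
    ("Cairo", 34.0),
    ("Lima", 18.0),
]
result = run_1trc_query(sample)
for row in result:
    print(row)
```

Each output row is (station, min, max, avg), one row per station, sorted by
station name. The real benchmark runs the same query shape over a trillion
rows.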

This setup could give us an initial idea of how Hive performs on large-scale
scan and aggregation workloads over very large Iceberg tables.

I have experimented with this dataset for another feature, so I can also try
running 1BRC/1TRC on Hive and share some initial numbers if that would be
useful for the release planning.

Thanks,

Kokila

On Tue, Mar 10, 2026 at 11:43 AM Ayush Saxena wrote:

> Hadoop 3.5.0 is currently in the RC stage (RC0 is already available). I
> think we can reasonably wait for the final 3.5.0 release, and if time and
> luck favor us, we could even try giving JDK 25 a shot as well. From a
> timeline perspective, I don’t think we are too late yet.
>
> More broadly, my expectation—or perhaps wish—for the upcoming release
> would be to include Hadoop 3.5 + Iceberg V3 + JDK 25 + REST Catalog related
> changes. Having these in the release would make it more compelling for
> users to upgrade, rather than it feeling like just another bug-fix release
> that gives the impression we are in KTLO mode. :-)
>
> As Attila also mentioned above regarding performance tracking, I would
> definitely like to push that initiative as part of this release. We may not
> have something perfect right away, but at least we should have a starting
> point. At the moment, we essentially have nothing in this area. We can
> always refine the strategy and improve the benchmarks in future releases,
> but it would be good to have something tangible that we can showcase.
> Personally, I am inclined towards experimenting around Hive–Iceberg
> workloads, gathering numbers for specific use cases or queries, and drawing
> some comparisons.
>
> If anyone has already worked on something similar, or has ideas or
> proposals for how we could approach this, please do share.
>
> -Ayush
>
> On Mon, 9 Feb 2026 at 14:13, Shohei Okumiya wrote:
>
>> Hi,
>>
>> I'm curious about the remaining blockers. From my perspective,
>> HIVE-29445 and HIVE-29415 might be needed if we include Iceberg v3. I
>> think it's possible to put it off until 4.4. HIVE-29415 requires
>> Iceberg 1.10.2 or 1.11.0 if I understand correctly.
>>
>> Hadoop 3.5 is nice, but it hasn't been released yet. Most likely, we
>> need to keep using 3.4 for a while.
>>
>> If we release 4.3 now, I think we should upgrade the Iceberg library
>> from 1.10.0 to 1.10.1, which has some bug fixes and is not a big
>> effort.
>>
>> Regards,
>> Okumin
>>
>> On Thu, Jan 22, 2026 at 7:44 PM László Bodor wrote:
>> >
>> > As to:
>> >
>> > #4 Hadoop 3.5 support would be great. Do we plan to include a newer Tez
>> > version in 4.5? From what I can see, a significant number of changes have
>> > recently landed in the repository.
>> >
>> > I don’t think Tez will reach 1.0.0 before Hive 4.5. Given the major
>> > version milestone, we’re aiming to push more changes and are less afraid of
>> > breaking things. So unless there’s something blocking, I believe Hive 4.5
>> > can continue to use Tez 0.10.5. My personal expectation for Tez 1.0.0 is
>> > "sometime later this year".
>> >
>> >
>> > On Tue, 20 Jan 2026 at 15:45, Ayush Saxena wrote:
>> >>
>> >> Hi Attila,
>> >> Regarding:
>> >>
>> >>> As you mentioned, Iceberg v3 is a major part of this release. I fully
>> >>> agree, and I think we should clearly highlight that Hive is one of the core
>> >>> engines supporting Iceberg v3, potentially even earlier than Trino or other
>> >>> competitors. One thing I would like to draw attention to (coming from
>> >>> discussions with the Apache Impala team) is that the Vector Delete spec
>> >>> seems to have changed, with row-lineage becoming a prerequisite. As far as
>> >>> I remember, this is not yet implemented in Hive. If we want Hive to
>> >>> officially support Iceberg v3 with vector deletes, we should verify and
>> >>> address this gap. https://iceberg.apache.org/spec/#row-lineage
>> >>
>> >>
>> >> -----
>> >> I’m not entirely sure what the issue is on the Impala side. Iceberg V3
>> >> writes and Deletion Vectors are working correctly in Hive, even with the
>> >> latest Iceberg version. As far as I know, Iceberg V3 does not allow
>> >> committing a snapshot unless row IDs are populated. We also have tests in
>> >> place that cover writes and deletes for Iceberg V3.
>> >>
>> >> We don’t have anything explicit for row lineage because Hive relies on
>> >> Iceberg writers; we haven’t implemented custom writers. As a result, the
>> >> Iceberg layer is responsible for populating the row IDs and the next row
>> >> ID, and that seems to be working as expected.
>> >>
>> >> I tested this locally and verified the metadata files, which clearly
>> >> contain the row IDs. I’m attaching screenshots of the metadata for
>> >> reference.
>> >>
>> >> If Impala is observing unexpected behavior and there turns out to be
>> >> an issue with our implementation, they can report it via a ticket. However,
>> >> from a fundamentals point of view, this looks correct on the Hive/Iceberg
>> >> side.
>> >>
>> >> -Ayush
>> >>
>> >>
>> >> On Tue, 20 Jan 2026 at 19:24, Denys Kuzmenko wrote:
>> >>>
>> >>> Hi everyone,
>> >>>
>> >>> +1 on collecting the performance numbers.
>> >>>
>> >>> I’d like to propose a few additional items to consider:
>> >>>
>> >>> #1 REST Catalog HA and vended credentials support
>> >>> - HIVE-29391,
>> >>> - HIVE-29228
>> >>>
>> >>> #2 Federated Catalog support
>> >>> - HIVE-28879
>> >>>
>> >>> #3 Kubernetes manifests / Helm chart for Apache Hive deployment
>> >>>
>> >>> #4 New V3 items (that I am aware of)
>> >>>
>> >>> 1. VARIANT shredding:
>> >>>   - HIVE-29287,
>> >>>   - HIVE-29354
>> >>>
>> >>> 2. Z-order support for Iceberg tables:
>> >>>   - HIVE-29132
>> >>>
>> >>> Best regards,
>> >>> Denys
>>
>
