Re: [DISCUSS] Next Release (4.3.0)

László Bodor Tue, 10 Mar 2026 00:02:47 -0700

Regarding performance benchmarking, we should have a way to test the actual
upstream code. While many - or all - Hive distributors have their own ways
of doing this, we as an open-source community don't. The main limitation is
the testing setup, because our current single-image (HS2) or HS2+HMS Docker
setup is not suitable for this purpose, even though it works wonderfully
for quick local testing.
That's what's currently being addressed in the scope of
https://issues.apache.org/jira/browse/HIVE-29492.


Regards,
Laszlo Bodor


On Tue, 10 Mar 2026 at 07:38, kokila narayanan <[email protected]>
wrote:

>
> Regarding the performance tracking initiative and Hive-Iceberg workloads,
> one possible starting point could be leveraging the *1 Trillion Row
> Challenge (1TRC)* style benchmarks.
>
> The Impala community has already experimented with something along these
> lines and they have even extended it to work with Iceberg tables as well:
> https://github.com/boroknagyz/impala-1trc
>
> The main query is a relatively simple aggregation query:
>
> SELECT station, min(measure), max(measure), avg(measure)
> FROM measurements_1trc
> GROUP BY station
> ORDER BY station;
>
> While this benchmark is quite simple and only tests a single type of
> query, it could still be a good starting point. It does not cover the wider
> variety of queries we usually see in Hive workloads (like joins, filters,
> or more complex aggregations), but it is easy to reproduce and run.
>
> With this setup, it could help us get an initial idea of how Hive performs
> on very large Iceberg tables for large-scale scan and aggregation workloads.
>
> I have experimented with this dataset for another feature so I can also
> try running 1BRC/1TRC on Hive and share some initial numbers if that would
> be useful for the release planning.
>
> Thanks,
>
> Kokila
>
> On Tue, Mar 10, 2026 at 11:43 AM Ayush Saxena <[email protected]> wrote:
>
>> Hadoop 3.5.0 is currently in the RC stage (RC0 is already available). I
>> think we can reasonably wait for the final 3.5.0 release, and if time and
>> luck favor us, we could even try giving JDK 25 a shot as well. From a
>> timeline perspective, I don’t think we are too late yet.
>>
>> More broadly, my expectation—or perhaps wish—for the upcoming release
>> would be to include Hadoop 3.5 + Iceberg V3 + JDK 25 + REST Catalog related
>> changes. Having these in the release would make it more compelling for
>> users to upgrade, rather than it feeling like just another bug-fix release
>> that gives the impression we are in KTLO mode. :-)
>>
>> As Attila also mentioned above regarding performance tracking, I would
>> definitely like to push that initiative as part of this release. We may not
>> have something perfect right away, but at least we should have a starting
>> point. At the moment, we essentially have nothing in this area. We can
>> always refine the strategy and improve the benchmarks in future releases,
>> but it would be good to have something tangible that we can showcase.
>> Personally, I am inclined towards experimenting around Hive–Iceberg
>> workloads, gathering numbers for specific use cases or queries, and drawing
>> some comparisons.
>>
>> If anyone has already worked on something similar, or has ideas or
>> proposals for how we could approach this, please do share.
>>
>> -Ayush
>>
>> On Mon, 9 Feb 2026 at 14:13, Shohei Okumiya <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I'm curious about the remaining blockers. From my perspective,
>>> HIVE-29445 and HIVE-29415 might be needed if we include Iceberg v3. I
>>> think it's possible to put it off until 4.4. HIVE-29415 requires
>>> Iceberg 1.10.2 or 1.11.0 if I understand correctly.
>>>
>>> Hadoop 3.5 is nice, but it hasn't been released yet. Most likely, we
>>> need to keep using 3.4 for a while.
>>>
>>> If we release 4.3 now, I think we should upgrade the Iceberg library
>>> from 1.10.0 to 1.10.1, which has some bug fixes and is not a big
>>> effort.
>>>
>>> Regards,
>>> Okumin
>>>
>>> On Thu, Jan 22, 2026 at 7:44 PM László Bodor <[email protected]>
>>> wrote:
>>> >
>>> > As to:
>>> >
>>> > #4 Hadoop 3.5 support would be great. Do we plan to include a newer
>>> > Tez version in 4.5? From what I can see, a significant number of changes
>>> > have recently landed in the repository.
>>> >
>>> > I don’t think Tez will reach 1.0.0 before Hive 4.5. Given the major
>>> > version milestone, we’re aiming to push more changes and are less afraid of
>>> > breaking things. So unless there’s something blocking, I believe Hive 4.5
>>> > can continue to use Tez 0.10.5. My personal expectation for Tez 1.0.0 is
>>> > "sometime later this year".
>>> >
>>> >
>>> > On Tue, 20 Jan 2026 at 15:45, Ayush Saxena <[email protected]> wrote:
>>> >>
>>> >> Hi Attila,
>>> >> Regarding:
>>> >>
>>> >>> As you mentioned, Iceberg v3 is a major part of this release. I
>>> >>> fully agree, and I think we should clearly highlight that Hive is one of
>>> >>> the core engines supporting Iceberg v3. Potentially even earlier than Trino
>>> >>> or other competitors. One thing I would like to draw attention to (coming
>>> >>> from discussions with the Apache Impala team) is that the Vector Delete
>>> >>> spec seems to have changed, with row-lineage becoming a prerequisite. As
>>> >>> far as I remember, this is not yet implemented in Hive. If we want Hive to
>>> >>> officially support Iceberg v3 with vector deletes, we should verify and
>>> >>> address this gap. https://iceberg.apache.org/spec/#row-lineage
>>> >>
>>> >>
>>> >> -----
>>> >> I’m not entirely sure what the issue is on the Impala side. Iceberg
>>> >> V3 writes and Deletion Vectors are working correctly in Hive, even with the
>>> >> latest Iceberg version. As far as I know, Iceberg V3 does not allow
>>> >> committing a snapshot unless row IDs are populated. We also have tests in
>>> >> place that cover writes and deletes for Iceberg V3.
>>> >>
>>> >> We don’t have anything explicit for row lineage because Hive relies
>>> >> on Iceberg writers; we haven’t implemented custom writers. As a result, the
>>> >> Iceberg layer is responsible for populating the row IDs and the next row
>>> >> ID, and that seems to be working as expected.
>>> >>
>>> >> I tested this locally and verified the metadata files, which clearly
>>> >> contain the row IDs. I’m attaching screenshots of the metadata for
>>> >> reference.
>>> >>
>>> >> If Impala is observing unexpected behavior and there turns out to be
>>> >> an issue with our implementation, they can report it via a ticket. However,
>>> >> from a fundamentals point of view, this looks correct on the Hive/Iceberg
>>> >> side.
>>> >>
>>> >> -Ayush
>>> >>
>>> >>
>>> >> On Tue, 20 Jan 2026 at 19:24, Denys Kuzmenko <[email protected]>
>>> >> wrote:
>>> >>>
>>> >>> Hi everyone,
>>> >>>
>>> >>> +1 on collecting the performance numbers.
>>> >>>
>>> >>> I’d like to propose a few additional items to consider:
>>> >>>
>>> >>> #1 REST Catalog HA and vended credentials support
>>> >>> - HIVE-29391,
>>> >>> - HIVE-29228
>>> >>>
>>> >>> #2 Federated Catalog support
>>> >>> - HIVE-28879
>>> >>>
>>> >>> #3 Kubernetes manifests / Helm chart for Apache Hive deployment
>>> >>>
>>> >>> #4 New V3 items (that I am aware of)
>>> >>>
>>> >>> 1. VARIANT shredding:
>>> >>>   - HIVE-29287,
>>> >>>   - HIVE-29354
>>> >>>
>>> >>> 2. Z-order support for Iceberg tables:
>>> >>>   - HIVE-29132
>>> >>>
>>> >>> Best regards,
>>> >>> Denys
>>>
>>
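
For anyone who wants to sanity-check the shape of the 1BRC/1TRC aggregation
quoted in this thread before running it on Hive, here is a minimal sketch. It
runs the same query against a tiny in-memory SQLite table, so it says nothing
about Hive or Iceberg performance. The table and column names come from the
query in the thread; the column types, the sample rows, and the helper name
run_sample_query are assumptions made purely for illustration.

```python
import sqlite3

# Hypothetical sample rows; the real 1BRC/1TRC datasets are vastly larger.
SAMPLE_ROWS = [
    ("Budapest", 12.0),
    ("Budapest", -3.5),
    ("Chennai", 31.0),
    ("Chennai", 29.0),
    ("Tokyo", 18.5),
]

# The aggregation query as quoted in the thread.
QUERY = """
SELECT station, min(measure), max(measure), avg(measure)
FROM measurements_1trc
GROUP BY station
ORDER BY station
"""


def run_sample_query(rows):
    """Load rows into an in-memory SQLite table and run the aggregation."""
    conn = sqlite3.connect(":memory:")
    try:
        # Assumed schema: the thread does not state the column types.
        conn.execute(
            "CREATE TABLE measurements_1trc (station TEXT, measure REAL)"
        )
        conn.executemany(
            "INSERT INTO measurements_1trc VALUES (?, ?)", rows
        )
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # One (station, min, max, avg) tuple per station, sorted by station.
    for row in run_sample_query(SAMPLE_ROWS):
        print(row)
```

The result is one row per station with its minimum, maximum, and average
measure, which is the output a Hive run of the same query should produce for
the same input.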
