Hello All,

As some of you might have noticed, I have been working on IMPALA-8369
<https://issues.apache.org/jira/browse/IMPALA-8369> and I have a WIP patch
on gerrit <https://gerrit.cloudera.org/#/c/13005/>. The current plan is to
build using Hive-3 libraries while keeping compatibility with Hive-2. This
gives us the advantage of keeping only one branch which works with both
setups. If we hit roadblocks for which we don't have any good solutions,
the fall-back could be to create a separate branch for HMS-3 support.

The patch adds to Impala the ability to talk to HMS-3.x while keeping the
ability to talk to HMS-2 intact. This is done using the following approach:

1. Reduce the unnecessary dependencies on Hive (specifically the hive-exec
jar, which is a fat jar including almost all of the Hive code). In my
opinion this is a good thing to do in general, so that we don't
unintentionally add compile-time dependencies on non-public APIs of Hive.
The patch introduces a new shaded-deps module where we exclude all the
unnecessary code from hive-exec.jar to create a reduced jar, which is what
we now depend on.
2. Change the build scripts so that we use Hive 3 binaries to compile. The
toolchain is updated with a custom Hive build (I will change it to official
builds once I have the Hive patches merged). The metastore maintains thrift
wire compatibility with older releases. What is missing is that an HMS-3
client cannot talk to HMS-2, because Hive doesn't guarantee backwards
compatibility from the client's perspective (newer client talking to an
older server). This needs some fixing on the Hive side (HIVE-21596), which
I am also working on in parallel. The working prototype which I have been
using works well so far for this use case (HMS-3 client talking to HMS-2).
3. Additionally, some fixes are needed on the Hive side (HIVE-21586) to
make sure Impala can compile using Hive 3 libraries.
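
To make the client-side compatibility problem in item 2 concrete, here is a
minimal Java sketch of one common way a newer client can keep working
against an older server: try the newer RPC first and fall back to the older
one when the server rejects it. This is only an illustration and is not
necessarily what HIVE-21596 does. MetastoreRpc, listTablesNew,
listTablesOld and UnknownMethodException are hypothetical stand-ins, not
the real HMS thrift API.

```java
import java.util.Arrays;
import java.util.List;

final class HmsClientFallbackSketch {

    /** Hypothetical stand-in for the error a client sees when the server lacks an RPC. */
    static final class UnknownMethodException extends RuntimeException {
        UnknownMethodException(String method) {
            super("server does not implement " + method);
        }
    }

    /** Hypothetical, heavily simplified RPC surface; not the real HMS thrift API. */
    interface MetastoreRpc {
        List<String> listTablesNew(String catalog, String db); // newer servers only
        List<String> listTablesOld(String db);                 // all servers
    }

    /** Simulates an older server: the newer call is rejected, the older one works. */
    static MetastoreRpc olderServer() {
        return new MetastoreRpc() {
            @Override public List<String> listTablesNew(String catalog, String db) {
                throw new UnknownMethodException("listTablesNew");
            }
            @Override public List<String> listTablesOld(String db) {
                return Arrays.asList("t1", "t2");
            }
        };
    }

    /** Simulates a newer server that implements both calls. */
    static MetastoreRpc newerServer() {
        return new MetastoreRpc() {
            @Override public List<String> listTablesNew(String catalog, String db) {
                return Arrays.asList("t1", "t2");
            }
            @Override public List<String> listTablesOld(String db) {
                return Arrays.asList("t1", "t2");
            }
        };
    }

    /** Newer client: prefer the newer RPC, fall back if the server does not know it. */
    static List<String> listTables(MetastoreRpc rpc, String db) {
        try {
            return rpc.listTablesNew("default_catalog", db);
        } catch (UnknownMethodException e) {
            return rpc.listTablesOld(db);
        }
    }

    public static void main(String[] args) {
        System.out.println(listTables(olderServer(), "default"));
        System.out.println(listTables(newerServer(), "default"));
    }
}
```

The point of the sketch is that the caller of listTables never needs to
know which server version it is talking to, which is the property we want
from a single Impala branch.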

The advantages of this approach are:
1. We get to maintain only one branch of code, and it works with both HMS-2
and HMS-3 based deployments. I have been able to run the existing tests
against HMS-2 with the patch. There are still 3 tests which fail, but I
think we can fix them too. Running tests against HMS-3 may need some more
work and will be targeted in a separate JIRA.
2. We can start supporting new features of HMS, such as transactional
tables.

There are a few caveats:
1. Some of the built-in functions in Hive (UDFs) moved from the deprecated
UDF interface to the GenericUDF API. Since Impala currently only supports
executing functions written against the UDF interface, these built-in
functions (so far I have found UDFLength, UDFYear, UDFHour) will not work
once we start using Hive 3 binaries. To fix this we should add support for
GenericUDFs, similar to the existing support for UDFs.
2. We need some additional patches on top of Hive 3.1.0, like the two
mentioned above (HIVE-21596 and HIVE-21586), to build against Hive 3.
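
To illustrate the first caveat, here is a rough Java sketch of why an
executor that only understands the legacy style cannot run a function that
has moved to the generic style. In Hive, a legacy UDF exposes evaluate(...)
methods that are looked up by reflection, while a GenericUDF is driven
through initialize(ObjectInspector[]) and evaluate(DeferredObject[]). The
classes below are simplified, hypothetical stand-ins (plain Java classes
instead of ObjectInspector and DeferredObject) and do not use the Hive API
or show Impala's actual executor.

```java
import java.lang.reflect.Method;

final class UdfDispatchSketch {

    /** Legacy style stand-in: a typed evaluate(...) method found by reflection. */
    static final class LegacyLength {
        public Integer evaluate(String s) {
            return s == null ? null : s.length();
        }
    }

    /** Generic style stand-in (hypothetical): explicit type check, untyped arguments. */
    interface GenericFunction {
        Class<?> initialize(Class<?>[] argTypes);
        Object evaluate(Object[] args);
    }

    static final class GenericLength implements GenericFunction {
        @Override public Class<?> initialize(Class<?>[] argTypes) {
            if (argTypes.length != 1 || argTypes[0] != String.class) {
                throw new IllegalArgumentException("length() expects one string argument");
            }
            return Integer.class;
        }
        @Override public Object evaluate(Object[] args) {
            String s = (String) args[0];
            return s == null ? null : s.length();
        }
    }

    /** True if the object has a public evaluate(...) with exactly these parameter types. */
    static boolean hasLegacyEvaluate(Object udf, Class<?>... argTypes) {
        try {
            udf.getClass().getMethod("evaluate", argTypes);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    /** Legacy-only executor; this sketch assumes all arguments are non-null. */
    static Object callLegacy(Object udf, Object... args) {
        Class<?>[] types = new Class<?>[args.length];
        for (int i = 0; i < args.length; i++) {
            types[i] = args[i].getClass();
        }
        try {
            Method m = udf.getClass().getMethod("evaluate", types);
            return m.invoke(udf, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("no usable legacy evaluate method", e);
        }
    }

    /** Generic executor: initialize with the argument types, then evaluate. */
    static Object callGeneric(GenericFunction fn, Object... args) {
        Class<?>[] types = new Class<?>[args.length];
        for (int i = 0; i < args.length; i++) {
            types[i] = args[i].getClass();
        }
        fn.initialize(types);
        return fn.evaluate(args);
    }

    public static void main(String[] args) {
        System.out.println(callLegacy(new LegacyLength(), "impala"));
        System.out.println(callGeneric(new GenericLength(), "impala"));
        System.out.println(hasLegacyEvaluate(new GenericLength(), String.class));
    }
}
```

In this sketch the legacy lookup finds nothing on GenericLength, which is
the analogue of UDFLength, UDFYear and UDFHour no longer being usable
through the UDF-only path.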

The alternative to this approach is to branch off and have separate
branches for Hive-2 and Hive-3 support. This would mean more cherry-picking
and maintenance to keep each of these branches up to date, as well as
multiple release cadences. Eventually, one of the branches will become the
main development branch, after which we can retire the other line.

Let me know if this all sounds reasonable or if there are any blocking
concerns.

Thanks,
Vihang
