Thanks for working on this. I'm interested in the specific impact that this
has on Java UDFs. When compiled with Hive 3, can Impala run Java UDFs using
the deprecated UDF interface? For example, if I have an Impala cluster
running Hive 2 that has custom Hive UDFs using the deprecated UDF
interface, can Impala still use them after moving to an Impala built with
Hive 3? I want to confirm that this is backwards compatible. Do Hive UDFs
ever depend on Hive components on the CLASSPATH? In other words, if Impala
is running with Hive 3 jars on its CLASSPATH, does that affect legacy
Hive UDFs built against Hive 2?
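
To make the CLASSPATH question concrete, here is a rough probe one could run
with the new jars on the classpath. It is only a sketch: LegacyUdfProbe and
ExampleUdf are hypothetical names for illustration, and this is not how
Impala itself loads UDFs. The only real Hive name used is the deprecated
base class org.apache.hadoop.hive.ql.exec.UDF.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class LegacyUdfProbe {
    // Real Hive class: the deprecated base class that legacy UDFs extend.
    static final String LEGACY_UDF_BASE = "org.apache.hadoop.hive.ql.exec.UDF";

    /** Hypothetical stand-in with a UDF-style evaluate() method (no Hive dependency). */
    public static class ExampleUdf {
        public String evaluate(String s) {
            return s == null ? null : s.toUpperCase();
        }
    }

    /** Returns true if the named class can be loaded (without initializing it). */
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, LegacyUdfProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers the case where a superclass is missing or incompatible.
            return false;
        }
    }

    /** Lists public methods named "evaluate" visible on cls, including inherited ones. */
    static List<String> evaluateSignatures(Class<?> cls) {
        List<String> out = new ArrayList<>();
        for (Method m : cls.getMethods()) {
            if (m.getName().equals("evaluate")) {
                out.add(m.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Is the deprecated base class still present on this classpath?
        System.out.println(LEGACY_UDF_BASE + " loadable: " + isLoadable(LEGACY_UDF_BASE));
        // Optionally probe UDF class names passed on the command line.
        for (String name : args) {
            System.out.println(name + " loadable: " + isLoadable(name));
        }
        System.out.println(evaluateSignatures(ExampleUdf.class));
    }
}
```

Passing the fully-qualified names of the legacy UDF classes as arguments
would show whether they still link against whatever Hive jars are present.
It says nothing about whether they behave the same at runtime.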

Depending on how much code needs to change to use Hive 3, an alternative is
to introduce build-time shims for the differences between Hive 2 and Hive
3. This is how the Impala 2 to Impala 3 transition worked (IMPALA-4277:
https://gerrit.cloudera.org/#/c/9716/ ).
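
A minimal sketch of what I mean by a build-time shim, under stated
assumptions: the names below (ShimSketch, HiveShim, Hive2Shim, Hive3Shim,
lengthBuiltinUsesGenericUdf) are hypothetical and not taken from the patch.
In a real build each variant would live in its own source directory under the
same class name, and the build would compile exactly one of them. Both are
nested here only so the example compiles as a single file.

```java
public class ShimSketch {
    /** Hypothetical shim surface: only the behavior that differs between Hive versions. */
    public interface HiveShim {
        String hiveMajorVersion();

        /** Whether the built-in length function uses GenericUDF instead of the UDF interface. */
        boolean lengthBuiltinUsesGenericUdf();
    }

    /** Variant that would be compiled when building against Hive 2. */
    public static final class Hive2Shim implements HiveShim {
        @Override public String hiveMajorVersion() { return "2"; }
        @Override public boolean lengthBuiltinUsesGenericUdf() { return false; }
    }

    /** Variant that would be compiled when building against Hive 3. */
    public static final class Hive3Shim implements HiveShim {
        @Override public String hiveMajorVersion() { return "3"; }
        @Override public boolean lengthBuiltinUsesGenericUdf() { return true; }
    }

    /** Caller code depends only on the shim interface, never on a Hive version directly. */
    static String describe(HiveShim shim) {
        return "Hive " + shim.hiveMajorVersion() + ": length built-in uses "
            + (shim.lengthBuiltinUsesGenericUdf() ? "GenericUDF" : "UDF");
    }

    public static void main(String[] args) {
        System.out.println(describe(new Hive2Shim()));
        System.out.println(describe(new Hive3Shim()));
    }
}
```

The point of the design is that the rest of the code base stays identical
across both builds, and the version-specific differences stay confined to a
small, reviewable set of shim classes.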

Thanks,
Joe

On Thu, Apr 25, 2019 at 3:09 PM Vihang Karajgaonkar <vih...@cloudera.com>
wrote:

> Hello All,
>
> As some of you might have noticed I have been working on IMPALA-8369
> <https://issues.apache.org/jira/browse/IMPALA-8369> and I have a WIP patch
> on gerrit <https://gerrit.cloudera.org/#/c/13005/>. The current plan is to
> build using Hive-3 libraries while keeping compatibility with Hive-2. This
> gives us the advantage of keeping only one branch which works with both
> setups. If we hit roadblocks for which we don't have any good solutions, the
> fall-back could be to branch off and create a separate branch for HMS-3
> support.
>
> The patch attempts to give Impala the ability to talk to
> HMS-3.x while keeping the ability to talk to HMS-2 intact. This is done
> using the following approach:
>
> 1. Reduce the unnecessary dependencies on Hive (specifically the hive-exec
> jar, which is a fat jar including almost all of the Hive code). This is
> generally a good thing to do in my opinion, so that we don't unintentionally
> add compile-time dependencies on non-public APIs of Hive. It introduces a
> new shaded-deps module where we exclude all the unnecessary code from
> hive-exec.jar to create a reduced jar, which we currently depend on.
> 2. Change the build scripts so that we use Hive 3 binaries to compile. The
> toolchain is updated with a custom Hive build (will change it to official
> builds once I have the Hive patches merged). The metastore maintains Thrift
> wire compatibility with older releases. What is missing is that when you are
> using an HMS3 client you cannot talk to HMS2, because Hive doesn't guarantee
> backwards compatibility from the client perspective (newer client talking to
> older server). This needs some fixing on the Hive side (HIVE-21596), which I
> am also currently working on in parallel. The working prototype which I have
> been using works well so far for this use case (HMS3 client talking to
> HMS2).
> 3. Additionally, some fixes are needed on the Hive side
> (HIVE-21586) to make sure Impala can compile using Hive 3 libraries.
>
> The advantages of this approach are:
> 1. We get to maintain only one branch of code and it works with both HMS-2
> and HMS-3 based deployments. I have been able to run the existing tests
> against HMS-2 with the patch. There are still 3 tests which fail but I
> think we can fix them too. Running tests against HMS-3 may need some more
> work and will be targeted in a separate JIRA.
> 2. We can start supporting new features of HMS (e.g. transactional
> tables).
>
> There are a few caveats:
> 1. Some of the built-in functions in Hive (UDFs) moved from the deprecated
> UDF interface to the GenericUDF API. Since Impala currently only supports
> executing functions that use the UDF interface, these built-in functions (so
> far I have found UDFLength, UDFYear, UDFHour) will not work when we start
> using Hive 3 binaries. In order to fix this we should add support for
> GenericUDFs, similar to the existing support for UDFs.
> 2. We need some additional patches on top of Hive 3.1.0, like the two above,
> to build against Hive 3.
>
> The alternative to this approach is to branch off and have separate
> branches for Hive-2 and Hive-3 support. This would mean more cherry-picking
> and maintenance to keep each of these branches up to date, and multiple
> release cadences. Eventually, one of the branches will become the main
> development branch, after which we can retire the other line.
>
> Let me know if this all sounds reasonable or if there are any blocking
> concerns with this.
>
> Thanks,
> Vihang
>
