marton-bod commented on a change in pull request #1781:
URL: https://github.com/apache/iceberg/pull/1781#discussion_r535277851
##########
File path: build.gradle
##########
@@ -607,6 +607,8 @@ project(':iceberg-hive-runtime') {
if (jdkVersion == '8') {
compile project(':iceberg-hive3')
}
+ // including thrift so that the Tez AM can communicate with the HMS when
using Hive catalog
+ compile("org.apache.thrift:libfb303")
Review comment:
Thanks for your patience again. Here are my answers:
> Shouldn't this already be provided by Hive at runtime? Should Hive be the
one putting this on the AM?
Correct. Hive localizes the hive exec jar onto the classpath of the Tez AM,
via `TezSessionState.createJarLocalResource`. More precisely, it checks which
jar the class `org.apache.hadoop.hive.ql.exec.tez.DagUtils` comes from and
localizes that specific jar for the AM
-- and this should be the `hive-exec.jar`.
> Also another interesting consideration is that Iceberg does not use Thrift
directly, it uses HMSClient. Now hive-runtime does not include HMSClient code
either. So how is Tez AM able to get the HMSClient code but not Thrift?
Correct, it's the `HiveMetaStoreClient` which we use directly in Iceberg.
This class (along with other metastore classes) is already part of the
`hive-exec.jar`, along with most of the Thrift classes as well so Tez has no
problem finding them. The only thing that's missing is the `libfb303`
(Facebook) thrift classes - these are not contained in the `hive-exec.jar`.
This absence has not caused any issues before because Tez has not needed to
communicate with the HMS, but now it does when we use Iceberg tables.
> What is Iceberg doing on the AM that needs to talk with HMS?
Mostly `IcebergInputFormat.getSplits` loading the table from HMS (and
previously `HiveIcebergSerDe` loading the table schema from the HMS - but this
latter should be taken care of by https://github.com/apache/iceberg/pull/1794)
> I've never run into a problem with Iceberg on Hive related to this as Hive
provides libfb303-0.9.3.jar in its lib folder (for Hive 2.3.7). What steps
reproduce the problem? Is it specific to Hive 3?
This problem should be specific to Tez, because the AM does not have access
to the lib folder used by Hive. Not exactly sure how the classpath is handled
when MR launches separate processes.
> If we need to include this, is it okay to not shade this?
I've tried shading it by including `relocate 'com.facebook.fb303',
'org.apache.iceberg.shaded.com.facebook.fb303'` but then Hive unfortunately
failed to find the class when it was shaded.
So to summarize:
- In the current state, Tez does not work out-of-the-box using only the
hive-iceberg-runtime jar. The user also needs to put the `libfb303` jar on
Tez's classpath one way or another. This should be documented. I will create a
PR to add this to the Hive docs.
- While the solution in this PR does solve this problem, the `libfb303`
classes don't really belong in the runtime jar, especially in an unshaded
version (although probably not too risky given there'd be no version conflict
since Hive2 and Hive3 both use 0.9.3), so I'll abandon this change
- A probably better solution would be to create a HIVE change and include
the `libfb303` classes in the `hive-exec.jar` (which already contains the other
thrift classes). Then we can include that change in an upcoming upstream
release and upgrade the version in Iceberg once it's out.
What are your thoughts?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]