[GitHub] spark pull request: [SPARK-9593] [SQL] Fixes Hadoop shims loading

liancheng Tue, 04 Aug 2015 08:56:25 -0700

Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/7929#discussion_r36205374

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala ---
@@ -62,6 +63,39 @@ private[hive] class ClientWrapper(
extends ClientInterface
with Logging {

+ overrideHadoopShims()
+
+ // !! HACK ALERT !!
+ //
+ // This method is used to workaround CDH Hadoop versions like 2.0.0-mr1-cdh4.1.1.
+ //
+ // Internally, Hive `ShimLoader` tries to load different versions of Hadoop shims by checking
+ // version information gathered from Hadoop jar files. If the major version number is 1,
+ // `Hadoop20SShims` will be loaded. Otherwise, if the major version number is 2, `Hadoop23Shims`
+ // will be chosen. However, CDH Hadoop versions like 2.0.0-mr1-cdh4.1.1 have 2 as major version
+ // number, but contain Hadoop 1 code. This confuses Hive `ShimLoader` and loads wrong version of
--- End diff --

@srowen Thanks for the information! I'm not familiar with CDH versions, and
I was actually quite confused when investigating this issue:

1. In [this Jenkins build] [1], `-Dhadoop.version=2.0.0-mr1-cdh4.1.1` is
listed together with `-Phadoop-1`, which indicates that this version is for
Hadoop 1. But shouldn't 2.0.0 be a Hadoop 2 version?
2. What exactly does the `mr1` part in the version string mean? I could
only find information about it in the POM file downloaded from the [Cloudera
repository] [2], but couldn't find it in the Cloudera GitHub hadoop-common [branch
cdh4-2.0.0_4.1.1] [3]. I may be missing something here, though, since I haven't
grepped the whole source tree of this version yet. For now, I see it as a
variation of Hadoop 2.0.0 built for CDH 4.1.1 that only provides Hadoop 1
functionality. Is that right?

As for the "bigger problems" part, I do think the wrong Hadoop shims are being
used when running Hive against these `2.*.*-mr1-cdh*` versions. However, the
`ShimLoader` code has been there for quite a while. It would be really weird if
this were the case and no one had ever noticed it.
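
To make the suspected mismatch concrete, here is a minimal Scala sketch of the selection rule as the diff comment above describes it. It is not Hive's actual `ShimLoader` code: the object, the method names, and the regular expression are all hypothetical and chosen only for illustration. Only the two shim class names come from the diff comment.

```scala
// Illustrative sketch only. Every name here is hypothetical except the two
// shim class names, which are taken from the diff comment.
object ShimSelectionSketch {
  val Hadoop1Shims = "Hadoop20SShims"
  val Hadoop2Shims = "Hadoop23Shims"

  /** Picks shims from the major version number alone, as the diff comment describes. */
  def shimsByMajorVersion(hadoopVersion: String): String = {
    val major = hadoopVersion.takeWhile(_ != '.')
    major match {
      case "1" => Hadoop1Shims
      case "2" => Hadoop2Shims
      case other =>
        throw new IllegalArgumentException(s"Unrecognized Hadoop major version: $other")
    }
  }

  /** Assumed pattern for CDH MR1 builds such as 2.0.0-mr1-cdh4.1.1. */
  def isCdhMr1(hadoopVersion: String): Boolean =
    hadoopVersion.matches("""2\.\d+\.\d+-mr1-cdh.*""")

  /** Applies the special case first, then falls back to the major-version rule. */
  def shimsWithOverride(hadoopVersion: String): String =
    if (isCdhMr1(hadoopVersion)) Hadoop1Shims else shimsByMajorVersion(hadoopVersion)
}
```

With this sketch, `shimsByMajorVersion("2.0.0-mr1-cdh4.1.1")` picks `Hadoop23Shims`, while `shimsWithOverride` picks `Hadoop20SShims` for the same string, which is the behavior the workaround is after.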

[1]:
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/3157/
[2]:
https://repository.cloudera.com/artifactory/public/org/apache/hadoop/hadoop-core/2.0.0-mr1-cdh4.1.1/
[3]: https://github.com/cloudera/hadoop-common/tree/cdh4-2.0.0_4.1.1



