GitHub user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7929#discussion_r36205374
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala ---
    @@ -62,6 +63,39 @@ private[hive] class ClientWrapper(
       extends ClientInterface
       with Logging {
     
    +  overrideHadoopShims()
    +
    +  // !! HACK ALERT !!
    +  //
    +  // This method is used to workaround CDH Hadoop versions like 2.0.0-mr1-cdh4.1.1.
    +  //
    +  // Internally, Hive `ShimLoader` tries to load different versions of Hadoop shims by checking
    +  // version information gathered from Hadoop jar files.  If the major version number is 1,
    +  // `Hadoop20SShims` will be loaded.  Otherwise, if the major version number is 2, `Hadoop23Shims`
    +  // will be chosen.  However, CDH Hadoop versions like 2.0.0-mr1-cdh4.1.1 have 2 as major version
    +  // number, but contain Hadoop 1 code.  This confuses Hive `ShimLoader` and loads wrong version of
    --- End diff --
    
    @srowen Thanks for the information! I'm not familiar with CDH versions, and 
I was actually quite confused when investigating this issue:
    
    1. In [this Jenkins build] [1], `-Dhadoop.version=2.0.0-mr1-cdh4.1.1` is 
listed together with `-Phadoop-1`, which indicates that this version is for 
Hadoop 1. But shouldn't 2.0.0 be a Hadoop 2 version?
    1. What exactly does the `mr1` part in the version string mean? I could 
only find information about it in the POM file downloaded from the [Cloudera 
repository] [2], but couldn't find it in the Cloudera GitHub hadoop-common [branch 
cdh4-2.0.0_4.1.1] [3]. I may be missing something here though, since I haven't 
grep-ed the whole source tree of this version yet. For now, I see it as a 
variation of Hadoop 2.0.0 built for CDH 4.1.1 that only provides Hadoop 1 
functionality. Is that right?
    
    As for the "bigger problems" part, I do think the wrong Hadoop shims are 
being used when running Hive against these `2.*.*-mr1-cdh*` versions. However, 
the `ShimLoader` code has been there for quite a while. It would be really weird 
if this were the case and no one had ever noticed it.
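
    The selection behavior described in the diff comment can be illustrated 
with a minimal sketch. This is not Hive's actual `ShimLoader` code: the object 
name `ShimSelectionSketch` and both method names are hypothetical, and treating 
an `-mr1-` marker in the version string as "Hadoop 1 code" is an assumption 
based only on the version pattern discussed here.

```scala
// Hypothetical sketch; NOT the real Hive ShimLoader implementation.
object ShimSelectionSketch {
  // Shim class names as mentioned in the diff comment.
  val Hadoop20SShims = "Hadoop20SShims"
  val Hadoop23Shims = "Hadoop23Shims"

  /** Picks a shim from the major version number alone, as the diff comment describes. */
  def shimByMajorVersion(hadoopVersion: String): Option[String] =
    hadoopVersion.split("\\.").headOption.flatMap {
      case "1" => Some(Hadoop20SShims)
      case "2" => Some(Hadoop23Shims)
      case _   => None
    }

  /**
   * Assumed heuristic: a version string containing "-mr1-" is treated as
   * carrying Hadoop 1 code, regardless of its major version number.
   */
  def shimWithMr1Check(hadoopVersion: String): Option[String] =
    if (hadoopVersion.contains("-mr1-")) Some(Hadoop20SShims)
    else shimByMajorVersion(hadoopVersion)
}
```

    With `2.0.0-mr1-cdh4.1.1`, `shimByMajorVersion` selects `Hadoop23Shims` 
because the major version is 2, while `shimWithMr1Check` selects 
`Hadoop20SShims`. That difference is the mismatch in question.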
    
    [1]: 
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/3157/
    [2]: 
https://repository.cloudera.com/artifactory/public/org/apache/hadoop/hadoop-core/2.0.0-mr1-cdh4.1.1/
    [3]: https://github.com/cloudera/hadoop-common/tree/cdh4-2.0.0_4.1.1

