This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new 90a14d5  [SPARK-26422][R] Support to disable Hive support in SparkR even for Hadoop versions unsupported by Hive fork
90a14d5 is described below

commit 90a14d58b4e87a603e35a9ab679f6049b10e9c7b
Author: Hyukjin Kwon <gurwls...@apache.org>
AuthorDate: Fri Dec 21 16:09:30 2018 +0800

    [SPARK-26422][R] Support to disable Hive support in SparkR even for Hadoop versions unsupported by Hive fork
    
    ## What changes were proposed in this pull request?
    
    Currently, even if I explicitly disable Hive support in a SparkR session as below:
    
    ```r
    sparkSession <- sparkR.session("local[4]", "SparkR", Sys.getenv("SPARK_HOME"),
                                   enableHiveSupport = FALSE)
    ```
    
    it still produces the following error when the Hadoop version is not supported by our Hive fork:
    
    ```
    java.lang.reflect.InvocationTargetException
    ...
    Caused by: java.lang.IllegalArgumentException: Unrecognized Hadoop major version number: 3.1.1.3.1.0.0-78
        at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:174)
        at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:139)
        at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:100)
        at org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:368)
        ... 43 more
    Error in handleErrors(returnStatus, conn) :
      java.lang.ExceptionInInitializerError
        at org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:105)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at org.apache.spark.util.Utils$.classForName(Utils.scala:193)
        at org.apache.spark.sql.SparkSession$.hiveClassesArePresent(SparkSession.scala:1116)
        at org.apache.spark.sql.api.r.SQLUtils$.getOrCreateSparkSession(SQLUtils.scala:52)
        at org.apache.spark.sql.api.r.SQLUtils.getOrCreateSparkSession(SQLUtils.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    ```
    
    The root cause is that:
    
    ```
    SparkSession.hiveClassesArePresent
    ```
    
    checks whether the class is loadable to determine if it is on the classpath, but `org.apache.hadoop.hive.conf.HiveConf` performs a Hadoop version check in static initialization logic that runs right away. This throws an `IllegalArgumentException`, which is not caught:
    
    https://github.com/apache/spark/blob/36edbac1c8337a4719f90e4abd58d38738b2e1fb/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L1113-L1121
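
    For illustration only (not part of this commit), here is a minimal Scala sketch of why the presence check blows up: `Class.forName` initializes the class by default, so `HiveConf`'s static Hadoop-version check runs during the probe, and its failure surfaces as an `ExceptionInInitializerError` that a classpath check does not handle. The object and method names below are hypothetical, not Spark's actual implementation.
    
    ```scala
    // Hypothetical sketch: probing for a class with Class.forName also runs
    // its static initializers, so a "is this class on the classpath?" check
    // can fail for reasons unrelated to the classpath.
    object HiveConfProbe {
      def hiveConfLoadable(): Boolean =
        try {
          // initialize = true by default; HiveConf's static Hadoop-version
          // check runs here and can throw before we ever get a result.
          Class.forName("org.apache.hadoop.hive.conf.HiveConf")
          true
        } catch {
          case _: ClassNotFoundException => false
          // ExceptionInInitializerError is an Error, not an Exception, so it
          // is not caught here and propagates to the caller.
        }
    }
    ```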
    
    So, currently, if users have a Hive-enabled Spark build with a Hadoop version unsupported by our Hive fork (namely 3+), there's no way to use SparkR even though it could work.
    
    This PR proposes to change the order of the boolean comparison so that we don't execute `SparkSession.hiveClassesArePresent` when:
    
      1. `enableHiveSupport` is explicitly disabled
      2. `spark.sql.catalogImplementation` is `in-memory`
    
    so that, by short-circuiting, we **only** check `SparkSession.hiveClassesArePresent` when Hive support is explicitly enabled.
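
    As an aside (not part of the commit), a hedged Scala sketch of the short-circuit idea with hypothetical names: `&&` evaluates left to right and stops at the first `false`, so placing the cheap, user-controlled conditions first means the class-loading probe never runs when Hive support is disabled or the catalog is `in-memory`. The actual change is in the diff below.
    
    ```scala
    // Hypothetical helper, not the actual SQLUtils code: the by-name parameter
    // delays the expensive (and potentially throwing) check until it is needed.
    object CatalogDecision {
      def shouldUseHiveCatalog(
          enableHiveSupport: Boolean,
          catalogImplementation: String,
          hiveClassesArePresent: => Boolean): Boolean = {
        enableHiveSupport &&
          catalogImplementation.equalsIgnoreCase("hive") &&
          hiveClassesArePresent  // only evaluated when both flags above hold
      }
    
      // With Hive support disabled, the probe argument is never evaluated:
      // shouldUseHiveCatalog(enableHiveSupport = false, "in-memory",
      //   sys.error("never reached"))
    }
    ```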
    
    ## How was this patch tested?
    
    It's difficult to write a test since we don't run tests against Hadoop 3 yet. See https://github.com/apache/spark/pull/21588. Manually tested.
    
    Closes #23356 from HyukjinKwon/SPARK-26422.
    
    Authored-by: Hyukjin Kwon <gurwls...@apache.org>
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
    (cherry picked from commit 305e9b5ad22b428501fd42d3730d73d2e09ad4c5)
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 .../src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala b/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala
index af20764..4c71795 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala
@@ -49,9 +49,17 @@ private[sql] object SQLUtils extends Logging {
       sparkConfigMap: JMap[Object, Object],
       enableHiveSupport: Boolean): SparkSession = {
     val spark =
-      if (SparkSession.hiveClassesArePresent && enableHiveSupport &&
+      if (enableHiveSupport &&
           jsc.sc.conf.get(CATALOG_IMPLEMENTATION.key, "hive").toLowerCase(Locale.ROOT) ==
-            "hive") {
+            "hive" &&
+          // Note that the order of conditions here are on purpose.
+          // `SparkSession.hiveClassesArePresent` checks if Hive's `HiveConf` is loadable or not;
+          // however, `HiveConf` itself has some static logic to check if Hadoop version is
+          // supported or not, which throws an `IllegalArgumentException` if unsupported.
+          // If this is checked first, there's no way to disable Hive support in the case above.
+          // So, we intentionally check if Hive classes are loadable or not only when
+          // Hive support is explicitly enabled by short-circuiting. See also SPARK-26422.
+          SparkSession.hiveClassesArePresent) {
         SparkSession.builder().sparkContext(withHiveExternalCatalog(jsc.sc)).getOrCreate()
       } else {
         if (enableHiveSupport) {


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
