[ https://issues.apache.org/jira/browse/HUDI-260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933015#comment-16933015 ]
Vinoth Chandar edited comment on HUDI-260 at 9/19/19 3:06 AM:
--------------------------------------------------------------

Tried to search around for this. It seems like any code that is used in a closure (i.e. code that Spark serializes from the driver to the executors) needs to be passed via --jars and not extraClassPath; I figure the two use different class loaders. Another theory is that if we use the Spark Java APIs, the jars need to be placed under the `jars` folder and that's the way to go. My guess is that the Java lambda -> Scala -> codegen path fails someplace when the jar is supplied via extraClassPath. I am still looking, but this does not seem like an issue with how we are bundling/shading (there are no Spark/Scala jars in the bundle).

Anyway, I am having trouble reproducing this error. For me, the jar is somehow not even getting picked up via extraClassPath (spark.jars works):

{code}
root@adhoc-2:/opt# cat /opt/spark/conf/spark-defaults.conf
...
spark.driver.extraClassPath /var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar
spark.executor.extraClassPath /var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar

root@adhoc-2:/opt# $SPARK_INSTALL/bin/spark-shell --master local[2] --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client --driver-memory 1G --executor-memory 3G --num-executors 1
19/09/19 02:54:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://adhoc-2:4040
Spark context available as 'sc' (master = local[2], app id = local-1568861680731).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.hudi.DataSourceReadOptions;
<console>:23: error: object hudi is not a member of package org.apache
       import org.apache.hudi.DataSourceReadOptions;
              ^

scala>
{code}

Can you give me a reproducible setup on the demo containers?
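For reference, a minimal sketch contrasting the two mechanisms discussed above. The bundle path is the one from the demo containers; the other spark-shell flags from the session are omitted for brevity:

{code}
# Works: the jar is distributed through Spark's user-jar mechanism, so (per the
# class-loader theory above) it is visible to the class loader the REPL and
# serialized closures use
$SPARK_INSTALL/bin/spark-shell --master local[2] \
  --jars /var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar

# Equivalent, via config instead of the CLI flag
$SPARK_INSTALL/bin/spark-shell --master local[2] \
  --conf spark.jars=/var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar

# Does not work here: the jar is only appended to the driver/executor JVM classpath
$SPARK_INSTALL/bin/spark-shell --master local[2] \
  --conf spark.driver.extraClassPath=/var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar \
  --conf spark.executor.extraClassPath=/var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar
{code}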
> Hudi Spark Bundle does not work when passed in extraClassPath option
> ---------------------------------------------------------------------
>
>                 Key: HUDI-260
>                 URL: https://issues.apache.org/jira/browse/HUDI-260
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Spark datasource, SparkSQL Support
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>
> On the EMR side we have the same findings. *a + b + c + d* work in the following cases:
> * The bundle jar (with databricks-avro shaded) is specified using the *--jars* or *spark.jars* option
> * The bundle jar (with databricks-avro shaded) is placed in the Spark home jars folder, i.e. */usr/lib/spark/jars*
>
> However, it does not work if the jar is specified using the *spark.driver.extraClassPath* and *spark.executor.extraClassPath* options, which is what EMR uses to configure external dependencies. Although we can drop the jar into the */usr/lib/spark/jars* folder, I am not sure that is recommended, because that folder is supposed to contain only the jars that ship with Spark. Extra dependencies from the user's side are better specified through the *extraClassPath* option.
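For completeness, a minimal sketch of how one might verify inside spark-shell that the bundle classes actually resolve once the jar is supplied. The table path is hypothetical, not from this issue:

{code}
// Fails with "object hudi is not a member of package org.apache" when the
// bundle is not visible to the class loader spark-shell uses, as in the session above
import org.apache.hudi.DataSourceReadOptions

// Hypothetical smoke test: read a Hudi table back through the datasource;
// "/tmp/hudi_trips_cow" is an assumed path, for illustration only
val df = spark.read.format("org.apache.hudi").load("/tmp/hudi_trips_cow/*/*/*/*")
df.count()
{code}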