[ https://issues.apache.org/jira/browse/HUDI-260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933015#comment-16933015 ]
Vinoth Chandar edited comment on HUDI-260 at 9/19/19 3:06 AM:
--------------------------------------------------------------

Tried to search around for this. It seems like any code that is used in a closure (i.e. code that Spark serializes from the driver to the executors) needs to be passed via --jars and not extraClassPath; I figure the two use different class loaders. Another theory is that if we use the Spark Java APIs, the jars need to be placed under the `jars` folder and that's the way to go. My guess is that the Java lambda -> Scala -> codegen path fails someplace when the jar is supplied via extraClassPath. I am still looking, but this does not seem like an issue with how we are bundling/shading (there are no Spark/Scala jars in the bundle).

Anyway, I am having trouble reproducing this error. For me, the jar is somehow not even getting picked up via extraClassPath (spark.jars works):

{code}
root@adhoc-2:/opt# cat /opt/spark/conf/spark-defaults.conf
...
spark.driver.extraClassPath /var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar
spark.executor.extraClassPath /var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar

root@adhoc-2:/opt# $SPARK_INSTALL/bin/spark-shell --master local[2] --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client --driver-memory 1G --executor-memory 3G --num-executors 1
19/09/19 02:54:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://adhoc-2:4040
Spark context available as 'sc' (master = local[2], app id = local-1568861680731).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.hudi.DataSourceReadOptions;
<console>:23: error: object hudi is not a member of package org.apache
       import org.apache.hudi.DataSourceReadOptions;
              ^

scala>
{code}

Can you give me a reproducible setup on the demo containers?
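For reference, a minimal sketch contrasting the two mechanisms discussed above. The bundle path is the one from the demo containers; the other spark-shell flags from the session are omitted for brevity:

{code}
# Works: the jar is distributed through Spark's user-jar mechanism, so (per the
# class-loader theory above) it is visible to the class loader the REPL and
# serialized closures use
$SPARK_INSTALL/bin/spark-shell --master local[2] \
  --jars /var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar

# Equivalent, via config instead of the CLI flag
$SPARK_INSTALL/bin/spark-shell --master local[2] \
  --conf spark.jars=/var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar

# Does not work here: the jar is only appended to the driver/executor JVM classpath
$SPARK_INSTALL/bin/spark-shell --master local[2] \
  --conf spark.driver.extraClassPath=/var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar \
  --conf spark.executor.extraClassPath=/var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar
{code}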
> Hudi Spark Bundle does not work when passed in extraClassPath option
> ---------------------------------------------------------------------
>
>                 Key: HUDI-260
>                 URL: https://issues.apache.org/jira/browse/HUDI-260
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Spark datasource, SparkSQL Support
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>
> On the EMR side we have the same findings. *a + b + c + d* work in the following cases:
> * The bundle jar (with databricks-avro shaded) is specified using the *--jars* or *spark.jars* option
> * The bundle jar (with databricks-avro shaded) is placed in the Spark home jars folder, i.e. */usr/lib/spark/jars*
>
> However, it does not work if the jar is specified using the *spark.driver.extraClassPath* and *spark.executor.extraClassPath* options, which is what EMR uses to configure external dependencies. Although we can drop the jar into the */usr/lib/spark/jars* folder, I am not sure that is recommended, because that folder is supposed to contain only the jars that ship with Spark. Extra dependencies from the user's side are better specified through the *extraClassPath* option.
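For completeness, a minimal sketch of how one might verify inside spark-shell that the bundle classes actually resolve once the jar is supplied. The table path is hypothetical, not from this issue:

{code}
// Fails with "object hudi is not a member of package org.apache" when the
// bundle is not visible to the class loader spark-shell uses, as in the session above
import org.apache.hudi.DataSourceReadOptions

// Hypothetical smoke test: read a Hudi table back through the datasource;
// "/tmp/hudi_trips_cow" is an assumed path, for illustration only
val df = spark.read.format("org.apache.hudi").load("/tmp/hudi_trips_cow/*/*/*/*")
df.count()
{code}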