[
https://issues.apache.org/jira/browse/SPARK-29574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Marcelo Masiero Vanzin resolved SPARK-29574.
--------------------------------------------
Fix Version/s: 3.0.0
Resolution: Fixed
Issue resolved by pull request 26493
[https://github.com/apache/spark/pull/26493]
> spark with user provided hadoop doesn't work on kubernetes
> ----------------------------------------------------------
>
> Key: SPARK-29574
> URL: https://issues.apache.org/jira/browse/SPARK-29574
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.4.4
> Reporter: Michał Wesołowski
> Priority: Major
> Fix For: 3.0.0
>
>
> When spark-submit is run with an image built from the "Hadoop free" Spark
> distribution plus a user-provided Hadoop, it fails on Kubernetes (the Hadoop
> libraries are not on Spark's classpath).
> I downloaded Spark [Pre-built with user-provided Apache
> Hadoop|https://www-us.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-without-hadoop.tgz].
>
> I created a docker image using
> [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh].
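For reference, a build with docker-image-tool.sh typically looks something like the following; the tag and the Dockerfile path are assumptions chosen to match the `spark-py:2.4.4-without-hadoop` image name used below:

```shell
# Hypothetical invocation, run from the extracted
# spark-2.4.4-bin-without-hadoop directory; -p selects the PySpark
# Dockerfile so that a spark-py image is produced alongside the base image.
./bin/docker-image-tool.sh \
  -t 2.4.4-without-hadoop \
  -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile \
  build
```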
>
>
> Based on this image (2.4.4-without-hadoop), I created another one with this
> Dockerfile:
> {code}
> FROM spark-py:2.4.4-without-hadoop
> ENV SPARK_HOME=/opt/spark/
> # This is needed for newer kubernetes versions
> ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/4.6.1/kubernetes-client-4.6.1.jar $SPARK_HOME/jars
> COPY spark-env.sh /opt/spark/conf/spark-env.sh
> RUN chmod +x /opt/spark/conf/spark-env.sh
> RUN wget -qO- https://www-eu.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz | tar xz -C /opt/
> ENV HADOOP_HOME=/opt/hadoop-3.2.1
> ENV PATH=${HADOOP_HOME}/bin:${PATH}
> {code}
> Contents of spark-env.sh:
> {code}
> #!/usr/bin/env bash
> export SPARK_DIST_CLASSPATH=$(hadoop classpath):$HADOOP_HOME/share/hadoop/tools/lib/*
> export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
> {code}
> spark-submit run with an image created this way fails, since spark-env.sh is
> overwritten by the [volume created when the pod
> starts|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L108].
> As a quick workaround I tried modifying the [entrypoint
> script|https://github.com/apache/spark/blob/ea8b5df47476fe66b63bd7f7bcd15acfb80bde78/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh]
> to run spark-env.sh during startup, after moving spark-env.sh to a different
> directory.
> The driver starts without issues in this setup; however, even though
> SPARK_DIST_CLASSPATH is set, the executor is constantly failing:
> {code}
> PS C:\Sandbox\projekty\roboticdrive-analytics\components\docker-images\spark-rda> kubectl logs rda-script-1571835692837-exec-12
> ++ id -u
> + myuid=0
> ++ id -g
> + mygid=0
> + set +e
> ++ getent passwd 0
> + uidentry=root:x:0:0:root:/root:/bin/ash
> + set -e
> + '[' -z root:x:0:0:root:/root:/bin/ash ']'
> + source /opt/spark-env.sh
> +++ hadoop classpath
> ++ export
> 'SPARK_DIST_CLASSPATH=/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoo++
>
> SPARK_DIST_CLASSPATH='/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoop/yarn:/opt/hadoop-3.2.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.2.1/share/hadoop/yarn/*:/opt/hadoop-3.2.1/share/hadoop/tools/lib/*'
> ++ export LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native
> ++ LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native
> ++ echo
> 'SPARK_DIST_CLASSPATH=/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoop/yarn:/opt/hadoop-3.2.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.2.1/share/hadoop/yarn/*:/opt/hadoop-3.2.1/share/hadoop/tools/lib/*'
> SPARK_DIST_CLASSPATH=/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoop/yarn:/opt/hadoop-3.2.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.2.1/share/hadoop/yarn/*:/opt/hadoop-3.2.1/share/hadoop/tools/lib/*
> ++ echo LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native
> + SPARK_K8S_CMD=executor
> LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native
> + case "$SPARK_K8S_CMD" in
> + shift 1
> + SPARK_CLASSPATH=':/opt/spark//jars/*'
> + env
> + sed 's/[^=]*=\(.*\)/\1/g'
> + sort -t_ -k4 -n
> + grep SPARK_JAVA_OPT_
> + readarray -t SPARK_EXECUTOR_JAVA_OPTS
> + '[' -n '' ']'
> + '[' -n '' ']'
> + PYSPARK_ARGS=
> + '[' -n '' ']'
> + R_ARGS=
> + '[' -n '' ']'
> + '[' '' == 2 ']'
> + '[' '' == 3 ']'
> + case "$SPARK_K8S_CMD" in
> + CMD=(${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP)
> + exec /sbin/tini -s -- /usr/lib/jvm/java-1.8-openjdk/bin/java -Xms3g -Xmx3g -cp ':/opt/spark//jars/*' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://coarsegrainedschedu...@rda-script-1571835692837-driver-svc.default.svc:7078 --executor-id 12 --cores 1 --app-id spark-33382c27389c4b289d79c06d5f631819 --hostname 10.244.2.24
> Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
>     at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:186)
>     at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
>     at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>     ... 3 more
> {code}
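The executor trace above shows why sourcing spark-env.sh alone is not enough: the entrypoint builds SPARK_CLASSPATH only from `$SPARK_HOME/jars` (hence `':/opt/spark//jars/*'`), so the exported SPARK_DIST_CLASSPATH is never handed to the JVM and Hadoop classes such as `org.apache.hadoop.fs.FSDataInputStream` cannot be resolved. A minimal sketch of the kind of change needed in entrypoint.sh — an illustration under assumed paths, not the actual patch from the linked PR:

```shell
#!/usr/bin/env bash
# Illustration only: append SPARK_DIST_CLASSPATH (as set by spark-env.sh)
# to the classpath the entrypoint passes to the JVM. Values below mimic
# what the failing executor pod reported; the hadoop path is a stand-in.
SPARK_HOME=/opt/spark/
SPARK_DIST_CLASSPATH='/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*'

# What the stock entrypoint does: only $SPARK_HOME/jars ends up on the path.
SPARK_CLASSPATH="$SPARK_CLASSPATH:${SPARK_HOME}/jars/*"

# The hypothetical fix: also include the distribution classpath when set,
# so the executor JVM can load org.apache.hadoop.fs.FSDataInputStream.
if [ -n "$SPARK_DIST_CLASSPATH" ]; then
  SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH"
fi
echo "$SPARK_CLASSPATH"
```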
--
This message was sent by Atlassian Jira
(v8.3.4#803005)