I’m seeing a strange issue on EMR, which I’ve also posted about here: <https://forums.aws.amazon.com/thread.jspa?threadID=248805&tstart=0&messageID=790954#790954>.
In brief, when I try to import a UDF I’ve defined, Python on the executors somehow fails to find Spark. This exact code works for me locally, and it works on our on-premises CDH cluster under YARN. This is the traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 318, in show
    print(self._jdf.showString(n, 20))
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o89.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-10-97-35-12.ec2.internal, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_000002/pyspark.zip/pyspark/worker.py", line 161, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_000002/pyspark.zip/pyspark/worker.py", line 91, in read_udfs
    _, udf = read_single_udf(pickleSer, infile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_000002/pyspark.zip/pyspark/worker.py", line 78, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_000002/pyspark.zip/pyspark/worker.py", line 54, in read_command
    command = serializer._read_with_length(file)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_000002/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
    return self.loads(obj)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_000002/pyspark.zip/pyspark/serializers.py", line 451, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_000002/splinkr/person.py", line 7, in <module>
    from splinkr.util import repartition_to_size
  File "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_000002/splinkr/util.py", line 34, in <module>
    containsNull=False,
  File "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_000002/pyspark.zip/pyspark/sql/functions.py", line 1872, in udf
    return UserDefinedFunction(f, returnType)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_000002/pyspark.zip/pyspark/sql/functions.py", line 1830, in __init__
    self._judf = self._create_judf(name)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_000002/pyspark.zip/pyspark/sql/functions.py", line 1834, in _create_judf
    sc = SparkContext.getOrCreate()
  File "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_000002/pyspark.zip/pyspark/context.py", line 310, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_000002/pyspark.zip/pyspark/context.py", line 115, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_000002/pyspark.zip/pyspark/context.py", line 259, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_000002/pyspark.zip/pyspark/java_gateway.py", line 77, in launch_gateway
    proc = Popen(command, stdin=PIPE, preexec_fn=preexec_func, env=env)
  File "/usr/lib64/python3.5/subprocess.py", line 950, in __init__
    restore_signals, start_new_session)
  File "/usr/lib64/python3.5/subprocess.py", line 1544, in _execute_child
    raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: './bin/spark-submit'

Does anyone have clues about what might be going on?

Nick