[
https://issues.apache.org/jira/browse/SPARK-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046301#comment-14046301
]
Matthew Farrellee commented on SPARK-1394:
------------------------------------------
fyi, this certainly looks like the waitpid(0, ...) cleanup handler reaping
more than it should: platform.system() shells out to uname via os.popen, and
by the time popen's close() goes to wait on its child, the handler has
already claimed the exit status, so close() fails with ECHILD
(IOError: [Errno 10] No child processes)
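here's a minimal standalone sketch of that theory (my own toy script, not
Spark's daemon code; run it with python2 on Linux, and note it may take a run
or two to hit the race): a SIGCHLD handler that reaps any child with
waitpid(0, ...) steals the exit status of the os.popen child behind
platform.uname(), so popen's close() has nothing left to wait on:

import errno
import os
import platform
import signal

def reap_everything(signum, frame):
    # overly greedy cleanup: claim *whatever* child exited, whether or
    # not anyone else is waiting on it for a result
    try:
        while True:
            pid, status = os.waitpid(0, os.WNOHANG)
            if pid == 0:
                break
    except OSError, e:
        if e.errno != errno.ECHILD:
            raise

signal.signal(signal.SIGCHLD, reap_everything)

# platform.system() -> uname() -> _syscmd_uname() -> os.popen('uname -p ...');
# by the time popen's close() calls waitpid, the handler above has usually
# already reaped the child, and close() raises
# IOError: [Errno 10] No child processes
print platform.system()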
also, fyi, if you just comment the handler out you'll start accumulating
defunct (zombie) worker processes, which is not good; the daemon still needs
to reap the workers it forks
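one direction for a fix, as a sketch of the idea only (the names below are
hypothetical bookkeeping, not Spark's actual daemon code): keep the reaping,
but have the handler claim only pids it actually forked, so a popen child
spawned inside user code keeps its exit status for whoever opened it:

import errno
import os

worker_pids = set()  # hypothetical bookkeeping: pids of workers we forked

def reap_workers(signum, frame):
    # reap only children we know we forked; leave any other child
    # (e.g. platform.py's os.popen child) for its own close()/wait
    for pid in list(worker_pids):
        try:
            done, status = os.waitpid(pid, os.WNOHANG)
        except OSError, e:
            if e.errno == errno.ECHILD:
                worker_pids.discard(pid)  # someone already reaped it
            continue
        if done == pid:
            worker_pids.discard(pid)  # exited; status collected

alternatively (or additionally), since signal dispositions survive fork(),
the forked worker could restore the default before running user code, e.g.
signal.signal(signal.SIGCHLD, signal.SIG_DFL), so user-level popen children
are never reaped behind platform.py's back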
> calling platform.system on worker raises IOError
> ------------------------------------------------
>
> Key: SPARK-1394
> URL: https://issues.apache.org/jira/browse/SPARK-1394
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 0.9.0
> Environment: Tested on Ubuntu and Linux, local and remote master,
> python 2.7.*
> Reporter: Idan Zalzberg
> Labels: pyspark
>
> A simple program that calls platform.system() on the worker fails most of
> the time (it occasionally succeeds, but only rarely).
> This is critical since many libraries call that method (e.g. boto).
> Here is the trace of the attempt to call that method:
> $ /usr/local/spark/bin/pyspark
> Python 2.7.3 (default, Feb 27 2014, 20:00:17)
> [GCC 4.6.3] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> 14/04/02 18:18:37 INFO Utils: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> 14/04/02 18:18:37 WARN Utils: Your hostname, qlika-dev resolves to a loopback address: 127.0.1.1; using 10.33.102.46 instead (on interface eth1)
> 14/04/02 18:18:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
> 14/04/02 18:18:38 INFO Slf4jLogger: Slf4jLogger started
> 14/04/02 18:18:38 INFO Remoting: Starting remoting
> 14/04/02 18:18:39 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:36640]
> 14/04/02 18:18:39 INFO Remoting: Remoting now listens on addresses: [akka.tcp://[email protected]:36640]
> 14/04/02 18:18:39 INFO SparkEnv: Registering BlockManagerMaster
> 14/04/02 18:18:39 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140402181839-919f
> 14/04/02 18:18:39 INFO MemoryStore: MemoryStore started with capacity 294.6 MB.
> 14/04/02 18:18:39 INFO ConnectionManager: Bound socket to port 43357 with id = ConnectionManagerId(10.33.102.46,43357)
> 14/04/02 18:18:39 INFO BlockManagerMaster: Trying to register BlockManager
> 14/04/02 18:18:39 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager 10.33.102.46:43357 with 294.6 MB RAM
> 14/04/02 18:18:39 INFO BlockManagerMaster: Registered BlockManager
> 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server
> 14/04/02 18:18:39 INFO HttpBroadcast: Broadcast server started at http://10.33.102.46:51803
> 14/04/02 18:18:39 INFO SparkEnv: Registering MapOutputTracker
> 14/04/02 18:18:39 INFO HttpFileServer: HTTP File server directory is /tmp/spark-9b38acb0-7b01-4463-b0a6-602bfed05a2b
> 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server
> 14/04/02 18:18:40 INFO SparkUI: Started Spark Web UI at http://10.33.102.46:4040
> 14/04/02 18:18:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/ '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 0.9.0
>       /_/
> Using Python version 2.7.3 (default, Feb 27 2014 20:00:17)
> Spark context available as sc.
> >>> import platform
> >>> sc.parallelize([1]).map(lambda x : platform.system()).collect()
> 14/04/02 18:19:17 INFO SparkContext: Starting job: collect at <stdin>:1
> 14/04/02 18:19:17 INFO DAGScheduler: Got job 0 (collect at <stdin>:1) with 1 output partitions (allowLocal=false)
> 14/04/02 18:19:17 INFO DAGScheduler: Final stage: Stage 0 (collect at <stdin>:1)
> 14/04/02 18:19:17 INFO DAGScheduler: Parents of final stage: List()
> 14/04/02 18:19:17 INFO DAGScheduler: Missing parents: List()
> 14/04/02 18:19:17 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at collect at <stdin>:1), which has no missing parents
> 14/04/02 18:19:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (PythonRDD[1] at collect at <stdin>:1)
> 14/04/02 18:19:17 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
> 14/04/02 18:19:17 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
> 14/04/02 18:19:17 INFO TaskSetManager: Serialized task 0.0:0 as 2152 bytes in 12 ms
> 14/04/02 18:19:17 INFO Executor: Running task ID 0
> PySpark worker failed with exception:
> Traceback (most recent call last):
>   File "/usr/local/spark/python/pyspark/worker.py", line 77, in main
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/usr/local/spark/python/pyspark/serializers.py", line 182, in dump_stream
>     self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/usr/local/spark/python/pyspark/serializers.py", line 117, in dump_stream
>     for obj in iterator:
>   File "/usr/local/spark/python/pyspark/serializers.py", line 171, in _batched
>     for item in iterator:
>   File "<stdin>", line 1, in <lambda>
>   File "/usr/lib/python2.7/platform.py", line 1306, in system
>     return uname()[0]
>   File "/usr/lib/python2.7/platform.py", line 1273, in uname
>     processor = _syscmd_uname('-p','')
>   File "/usr/lib/python2.7/platform.py", line 1030, in _syscmd_uname
>     rc = f.close()
> IOError: [Errno 10] No child processes
> 14/04/02 18:19:17 ERROR Executor: Exception in task ID 0
> org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "/usr/local/spark/python/pyspark/worker.py", line 77, in main
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/usr/local/spark/python/pyspark/serializers.py", line 182, in dump_stream
>     self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/usr/local/spark/python/pyspark/serializers.py", line 117, in dump_stream
>     for obj in iterator:
>   File "/usr/local/spark/python/pyspark/serializers.py", line 171, in _batched
>     for item in iterator:
>   File "<stdin>", line 1, in <lambda>
>   File "/usr/lib/python2.7/platform.py", line 1306, in system
>     return uname()[0]
>   File "/usr/lib/python2.7/platform.py", line 1273, in uname
>     processor = _syscmd_uname('-p','')
>   File "/usr/lib/python2.7/platform.py", line 1030, in _syscmd_uname
>     rc = f.close()
> IOError: [Errno 10] No child processes
>         at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
>         at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
>         at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
>         at org.apache.spark.scheduler.Task.run(Task.scala:53)
>         at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>         at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)