GitHub user mattf opened a pull request:
https://github.com/apache/spark/pull/2313
[SPARK-927] detect numpy at time of use
It is possible for numpy to be installed on the driver node but not on
the worker nodes. In such a case, probing for numpy in the RDDSampler
constructor, which runs on the driver, leads to failures on the workers,
since they cannot import numpy (see below).

The solution here is to detect numpy right before it is used on the
workers.
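
A rough sketch of the pattern (illustrative only, not the actual patch;
the real change is in pyspark/rddsampler.py, and the class name, seed
handling, and fallback details below are invented for illustration):

    import random

    class LazyNumpySampler(object):
        """Sketch: the numpy probe lives in the worker-side code path
        instead of __init__."""

        def __init__(self, fraction, seed=42):
            # no numpy probe here: __init__ runs on the driver, where
            # numpy may be present even though the workers lack it
            self._fraction = fraction
            self._seed = seed
            self._uniform = None

        def initRandomGenerator(self, split):
            # detect numpy at time of use, i.e. on the worker executing
            # this split, and fall back to the stdlib random module
            try:
                import numpy
                rng = numpy.random.RandomState(self._seed ^ split)
                self._uniform = rng.random_sample  # float in [0, 1)
            except ImportError:
                rng = random.Random(self._seed ^ split)
                self._uniform = rng.random

        def getUniformSample(self, split):
            if self._uniform is None:
                self.initRandomGenerator(split)
            return self._uniform()

        def func(self, split, iterator):
            # worker-side sampling loop, mirroring rddsampler.py's func()
            for obj in iterator:
                if self.getUniformSample(split) <= self._fraction:
                    yield obj

With this shape, a worker without numpy takes the except branch and
samples with Python's built-in generator, rather than raising the
ImportError shown below.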
Example code & error (numpy installed on the driver node only):
yum install -y numpy
pyspark
>>> sc.parallelize(range(10000)).sample(False, .1).collect()
...
14/09/07 10:50:01 WARN TaskSetManager: Lost task 4.0 in stage 0.0 (TID 4, node4): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/test/spark/dist/python/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/test/spark/dist/python/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/home/test/spark/dist/python/pyspark/serializers.py", line 127, in dump_stream
    for obj in iterator:
  File "/home/test/spark/dist/python/pyspark/serializers.py", line 185, in _batched
    for item in iterator:
  File "/home/test/spark/dist/python/pyspark/rddsampler.py", line 115, in func
    if self.getUniformSample(split) <= self._fraction:
  File "/home/test/spark/dist/python/pyspark/rddsampler.py", line 57, in getUniformSample
    self.initRandomGenerator(split)
  File "/home/test/spark/dist/python/pyspark/rddsampler.py", line 42, in initRandomGenerator
    import numpy
ImportError: No module named numpy

        org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
        org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
        org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
        org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:744)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mattf/spark SPARK-927
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2313.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2313
----
commit 7e8a11ce99bdb32bc86de96497036b731bd41e47
Author: Matthew Farrellee <[email protected]>
Date: 2014-09-07T15:26:55Z
[SPARK-927] detect numpy at time of use
----