[
https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15906501#comment-15906501
]
Eric O. LEBIGOT (EOL) edited comment on SPARK-18281 at 3/12/17 12:56 PM:
-------------------------------------------------------------------------
Thanks Liang-Chi. I now have a minimal example: the small example marked above
as working does not work on my machine:
{code}
~/Downloads/spark-2.1.0-bin-hadoop2.7/bin % PYSPARK_DRIVER_PYTHON=ipython2 ./pyspark
Python 2.7.13 (default, Dec 23 2016, 05:05:58)
Type "copyright", "credits" or "license" for more information.

IPython 5.3.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

17/03/12 12:46:29 WARN SparkContext: Support for Java 7 is deprecated as of Spark 2.0.0
2017-03-12 12:46:30.538 java[75598:10832148] Unable to load realm info from SCDynamicStore
17/03/12 12:46:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/12 12:46:33 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
17/03/12 12:46:48 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 2.7.13 (default, Dec 23 2016 05:05:58)
SparkSession available as 'spark'.

In [1]: df = spark.createDataFrame([[1],[2],[3]])
   ...: it = df.toLocalIterator()
   ...: row = next(it) # this should work
   ...: df.rdd.getNumPartitions() # returns `48`
   ...:
---------------------------------------------------------------------------
timeout                                   Traceback (most recent call last)
<ipython-input-1-828d0f5b5ce8> in <module>()
      1 df = spark.createDataFrame([[1],[2],[3]])
      2 it = df.toLocalIterator()
----> 3 row = next(it) # this should work
      4 df.rdd.getNumPartitions() # returns `48`

/Users/lebigot/Downloads/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.pyc in _load_from_socket(port, serializer)
    138     try:
    139         rf = sock.makefile("rb", 65536)
--> 140         for item in serializer.load_stream(rf):
    141             yield item
    142     finally:

/Users/lebigot/Downloads/spark-2.1.0-bin-hadoop2.7/python/pyspark/serializers.pyc in load_stream(self, stream)
    142         while True:
    143             try:
--> 144                 yield self._read_with_length(stream)
    145             except EOFError:
    146                 return

/Users/lebigot/Downloads/spark-2.1.0-bin-hadoop2.7/python/pyspark/serializers.pyc in _read_with_length(self, stream)
    159 
    160     def _read_with_length(self, stream):
--> 161         length = read_int(stream)
    162         if length == SpecialLengths.END_OF_DATA_SECTION:
    163             raise EOFError

/Users/lebigot/Downloads/spark-2.1.0-bin-hadoop2.7/python/pyspark/serializers.pyc in read_int(stream)
    553 
    554 def read_int(stream):
--> 555     length = stream.read(4)
    556     if not length:
    557         raise EOFError

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.pyc in read(self, size)
    382             # fragmentation issues on many platforms.
    383             try:
--> 384                 data = self._sock.recv(left)
    385             except error, e:
    386                 if e.args[0] == EINTR:

timeout: timed out
{code}
The number of partitions is 4.
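For reference, the exception looks consistent with the short socket timeout that _load_from_socket sets in python/pyspark/rdd.py: if I read the 2.1.0 code correctly, the 3-second timeout used for connecting also stays in force while reading the serialized rows, so any partition that the JVM takes longer than that to serve makes recv() raise. A simplified sketch of that logic (paraphrased, not the verbatim source):
{code}
# Simplified sketch (paraphrased, not verbatim) of _load_from_socket
# in python/pyspark/rdd.py for Spark 2.1.0:
import socket

def _load_from_socket(port, serializer):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(3)                 # 3 s timeout, also in force for reads
    sock.connect(("localhost", port))
    try:
        rf = sock.makefile("rb", 65536)
        # Every recv() underneath load_stream() is bound by the 3 s
        # timeout, so a partition that takes longer than that to be
        # served raises socket.timeout ("timed out"), as above.
        for item in serializer.load_stream(rf):
            yield item
    finally:
        sock.close()
{code}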
Configuration:
- latest macOS Sierra (10.12.3),
- IPython, etc. provided through MacPorts,
- no special Spark configuration except for the verbosity level,
- nothing else running on my machine (MacBook early 2015).
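In the meantime, a possible workaround (just a sketch, untested on this exact setup) is to skip toLocalIterator() and its result socket entirely, and pull the rows one partition at a time with collect():
{code}
# Hypothetical workaround: iterate over rows without toLocalIterator()
# by collecting one partition at a time, so only a single partition is
# materialized on the driver at once. Assumes `spark` is the
# SparkSession provided by the pyspark shell.
df = spark.createDataFrame([[1], [2], [3]])
rdd = df.rdd
for i in range(rdd.getNumPartitions()):
    # `i=i` binds the current partition index into the lambda's scope.
    rows = rdd.mapPartitionsWithIndex(
        lambda idx, part, i=i: part if idx == i else iter([])
    ).collect()
    for row in rows:
        print(row)
{code}
This keeps the driver memory footprint comparable (one partition at a time), at the cost of running one job per partition.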
> toLocalIterator yields time out error on pyspark2
> -------------------------------------------------
>
> Key: SPARK-18281
> URL: https://issues.apache.org/jira/browse/SPARK-18281
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.1
> Environment: Ubuntu 14.04.5 LTS
> Driver: AWS M4.XLARGE
> Slaves: AWS M4.4.XLARGE
> mesos 1.0.1
> spark 2.0.1
> pyspark
> Reporter: Luke Miner
> Assignee: Liang-Chi Hsieh
> Fix For: 2.0.3, 2.1.1
>
>
> I run the example straight out of the API docs for toLocalIterator and it
> gives a timeout exception:
> {code}
> from pyspark import SparkContext
> sc = SparkContext()
> rdd = sc.parallelize(range(10))
> [x for x in rdd.toLocalIterator()]
> {code}
> conf file:
> spark.driver.maxResultSize 6G
> spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G -XX:+HeapDumpOnOutOfMemoryError
> spark.executor.memory 16G
> spark.executor.uri foo/spark-2.0.1-bin-hadoop2.7.tgz
> spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
> spark.hadoop.fs.s3a.buffer.dir /raid0/spark
> spark.hadoop.fs.s3n.buffer.dir /raid0/spark
> spark.hadoop.fs.s3a.connection.timeout 500000
> spark.hadoop.fs.s3n.multipart.uploads.enabled true
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
> spark.hadoop.parquet.block.size 2147483648
> spark.hadoop.parquet.enable.summary-metadata false
> spark.jars.packages com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34
> spark.local.dir /raid0/spark
> spark.mesos.coarse false
> spark.mesos.constraints priority:1
> spark.network.timeout 600
> spark.rpc.message.maxSize 500
> spark.speculation false
> spark.sql.parquet.mergeSchema false
> spark.sql.planner.externalSort true
> spark.submit.deployMode client
> spark.task.cpus 1
> Exception here:
> {code}
> ---------------------------------------------------------------------------
> timeout                                   Traceback (most recent call last)
> <ipython-input-1-6319dd276401> in <module>()
>       2 sc = SparkContext()
>       3 rdd = sc.parallelize(range(10))
> ----> 4 [x for x in rdd.toLocalIterator()]
>
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in _load_from_socket(port, serializer)
>     140     try:
>     141         rf = sock.makefile("rb", 65536)
> --> 142         for item in serializer.load_stream(rf):
>     143             yield item
>     144     finally:
>
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in load_stream(self, stream)
>     137         while True:
>     138             try:
> --> 139                 yield self._read_with_length(stream)
>     140             except EOFError:
>     141                 return
>
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in _read_with_length(self, stream)
>     154 
>     155     def _read_with_length(self, stream):
> --> 156         length = read_int(stream)
>     157         if length == SpecialLengths.END_OF_DATA_SECTION:
>     158             raise EOFError
>
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in read_int(stream)
>     541 
>     542 def read_int(stream):
> --> 543     length = stream.read(4)
>     544     if not length:
>     545         raise EOFError
>
> /usr/lib/python2.7/socket.pyc in read(self, size)
>     378             # fragmentation issues on many platforms.
>     379             try:
> --> 380                 data = self._sock.recv(left)
>     381             except error, e:
>     382                 if e.args[0] == EINTR:
>
> timeout: timed out
> {code}