[
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Imran Rashid updated SPARK-26019:
---------------------------------
Description:
PySpark's accumulator server expects a secure py4j connection between Python
and the JVM. Spark normally creates a secure connection, but there is a
public API that allows you to pass in your own py4j connection (Zeppelin, at
least, uses it). When this happens, you get an error like:
{noformat}
pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in
authenticate_and_accum_updates()
{noformat}
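A minimal sketch of how this failure can arise, assuming the handler reads exactly len(auth_token) bytes (as in the traceback quoted in this issue) and that the token is None when the py4j connection was created without authentication. The names read_token and try_read, and the io.BytesIO stand-in for the socket stream, are hypothetical and not part of PySpark.

```python
import io


def read_token(rfile, auth_token):
    """Mimic the read performed by the accumulator server's handler."""
    # Same expression as in the traceback: fails when auth_token is None.
    return rfile.read(len(auth_token))


def try_read(auth_token):
    """Return the TypeError message, or None if the read succeeded."""
    stream = io.BytesIO(b"secret-token-and-more")  # stand-in for the socket
    try:
        read_token(stream, auth_token)
    except TypeError as exc:
        return str(exc)
    return None


secure_result = try_read("secret-token")  # token present: the read succeeds
insecure_result = try_read(None)          # no token: len(None) raises
```

With a token the read goes through; without one, the call to len() fails before anything is read from the stream.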
We should change PySpark to
1) warn loudly if a user passes in an insecure connection
1a) I'd like to suggest that we even error out, unless the user actively
opts in with a config

Note that SPARK-26349 will disallow insecure connections completely in 3.0.
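The proposal above could be sketched roughly as follows. This is only an illustration under stated assumptions: check_gateway_security, InsecureGatewayError, and the allow_insecure flag are hypothetical names, and the flag stands in for whatever config the opt-in would actually use.

```python
import warnings


class InsecureGatewayError(Exception):
    """Hypothetical error raised for a gateway without an auth token."""


def check_gateway_security(auth_token, allow_insecure=False):
    """Return True for a secure gateway; reject or warn otherwise.

    allow_insecure is a hypothetical stand-in for the opt-in config.
    """
    if auth_token is not None:
        return True
    if not allow_insecure:
        # Proposal 1a: error out unless the user has actively opted in.
        raise InsecureGatewayError(
            "py4j gateway has no auth token; pass a secure gateway "
            "or opt in to insecure connections explicitly")
    # Proposal 1: warn loudly when an insecure connection is accepted.
    warnings.warn("Using an insecure py4j gateway", RuntimeWarning)
    return False
```

Checking once, when the connection is handed over, would surface the problem before the accumulator server receives its first update.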
{code:python}
Exception happened during processing of request from ('127.0.0.1', 43418)
----------------------------------------
Traceback (most recent call last):
  File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 290, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 318, in process_request
    self.finish_request(request, client_address)
  File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 331, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 652, in __init__
    self.handle()
  File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 263, in handle
    poll(authenticate_and_accum_updates)
  File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 238, in poll
    if func():
  File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 251, in authenticate_and_accum_updates
    received_token = self.rfile.read(len(auth_token))
TypeError: object of type 'NoneType' has no len()
{code}
The error occurs here:
https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
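One way the failing read could report the problem more clearly is sketched below. This is not the code at the link above; authenticate is a hypothetical helper, and encoding the token to UTF-8 and comparing with hmac.compare_digest are assumptions made for the illustration.

```python
import hmac
import io


def authenticate(rfile, auth_token):
    """Check the token sent by the client against the expected one."""
    if auth_token is None:
        # Fail with an explicit message instead of len(None)'s TypeError.
        raise RuntimeError(
            "accumulator server has no auth token; was the py4j "
            "connection created without authentication?")
    expected = auth_token.encode("utf-8")
    received = rfile.read(len(expected))
    # Constant-time comparison of the two byte strings.
    return hmac.compare_digest(received, expected)


# io.BytesIO stands in for the request's socket stream.
ok = authenticate(io.BytesIO(b"token-then-updates"), "token")
```

A missing token then produces a message that points at the connection setup rather than at a len() call deep inside the handler.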
The PySpark code was running a simple pipeline:
binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
and then converting it to a DataFrame and running a count on it.
The error seems to be flaky: it did not happen on the next rerun.
> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()"
> in authenticate_and_accum_updates()
> ----------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.2, 2.4.0
> Reporter: Ruslan Dautkhanov
> Priority: Major
>