[
https://issues.apache.org/jira/browse/SPARK-45093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alice Sayutina updated SPARK-45093:
-----------------------------------
Description:
I've been testing UDFs that use code from another module, so AddArtifact is
necessary.
I got the following error:
{code:java}
Traceback (most recent call last):
File "/Users/alice.sayutina/db-connect-playground/udf2.py", line 5, in
<module>
spark.addArtifacts("udf2_support.py", pyfile=True)
File
"/Users/alice.sayutina/db-connect-venv/lib/python3.10/site-packages/pyspark/sql/connect/session.py",
line 744, in addArtifacts
self._client.add_artifacts(*path, pyfile=pyfile, archive=archive, file=file)
File
"/Users/alice.sayutina/db-connect-venv/lib/python3.10/site-packages/pyspark/sql/connect/client/core.py",
line 1582, in add_artifacts
self._artifact_manager.add_artifacts(*path, pyfile=pyfile, archive=archive,
file=file)
File
"/Users/alice.sayutina/db-connect-venv/lib/python3.10/site-packages/pyspark/sql/connect/client/artifact.py",
line 283, in add_artifacts
self._request_add_artifacts(requests)
File
"/Users/alice.sayutina/db-connect-venv/lib/python3.10/site-packages/pyspark/sql/connect/client/artifact.py",
line 259, in _request_add_artifacts
response: proto.AddArtifactsResponse = self._retrieve_responses(requests)
File
"/Users/alice.sayutina/db-connect-venv/lib/python3.10/site-packages/pyspark/sql/connect/client/artifact.py",
line 256, in _retrieve_responses
return self._stub.AddArtifacts(requests, metadata=self._metadata)
File
"/Users/alice.sayutina/db-connect-venv/lib/python3.10/site-packages/grpc/_channel.py",
line 1246, in __call__
return _end_unary_response_blocking(state, call, False, None)
File
"/Users/alice.sayutina/db-connect-venv/lib/python3.10/site-packages/grpc/_channel.py",
line 910, in _end_unary_response_blocking
raise _InactiveRpcError(state) # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "Exception iterating requests!"
debug_error_string = "None"
{code}
This gives no clue about what went wrong.
Only after considerable investigation did I find the problem: I was specifying
the wrong path, so the artifact failed to upload. Specifically, ArtifactManager
doesn't read the file immediately; it creates an iterator object that
incrementally generates the requests to send. This iterator is passed to grpc's
stream_unary, which consumes it and actually sends the requests. While grpc
catches the error (see above), it suppresses the underlying exception.
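The deferred failure can be reproduced without Spark or gRPC. The following is
a minimal sketch, not the actual pyspark or grpc code: create_requests and
consume are hypothetical stand-ins, and the path is assumed not to exist.
Calling the generator function succeeds; the file is only opened when the
consumer starts iterating, and the consumer here replaces the real error with a
generic one.

```python
from typing import Iterable, Iterator


def create_requests(path: str, chunk_size: int = 4) -> Iterator[bytes]:
    # Generator function: none of this body runs until the first next() call,
    # so a bad path is not detected when create_requests() is called.
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk


def consume(requests: Iterable[bytes]) -> int:
    # Hypothetical stand-in for a consumer that hides the original cause.
    try:
        return sum(len(r) for r in requests)
    except Exception:
        raise RuntimeError("Exception iterating requests!") from None


# Assumed-missing path: no error is raised on this line.
requests = create_requests("/nonexistent/udf2_support.py")

try:
    consume(requests)
except RuntimeError as e:
    # The FileNotFoundError raised inside the generator is no longer visible.
    message = str(e)
```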
I think we should improve the PySpark user experience. One possible fix is to
wrap ArtifactManager._create_requests with an iterator wrapper that logs the
exception to the Spark Connect logger, so that the user would see something
like the following, at least when debug mode is on.
{code:java}
FileNotFoundError: [Errno 2] No such file or directory:
'/Users/alice.sayutina/udf2_support.py' {code}
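A rough sketch of what such an iterator wrapper could look like follows. It is
an illustration under assumptions, not the actual fix: log_iteration_errors and
the logger name are hypothetical, and the real Spark Connect logger and log
level would have to be substituted. The wrapper passes items through unchanged
and logs any exception with its traceback before re-raising it, so the
consumer's behaviour stays the same.

```python
import logging
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

# Hypothetical logger name; the real Spark Connect logger is not shown here.
logger = logging.getLogger("artifact_upload_sketch")


def log_iteration_errors(requests: Iterable[T]) -> Iterator[T]:
    """Yield items from `requests`, logging any exception before re-raising."""
    try:
        yield from requests
    except Exception:
        # logger.exception records the message plus the active traceback.
        logger.exception("Failed while generating AddArtifacts requests")
        raise


def failing_requests() -> Iterator[bytes]:
    # Stand-in for a request generator that fails partway through.
    yield b"first"
    raise FileNotFoundError(2, "No such file or directory", "udf2_support.py")


wrapped = log_iteration_errors(failing_requests())
```

Iterating over wrapped yields the first item and then raises the same
FileNotFoundError, but the error is logged first, so it stays visible even if
the consumer later replaces it with a generic message.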
> AddArtifacts should give proper error messages if it fails
> ----------------------------------------------------------
>
> Key: SPARK-45093
> URL: https://issues.apache.org/jira/browse/SPARK-45093
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.5.0
> Reporter: Alice Sayutina
> Priority: Major
>