GitHub user aarondav opened a pull request:
https://github.com/apache/spark/pull/640
SPARK-1579: Clean up PythonRDD and avoid swallowing IOExceptions
This patch includes several cleanups to PythonRDD, focused on fixing
[SPARK-1579](https://issues.apache.org/jira/browse/SPARK-1579) cleanly. Listed
in order of approximate importance:
- The Python daemon waits for Spark to close the socket before exiting,
in order to avoid causing spurious IOExceptions in Spark's
`PythonRDD::WriterThread` (see the first sketch below this list).
- Removes the Python Monitor Thread, which polled for task cancellations
in order to kill the Python worker. Instead, we do this in the
onCompleteCallback, since this is guaranteed to be called during
cancellation.
- Adds a "completed" variable to TaskContext to avoid the issue noted in
[SPARK-1019](https://issues.apache.org/jira/browse/SPARK-1019), where
onCompleteCallbacks may be execution-order dependent.
Along with this, I removed the "context.interrupted = true" flag in
the onCompleteCallback.
- Extracts PythonRDD::WriterThread to its own class.
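To make the first bullet concrete, here is a minimal sketch of the shutdown
ordering it describes (not the actual daemon.py code; the `serve` helper and
the 4096-byte read size are illustrative):
```
import socket

def serve(sock: socket.socket) -> None:
    """Illustrative worker loop: process a task, then wait for the JVM
    to close the connection before exiting."""
    # ... read task input from `sock` and write results back ...
    sock.shutdown(socket.SHUT_WR)  # half-close: we are done writing
    # Drain until recv() returns b'', i.e. Spark closed its end. Exiting
    # before that point could reset the connection and surface a spurious
    # IOException in PythonRDD's writer thread on the JVM side.
    while sock.recv(4096):
        pass
    sock.close()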
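The third bullet removes an ordering dependence between onCompleteCallbacks.
Below is a hypothetical Python rendering of the pattern (the real code is
Scala, and names such as `mark_task_completed` and `writer_loop` are
illustrative, not Spark's API): the writer consults a `completed` flag set by
the framework before any callbacks run, instead of an `interrupted` flag set
by one particular callback that may not have run yet.
```
class TaskContext:
    """Illustrative stand-in for Spark's Scala TaskContext."""
    def __init__(self) -> None:
        self.completed = False
        self._on_complete = []

    def add_on_complete_callback(self, cb) -> None:
        self._on_complete.append(cb)

    def mark_task_completed(self) -> None:
        # Set the flag *before* running callbacks, so anything checking
        # `completed` sees a consistent answer regardless of callback order.
        self.completed = True
        for cb in self._on_complete:
            cb()

def writer_loop(context: TaskContext, write_partition) -> None:
    try:
        write_partition()
    except IOError:
        if context.completed:
            return  # expected: the task finished and tore down the stream
        raise       # genuine failure: propagate so the task can be retried
```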
Since this patch provides an alternative solution to
[SPARK-1019](https://issues.apache.org/jira/browse/SPARK-1019), I tested it
with
```
sc.textFile("latlon.tsv").take(5)
```
many times without error.
Additionally, in order to test the unswallowed exceptions, I performed
```
sc.textFile("s3n://<big file>").count()
```
and cut my internet during execution. Prior to this patch, we got the
"stdin writer exited early" message, which was unhelpful. Now, we get the
SocketExceptions propagated through Spark to the user and get proper (though
unsuccessful) task retries.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/aarondav/spark pyspark-io
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/640.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #640
----
commit c0c49da3754d664ab1de76ff8e6fc86a4d5d8714
Author: Aaron Davidson <[email protected]>
Date: 2014-05-05T02:26:03Z
SPARK-1579: Clean up PythonRDD and avoid swallowing IOExceptions
This patch includes several cleanups to PythonRDD, focused on
fixing SPARK-1579 cleanly. Listed in order of importance:
- The Python daemon waits for Spark to close the socket before exiting,
in order to avoid causing spurious IOExceptions in Spark's
PythonRDD::WriterThread.
- Removes the Python Monitor Thread, which polled for task cancellations
in order to kill the Python worker. Instead, we do this in the
onCompleteCallback, since this is guaranteed to be called during
cancellation.
- Adds a "completed" variable to TaskContext to avoid the issue noted in
SPARK-1019, where onCompleteCallbacks may be execution-order dependent.
Along with this, I removed the "context.interrupted = true" flag in
the onCompleteCallback.
- Extracts PythonRDD::WriterThread to its own class.
Since this patch provides an alternative solution to SPARK-1019, I tested
it with
```
sc.textFile("latlon.tsv").take(5)
```
many times without error.
Additionally, in order to test the unswallowed exceptions, I performed
```sc.textFile("s3n://<big file>").count()```
and cut my internet. Prior to this patch, we got the "stdin writer
exited early" message, which is unhelpful. Now, we get the
SocketExceptions propagated through Spark to the user and get
proper (but unsuccessful) task retries.
----