GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/15271
[SPARK-17666] Ensure that RecordReaders are closed by data source file scans (backport)
This is a branch-2.0 backport of #15245.
## What changes were proposed in this pull request?
This patch addresses a potential cause of resource leaks in data source
file scans. As reported in
[SPARK-17666](https://issues.apache.org/jira/browse/SPARK-17666), tasks that
do not fully consume their input may leak file handles and network connections
(e.g. S3 connections). Spark's `NewHadoopRDD` uses a TaskContext callback to
[close its record
readers](https://github.com/apache/spark/blame/master/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L208),
but the new data source file scans close their record readers only once their
iterators are fully consumed.
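
To illustrate the pattern (a minimal sketch, not the verbatim Spark source;
`openReaderWithCleanup` is a hypothetical helper name): tying the reader's
lifetime to the task, rather than to iterator exhaustion, means cleanup runs
even when a downstream operator such as `take(1)` abandons the iterator early.

```scala
import org.apache.hadoop.mapreduce.RecordReader
import org.apache.spark.TaskContext

// Hypothetical helper: open a record reader and tie its lifetime to the
// running task instead of relying on the iterator being drained.
def openReaderWithCleanup[K, V](open: () => RecordReader[K, V]): RecordReader[K, V] = {
  val reader = open()
  // TaskContext.get() returns null off the executor, hence the Option guard.
  Option(TaskContext.get()).foreach { ctx =>
    ctx.addTaskCompletionListener { _ =>
      reader.close() // fires on task success, failure, or kill
    }
  }
  reader
}
```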
This patch adds `close()` methods to `RecordReaderIterator` and
`HadoopFileLinesReader` and modifies all six implementations of
`FileFormat.buildReader()` to register task completion callbacks with the
`TaskContext`, guaranteeing that cleanup is eventually performed.
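
The shape of that change, as a simplified sketch based on the description
above (not the exact class from `org.apache.spark.sql.execution.datasources`,
which differs in detail): the iterator still closes its reader eagerly on
exhaustion, but it now also exposes an idempotent `close()` that the task
completion callback can invoke.

```scala
import java.io.Closeable

import org.apache.hadoop.mapreduce.RecordReader
import org.apache.spark.TaskContext

// Simplified sketch: an iterator that owns a RecordReader and can be closed
// either on exhaustion or via a task completion callback.
class RecordReaderIterator[T](
    private[this] var reader: RecordReader[_, T]) extends Iterator[T] with Closeable {
  private[this] var havePair = false
  private[this] var finished = false

  override def hasNext: Boolean = {
    if (!finished && !havePair) {
      finished = !reader.nextKeyValue()
      if (finished) {
        // Close eagerly once the input is exhausted; the task completion
        // callback then becomes a no-op.
        close()
      }
      havePair = !finished
    }
    !finished
  }

  override def next(): T = {
    if (!hasNext) throw new NoSuchElementException("End of stream")
    havePair = false
    reader.getCurrentValue
  }

  override def close(): Unit = {
    if (reader != null) {
      try reader.close()
      finally reader = null // null out so a second close() is a no-op
    }
  }
}
```

Each `buildReader()` implementation can then register the iterator so it is
closed when the task finishes, consumed or not, e.g.
`Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ => iter.close()))`.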
## How was this patch tested?
Tested manually for now.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark SPARK-17666-backport
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15271.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15271
----
commit c0621dbbd558e3715ea3b0c250913c1b94e34478
Author: Josh Rosen <[email protected]>
Date: 2016-09-28T00:52:57Z
[SPARK-17666] Ensure that RecordReaders are closed by data source file scans
Author: Josh Rosen <[email protected]>
Closes #15245 from JoshRosen/SPARK-17666-close-recordreader.
----