GitHub user frreiss opened a pull request:
https://github.com/apache/spark/pull/15027
[SPARK-17475] [STREAMING] Delete CRC files if the filesystem doesn't use
checksum files
## What changes were proposed in this pull request?
When the metadata logs for various parts of Structured Streaming are stored
on non-HDFS filesystems such as NFS or ext4, the HDFSMetadataLog class leaves
hidden HDFS-style checksum (CRC) files in the log directory, one file per
batch. This PR modifies HDFSMetadataLog so that it detects the use of a
filesystem that doesn't use CRC files and removes the CRC files.
## How was this patch tested?
Modified an existing test case in HDFSMetadataLogSuite to check whether
HDFSMetadataLog correctly removes CRC files on the local POSIX filesystem. Ran
the entire regression suite.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/frreiss/spark-fred fred-17475
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15027.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15027
----
commit 3a2a9c116659f526189de6b8b98fb6c92024a7a6
Author: frreiss <[email protected]>
Date: 2016-09-09T17:30:08Z
Delete CRC files when the filesystem doesn't support checksums.
commit 9ff89c0228c09764fa6444528050a35e823db0e6
Author: frreiss <[email protected]>
Date: 2016-09-09T17:30:52Z
Merge branch 'master' of https://github.com/apache/spark into fred-17475
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]