GitHub user ScrapCodes opened a pull request:
https://github.com/apache/spark/pull/22339
SPARK-17159 Significant speed up for running spark streaming against Object
store.
## What changes were proposed in this pull request?
Original work by Steve Loughran.
Based on #17745.
This is a minimal patch to FileInputDStream that reduces the number of file
status requests made when querying files. Each file status call costs 3+ HTTP
requests against an object store. This patch removes the need for those calls
by working with FileStatus objects directly.
This is a minor optimisation when working with filesystems, but significant
when working with object stores.
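To illustrate why this matters, here is a rough sketch of the access pattern, not the actual FileInputDStream code. It assumes a hypothetical in-memory `CountingStore` that counts one request per call; the class, its methods, and the two `new_files_*` helpers are invented for this example. On a real object store each status probe would cost several HTTP requests rather than one, so the gap would be wider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStatus:
    """Hypothetical stand-in for a file status record."""
    path: str
    modification_time: int


class CountingStore:
    """Hypothetical in-memory store that counts every request made to it."""

    def __init__(self, files):
        self._files = dict(files)
        self.requests = 0

    def list_paths(self):
        # One request: returns paths only, with no metadata.
        self.requests += 1
        return sorted(self._files)

    def list_status(self):
        # One request: returns a FileStatus for every entry.
        self.requests += 1
        return [FileStatus(p, t) for p, t in sorted(self._files.items())]

    def get_file_status(self, path):
        # One request per probed path.
        self.requests += 1
        return FileStatus(path, self._files[path])


def new_files_per_path(store, threshold):
    """List paths, then probe each path for its status (1 + N requests)."""
    return [p for p in store.list_paths()
            if store.get_file_status(p).modification_time > threshold]


def new_files_from_listing(store, threshold):
    """Filter on the FileStatus objects the listing already returned (1 request)."""
    return [s.path for s in store.list_status()
            if s.modification_time > threshold]


files = {"in/a.txt": 10, "in/b.txt": 20, "in/c.txt": 30}
per_path_store = CountingStore(files)
listing_store = CountingStore(files)

per_path = new_files_per_path(per_path_store, threshold=15)
from_listing = new_files_from_listing(listing_store, threshold=15)

print(per_path, per_path_store.requests)
print(from_listing, listing_store.requests)
```

Both helpers select the same files; only the request count differs, and the per-path variant grows linearly with the number of files in the monitored directory.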
## How was this patch tested?
Tests included. Existing tests pass.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ScrapCodes/spark PR_17745
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22339.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22339
----
commit 2fba9af597349fc023e04a845d1cfacfc3ab7d9e
Author: Steve Loughran <stevel@...>
Date: 2017-04-24T13:04:04Z
SPARK-17159 Significant speed up for running spark streaming against Object
store.
Based on #17745. Original work by Steve Loughran.
This is a minimal patch of changes to FileInputDStream to reduce File
status requests when querying files.
This is a minor optimisation when working with filesystems, but significant
when working with object stores.
Change-Id: I269d98902f615818941c88de93a124c65453756e
----