GitHub user sameeragarwal opened a pull request:
https://github.com/apache/spark/pull/12667
[SPARK-14467][SQL] Experiments: Async I/O in FileScanRDD
## What changes were proposed in this pull request?
Builds on https://github.com/apache/spark/pull/12243 to help benchmark the
improvements from interleaving CPU and I/O in FileScanRDD.
## How was this patch tested?
Existing tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sameeragarwal/spark filescan
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12667.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12667
----
commit cc6d98a17f6fa4249951802f981c2224d354e651
Author: Nong Li <[email protected]>
Date: 2016-04-05T20:36:34Z
[SPARK-14467][SQL] Interleave CPU and IO better in FileScanRDD.
This patch updates FileScanRDD to start reading from the next file while
the current file is being processed. The goal is to have better
interleaving of CPU and IO. It does this by launching a future which will
asynchronously start preparing the next file to be read.
The expectation is that the async task is IO intensive and the current file
(which includes all the computation for the query plan) is CPU intensive.
For some file formats, this would just mean opening the file and the
initial setup. For file formats like Parquet, this would mean doing all the
IO for all the columns.
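The prefetching idea described in this commit message can be illustrated with a minimal sketch. This is not the FileScanRDD code from the patch; it is a standalone Python analogue, and the names prefetching_scan and open_file are hypothetical. It assumes open_file does the I/O-heavy preparation for one file and returns an iterable of rows, and that the caller supplies an executor with at least one worker thread.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, TypeVar

Row = TypeVar("Row")


def prefetching_scan(
    paths: Iterable[str],
    open_file: Callable[[str], Iterable[Row]],  # hypothetical I/O-heavy reader
    executor: Executor,
) -> Iterator[Row]:
    """Yield rows file by file, preparing the next file in the background."""
    remaining = iter(paths)
    first = next(remaining, None)
    if first is None:
        return  # nothing to scan

    # Start preparing the first file right away.
    pending = executor.submit(open_file, first)
    for next_path in remaining:
        # Wait for the file that was being prepared ...
        current = pending.result()
        # ... and launch preparation of the following file before the
        # (CPU-heavy) consumer starts pulling rows from the current one.
        pending = executor.submit(open_file, next_path)
        yield from current
    # The last file has no successor to prefetch.
    yield from pending.result()


if __name__ == "__main__":
    def fake_open(path: str) -> list:
        # Stand-in for opening a file and reading its contents.
        return [f"{path}:{i}" for i in range(2)]

    with ThreadPoolExecutor(max_workers=1) as pool:
        for row in prefetching_scan(["a", "b"], fake_open, pool):
            print(row)
```

Only one file is prepared ahead of the one being consumed, so at most two files are held open at a time. Whether that depth is the right choice for the real patch is not stated in this message.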
commit bc11dd580a751b2e39694223ecbf1fa2b4a7bdc0
Author: Nong Li <[email protected]>
Date: 2016-04-07T23:26:50Z
Simplify and fix tests.
commit 0655e5ef7e7c9d216ddef06500fbfd683941056f
Author: Sameer Agarwal <[email protected]>
Date: 2016-04-25T19:01:38Z
Resolve conflicts
commit 8aebf9427e6630046ae297b38f964d0809c3d348
Author: Sameer Agarwal <[email protected]>
Date: 2016-04-25T19:07:39Z
restructure
commit 8799cc873900cf9e4c37012e7a6d607eeabfbdd5
Author: Sameer Agarwal <[email protected]>
Date: 2016-04-25T20:08:47Z
add nextIterator
commit f3a21672e94cdb67e2cb69d60af327cff0b2cf54
Author: Sameer Agarwal <[email protected]>
Date: 2016-04-25T20:36:20Z
cleanup
----