[GitHub] spark pull request: [SPARK-14467][SQL] Experiments: Async I/O in F...

sameeragarwal Mon, 25 Apr 2016 13:41:00 -0700

GitHub user sameeragarwal opened a pull request:

    https://github.com/apache/spark/pull/12667


    [SPARK-14467][SQL] Experiments: Async I/O in FileScanRDD

    ## What changes were proposed in this pull request?
    
    Builds on https://github.com/apache/spark/pull/12243 to help benchmark
    improvements by interleaving CPU and I/O in FileScanRDD.
    
    ## How was this patch tested?
    
    Existing Tests


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sameeragarwal/spark filescan

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12667.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12667
    
----
commit cc6d98a17f6fa4249951802f981c2224d354e651
Author: Nong Li <[email protected]>
Date:   2016-04-05T20:36:34Z

    [SPARK-14467][SQL] Interleave CPU and IO better in FileScanRDD.
    
    This patch updates FileScanRDD to start reading from the next file while
    the current file is being processed. The goal is to have better
    interleaving of CPU and IO. It does this by launching a future which will
    asynchronously start preparing the next file to be read. The expectation
    is that the async task is IO intensive and the current file (which
    includes all the computation for the query plan) is CPU intensive. For
    some file formats, this would just mean opening the file and the initial
    setup. For file formats like parquet, this would mean doing all the IO
    for all the columns.

commit bc11dd580a751b2e39694223ecbf1fa2b4a7bdc0
Author: Nong Li <[email protected]>
Date:   2016-04-07T23:26:50Z

    Simplify and fix tests.

commit 0655e5ef7e7c9d216ddef06500fbfd683941056f
Author: Sameer Agarwal <[email protected]>
Date:   2016-04-25T19:01:38Z

    Resolve conflicts

commit 8aebf9427e6630046ae297b38f964d0809c3d348
Author: Sameer Agarwal <[email protected]>
Date:   2016-04-25T19:07:39Z

    restructure

commit 8799cc873900cf9e4c37012e7a6d607eeabfbdd5
Author: Sameer Agarwal <[email protected]>
Date:   2016-04-25T20:08:47Z

    add nextIterator

commit f3a21672e94cdb67e2cb69d60af327cff0b2cf54
Author: Sameer Agarwal <[email protected]>
Date:   2016-04-25T20:36:20Z

    cleanup

----
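
The prefetching idea described in the first commit message above can be
sketched roughly as follows. This is not the code from the patch (which
lives in Spark's Scala sources); it is a minimal Python illustration of
the general technique, assuming a hypothetical `open_file` callable that
performs the I/O-heavy preparation for one file and returns its rows.
The names `prefetching_scan` and `demo_open` are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def prefetching_scan(
    paths: Iterable[str],
    open_file: Callable[[str], List[T]],
) -> Iterator[T]:
    """Yield rows file by file while preparing the next file in the background.

    `open_file` is a hypothetical stand-in for the I/O-heavy step (opening
    the file, or reading all column data for a columnar format).
    """
    path_iter = iter(paths)
    # A single worker is enough: only one file is prefetched at a time.
    with ThreadPoolExecutor(max_workers=1) as pool:
        first = next(path_iter, None)
        # Kick off I/O for the first file, if there is one.
        pending = pool.submit(open_file, first) if first is not None else None
        while pending is not None:
            rows = pending.result()  # block until the current file is ready
            upcoming = next(path_iter, None)
            # Start preparing the next file before consuming the current one.
            pending = (
                pool.submit(open_file, upcoming) if upcoming is not None else None
            )
            for row in rows:
                yield row  # the consumer's CPU-heavy work overlaps the I/O


def demo_open(path: str) -> List[str]:
    """Hypothetical reader used only for illustration: two rows per file."""
    return [f"{path}:{i}" for i in range(2)]


rows = list(prefetching_scan(["a", "b", "c"], demo_open))
```

Rows still come out in file order; the only change is that the I/O for
file N+1 is already in flight while the consumer works through file N.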


