GitHub user frreiss opened a pull request:

    https://github.com/apache/spark/pull/15162

    [SPARK-17386] [STREAMING] [WIP] Make polling rate adaptive

    ## What changes were proposed in this pull request?
    
    This change makes the scheduler in `StreamExecution` adapt its polling rate 
    to the rate of data arrival. I used the 
    [AIMD](https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease) 
    algorithm, also known as TCP congestion avoidance, because it is simple. I 
    renamed `spark.sql.streaming.pollingDelay` to 
    `spark.sql.streaming.minPollingDelay` and added a second parameter, 
    `spark.sql.streaming.maxPollingDelay`, to serve as an upper bound. The upper 
    bound is necessary because the data arrival rate can be bursty, with 
    infinite variance.
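
    For illustration, here is a minimal sketch of what an AIMD update on the 
    polling delay could look like. Only the two configuration names above come 
    from this PR; the class, method, and step sizes are hypothetical, and the 
    actual patch may use a different update rule.

    ```scala
    // Hypothetical sketch: AIMD applied to the polling rate, i.e. the delay
    // shrinks by a fixed step while data keeps arriving and doubles while the
    // source is idle. Not the actual code in this PR.
    class AdaptivePollingDelay(minDelayMs: Long, maxDelayMs: Long) {
      private var delayMs: Long = minDelayMs

      /** Call once per scheduling round, passing whether the last poll saw new data. */
      def nextDelay(foundNewData: Boolean): Long = {
        delayMs = if (foundNewData) {
          // Data is arriving: shrink the delay by a fixed step, floored at
          // spark.sql.streaming.minPollingDelay.
          math.max(minDelayMs, delayMs - minDelayMs)
        } else {
          // Source is idle: double the delay, capped at
          // spark.sql.streaming.maxPollingDelay, so an idle query polls less often.
          math.min(maxDelayMs, delayMs * 2)
        }
        delayMs
      }
    }
    ```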
    
    This approach works, but I'm not completely satisfied with the current 
    design. The blocking-system-call part of polling really ought to be 
    delegated to the sources themselves, so that the main scheduler thread is 
    never blocked for unpredictable amounts of time. Also, the polling rate 
    ought to adapt on a per-source basis rather than as a single global rate 
    per session. I'm leaving this PR marked as WIP for now.
    
    ## How was this patch tested?
    
    I tested three scenarios:
    * If no data is present, the scheduler quickly ramps the polling delay up 
      to `spark.sql.streaming.maxPollingDelay`.
    * If data arrives quickly, the scheduler keeps its polling delay pinned at 
      `spark.sql.streaming.minPollingDelay`.
    * If new batches arrive at a constant interval between the minimum and 
      maximum delay, the scheduler keeps its polling delay within a factor of 
      two of the actual arrival interval.

    The testing code is included in this PR. Note that the tests for the latter 
    two scenarios are disabled by default because they are somewhat 
    timing-dependent.
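
    As a rough, self-contained illustration of the first two scenarios (the 
    constant-interval case is timing-dependent, as noted above), one could 
    drive the hypothetical `AdaptivePollingDelay` sketch from earlier through a 
    series of idle and busy rounds:

    ```scala
    // Illustrative driver only; this is not the test code included in this PR.
    object PollingDelayScenarios {
      def run(label: String, dataArrives: Int => Boolean): Unit = {
        val poller = new AdaptivePollingDelay(minDelayMs = 10L, maxDelayMs = 1000L)
        val finalDelay = (1 to 20).map(round => poller.nextDelay(dataArrives(round))).last
        println(s"$label: polling delay after 20 rounds = ${finalDelay} ms")
      }

      def main(args: Array[String]): Unit = {
        run("no data arrives", _ => false)  // delay ramps up quickly to the 1000 ms cap
        run("data every round", _ => true)  // delay stays pinned at the 10 ms minimum
      }
    }
    ```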

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/frreiss/spark-fred fred-17386a

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15162.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15162
    
----
commit 54a8272f1810514ec0a1e92e222359f505c5e58b
Author: frreiss <frre...@us.ibm.com>
Date:   2016-09-10T05:15:22Z

    Made polling rate adaptive.

commit 24a1f74ff5c401e39858d887212fe4acc5b17bca
Author: frreiss <frre...@us.ibm.com>
Date:   2016-09-14T22:04:19Z

    Added test case and corrected some bugs.

commit 9b44439d784e51ed75f4476b4432d13155dcdfb0
Author: frreiss <frre...@us.ibm.com>
Date:   2016-09-14T22:26:59Z

    Merge branch 'master' of https://github.com/apache/spark into fred-17386a

commit 4439d07e79ca48b5037ecb50017013c75172f6af
Author: frreiss <frre...@us.ibm.com>
Date:   2016-09-20T16:09:29Z

    Merge branch 'master' of https://github.com/apache/spark into fred-17386a

----

