GitHub user frreiss opened a pull request: https://github.com/apache/spark/pull/15162
[SPARK-17386] [STREAMING] [WIP] Make polling rate adaptive ## What changes were proposed in this pull request? This change makes the scheduler in `StreamExecution` adjust its rate of polling to the rate of data arrival. I used the [AIMD](https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease) algorithm, aka TCP congestion avoidance, because it is simple. I renamed `spark.sql.streaming.pollingDelay` to `spark.sql.streaming.minPollingDelay` and added a second parameter `spark.sql.streaming.maxPollingDelay` to serve as an upper bound. This upper bound is necessary because the data arrival rate could be bursty, with infinite variance. This approach works, but I'm not completely satisfied with the current design. The blocking-system-call part of polling really ought to be delegated to the sources themselves, so that the main scheduler thread won't be blocked for unpredictable amounts of time. Also, the polling rate ought to be adaptive on a per-source basis, instead of there being a single global rate per session. Leaving this PR marked as WIP for now. ## How was this patch tested? I tested for three scenarios: * If data is not present, the scheduler ramps up the polling delay quickly to `spark.sql.streaming.maxPollingDelay` * If data arrives quickly, the scheduler keeps its polling delay pinned at `spark.sql.streaming.minPollingDelay` * If new batches arrive at a constant interval between the minimum and maximum delay, the scheduler keeps its polling delay within a factor of two of the actual arrival rate. The testing code is included in this PR. Note that the test cases for the two latter cases are disabled by default, because they are somewhat timing dependent. You can merge this pull request into a Git repository by running: $ git pull https://github.com/frreiss/spark-fred fred-17386a Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15162.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15162 ---- commit 54a8272f1810514ec0a1e92e222359f505c5e58b Author: frreiss <frre...@us.ibm.com> Date: 2016-09-10T05:15:22Z Made polling rate adaptive. commit 24a1f74ff5c401e39858d887212fe4acc5b17bca Author: frreiss <frre...@us.ibm.com> Date: 2016-09-14T22:04:19Z Added test case and corrected some bugs. commit 9b44439d784e51ed75f4476b4432d13155dcdfb0 Author: frreiss <frre...@us.ibm.com> Date: 2016-09-14T22:26:59Z Merge branch 'master' of https://github.com/apache/spark into fred-17386a commit 4439d07e79ca48b5037ecb50017013c75172f6af Author: frreiss <frre...@us.ibm.com> Date: 2016-09-20T16:09:29Z Merge branch 'master' of https://github.com/apache/spark into fred-17386a ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org