Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/21366#discussion_r194574775
--- Diff: pom.xml ---
@@ -760,6 +760,12 @@
<version>1.10.19</version>
<scope>test</scope>
</dependency>
+ <dependency>
--- End diff ---
> One question is how about performance at scale when you get events from
hundreds of executors at once; which framework should work best? Should we
worry about this?
We already run into this problem, since we open a Watch that streams all
events down anyway. In any implementation where we want events to be processed
at different intervals, there needs to be some buffering, or else we choose to
ignore some events and only look at the most up-to-date snapshot at the given
intervals. As discussed in
https://github.com/apache/spark/pull/21366#discussion_r194181797, we really
want to process as many of the events we receive as possible, so we're stuck
with buffering somewhere, and regardless of which observables or reactive
programming framework we pick, we still effectively have to store `O(E)`
items, `E` being the number of events. Aside from the buffering, we'd also
need to consider the scale of the stream of events flowing over the persistent
HTTP connection backing the Watch. In this regard we are no different from the
other custom controllers in the Kubernetes ecosystem, which have to manage
large numbers of pods.
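
For illustration only (this is not code from the PR), a minimal sketch of the
buffering pattern described above: events arriving from the Watch callback are
appended to an unbounded queue (hence `O(E)` storage in the worst case) and a
scheduled task drains the queue at a fixed interval, so no event is dropped
between processing rounds. `PodEvent`, `BufferedEventProcessor`, and the
interval value are all hypothetical names, not part of this change.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the pod updates streamed down by the Watch.
case class PodEvent(podName: String, phase: String)

class BufferedEventProcessor(intervalMillis: Long)(process: Seq[PodEvent] => Unit) {
  // Unbounded buffer: grows with the number of events received between drains.
  private val buffer = new ConcurrentLinkedQueue[PodEvent]()
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  // Called from the Watch callback thread for every incoming event.
  def onEvent(event: PodEvent): Unit = buffer.add(event)

  def start(): Unit = {
    scheduler.scheduleWithFixedDelay(new Runnable {
      override def run(): Unit = {
        // Drain everything that accumulated since the last interval.
        val drained = ArrayBuffer.empty[PodEvent]
        var next = buffer.poll()
        while (next != null) {
          drained += next
          next = buffer.poll()
        }
        if (drained.nonEmpty) {
          process(drained.toSeq)
        }
      }
    }, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS)
  }

  def stop(): Unit = scheduler.shutdown()
}
```

Whatever framework sits on top, the storage cost between processing rounds is
the same as this sketch's queue; the framework choice mostly affects ergonomics
rather than the fundamental buffering requirement.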
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]