Thanks Fred for the detailed reply. The stability points are
especially interesting as a goal for the streaming component in Spark.
In terms of next steps, one approach that might be helpful is trying
to create benchmarks or situations that mimic real-life workloads and
then we can work on isolating specific changes that are required etc.
It'd also be great to hear other approaches / next steps to concretize
some of these goals.
On Thu, Oct 13, 2016 at 8:39 AM, Fred Reiss <freiss....@gmail.com> wrote:
> On Tue, Oct 11, 2016 at 11:02 AM, Shivaram Venkataraman
> <shiva...@eecs.berkeley.edu> wrote:
>> Could you expand a little bit more on stability ? Is it just bursty
>> workloads in terms of peak vs. average throughput ? Also what level of
>> latencies do you find users care about ? Is it on the order of 2-3
>> seconds vs. 1 second vs. 100s of milliseconds ?
> Regarding stability, I've seen two levels of concrete requirements.
> The first is "don't bring down my Spark cluster". That is to say, regardless
> of the input data rate, Spark shouldn't thrash or crash outright. Processing
> may lag behind the data arrival rate, but the cluster should stay up and
> remain fully functional.
> The second level is "don't bring down my application". A common use for
> streaming systems is to handle heavyweight computations that are part of a
> larger application, like a web application, a mobile app, or a plant control
> system. For example, an online application for car insurance might need to
> do some pretty involved machine learning to produce an accurate quote and
> suggest good upsells to the customer. If the heavyweight portion times out,
> the whole application times out, and you lose a customer.
> In terms of bursty vs. non-bursty, the "don't bring down my Spark cluster
> case" is more about handling bursts, while the "don't bring down my
> application" case is more about delivering acceptable end-to-end response
> times under typical load.
> Regarding latency: One group I talked to mentioned requirements in the
> 100-200 msec range, driven by the need to display a web page on a browser or
> mobile device. Another group in the Internet of Things space mentioned times
> ranging from 5 seconds to 30 seconds throughout the conversation. But most
> people I've talked to have been pretty vague about specific numbers.
> My impression is that these groups are not motivated by anxiety about
> meeting a particular latency target for a particular application. Rather,
> they want to make low latency the norm so that they can stop having to think
> about latency. Today, low latency is a special requirement of special
> applications. But that policy imposes a lot of hidden costs. IT architects
> have to spend time estimating the latency requirements of every application
> and lobbying for special treatment when those requirements are strict.
> Managers have to spend time engineering business processes around latency.
> Data scientists have to spend time packaging up models and negotiating how
> those models will be shipped over to the low-latency serving tier. And
> customers who are accustomed to Google and smartphones end up with an
> experience that is functional but unsatisfying.
> It's best to think of latency as a sliding scale. A given level of latency
> imposes a given level of cost enterprise-wide. Someone who is making a
> decision on middleware policy will balance this cost against other costs:
> How much does it cost to deploy the middleware? How much does it cost to
> train developers to use the system? The winner will be the system that
> minimizes the overall cost.
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org