On Tue, Oct 11, 2016 at 11:02 AM, Shivaram Venkataraman wrote:
> Could you expand a little bit more on stability ? Is it just bursty
> workloads in terms of peak vs. average throughput ? Also what level of
> latencies do you find users care about ? Is it on the order of 2-3
> seconds vs. 1 second vs. 100s of milliseconds ?
Regarding stability, I've seen two levels of concrete requirements.
The first is "don't bring down my Spark cluster". That is to say,
regardless of the input data rate, Spark shouldn't thrash or crash
outright. Processing may lag behind the data arrival rate, but the cluster
should stay up and remain fully functional.
The second level is "don't bring down my application". A common use for
streaming systems is to handle heavyweight computations that are part of a
larger application, like a web application, a mobile app, or a plant
control system. For example, an online application for car insurance might
need to do some pretty involved machine learning to produce an accurate
quote and suggest good upsells to the customer. If the heavyweight portion
times out, the whole application times out, and you lose a customer.
In terms of bursty vs. non-bursty, the "don't bring down my Spark cluster"
case is more about handling bursts, while the "don't bring down my
application" case is more about delivering acceptable end-to-end response
times under typical load.
Regarding latency: One group I talked to mentioned requirements in the
100-200 msec range, driven by the need to display a web page in a browser
or on a mobile device. Another group in the Internet of Things space cited
figures ranging from 5 to 30 seconds over the course of the conversation. But
most people I've talked to have been pretty vague about specific numbers.
My impression is that these groups are not motivated by anxiety about
meeting a particular latency target for a particular application. Rather,
they want to make low latency the norm so that they can stop having to
think about latency. Today, low latency is a special requirement of special
applications. But that policy imposes a lot of hidden costs. IT architects
have to spend time estimating the latency requirements of every application
and lobbying for special treatment when those requirements are strict.
Managers have to spend time engineering business processes around latency.
Data scientists have to spend time packaging up models and negotiating how
those models will be shipped over to the low-latency serving tier. And
customers who are accustomed to Google and smartphones end up with an
experience that is functional but unsatisfying.
It's best to think of latency as a sliding scale. A given level of latency
imposes a given level of cost enterprise-wide. Someone who is making a
decision on middleware policy will balance this cost against other costs:
How much does it cost to deploy the middleware? How much does it cost to
train developers to use the system? The winner will be the system that
minimizes the overall cost.
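
To make the sliding-scale argument concrete, here is a rough sketch of the comparison a decision maker might run. Everything in it is hypothetical: the Option class, the two candidate systems, and all of the cost figures are made up for illustration and do not come from any real deployment. The only point is that latency cost is one term in a sum, and the option with the smallest sum wins.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One candidate middleware choice. All fields are hypothetical cost units."""
    name: str
    latency_cost: float     # enterprise-wide cost imposed by this option's latency
    deployment_cost: float  # cost to deploy the middleware
    training_cost: float    # cost to train developers to use the system

    def total_cost(self) -> float:
        # The overall cost is simply the sum of the individual cost terms.
        return self.latency_cost + self.deployment_cost + self.training_cost


def cheapest(options):
    """Return the option with the lowest overall cost."""
    return min(options, key=lambda o: o.total_cost())


# Made-up numbers: the low-latency system costs more to deploy and learn,
# but imposes far less latency-related cost across the enterprise.
options = [
    Option("low-latency system", latency_cost=1.0, deployment_cost=5.0, training_cost=4.0),
    Option("high-latency system", latency_cost=8.0, deployment_cost=2.0, training_cost=1.0),
]

winner = cheapest(options)
print(winner.name, winner.total_cost())
```

With these invented figures the low-latency system wins by a small margin. Raising its deployment or training cost by a couple of units flips the result, which is why latency cannot be evaluated in isolation.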