Thanks for those insights, Ying.

On Thu, Nov 7, 2019 at 9:26 PM Ying Zheng <yi...@uber.com.invalid> wrote:
> > Thanks, I missed that point. However, there's still a point at which the
> > consumer fetches start getting served from remote storage (even if that
> > point isn't as soon as the local log retention time/size). This represents
> > a kind of performance cliff edge and what I'm really interested in is how
> > easy it is for a consumer which falls off that cliff to catch up and so its
> > fetches again come from local storage. Obviously this can depend on all
> > sorts of factors (like production rate, consumption rate), so it's not
> > guaranteed (just like it's not guaranteed for Kafka today), but this would
> > represent a new failure mode.
>
> As I explained in my last mail, it's a very rare case that a consumer
> needs to read remote data. In our experience at Uber, this only happens
> when the consumer service has had an outage for several hours.
>
> There is not a "performance cliff" as you assume. The remote storage is
> even faster than local disks in terms of bandwidth. Reading from remote
> storage is going to have higher latency than local disk. But since the
> consumer is catching up on several hours of data, it's not sensitive to
> sub-second latency, and each remote read request will read a large amount
> of data, which makes the overall performance better than reading from
> local disks.
>
> > Another aspect I'd like to understand better is the effect that serving
> > fetch requests from remote storage has on the broker's network
> > utilization. If we're just trimming the amount of data held locally
> > (without increasing the overall local+remote retention), then we're
> > effectively trading disk bandwidth for network bandwidth when serving
> > fetch requests from remote storage (which I understand to be a good
> > thing, since brokers are often/usually disk bound). But if we're
> > increasing the overall local+remote retention then it's more likely that
> > network itself becomes the bottleneck.
> > I appreciate this is all rather hand wavy, I'm just trying to understand
> > how this would affect broker performance, so I'd be grateful for any
> > insights you can offer.
>
> Network bandwidth is a function of produce speed; it has nothing to do with
> remote retention. As long as the data is shipped to remote storage, you can
> keep the data there for 1 day or 1 year or 100 years, and it doesn't consume
> any network resources.
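
To check that I've understood the arithmetic in the two points above, here is a rough back-of-envelope sketch. It is only a toy model under my own assumptions: the function names and all the numbers (50 MB/s produce rate, 200 MB/s catch-up fetch rate, a 3 hour outage) are made up for illustration and don't come from the KIP or from any measurement. It also deliberately ignores the traffic needed to serve fetches from remote storage, which would scale with how much old data consumers actually read.

```python
# Toy model only: all names and numbers are illustrative assumptions,
# not part of the tiered-storage proposal or any Kafka API.

def remote_upload_rate(produce_mb_s: float) -> float:
    """Steady-state broker -> remote-storage traffic in MB/s.

    Each produced byte is shipped to remote storage once, so this depends
    only on the produce rate, not on how long the data is kept afterwards.
    """
    return produce_mb_s


def remote_mb_stored(produce_mb_s: float, retention_s: float) -> float:
    """Data held in remote storage in MB: this is what retention scales."""
    return produce_mb_s * retention_s


def catch_up_seconds(lag_mb: float, fetch_mb_s: float, produce_mb_s: float) -> float:
    """Time for a lagging consumer to reach the end of the log.

    The lag shrinks at (fetch rate - produce rate). If the consumer cannot
    fetch faster than producers write, it never catches up.
    """
    if fetch_mb_s <= produce_mb_s:
        return float("inf")
    return lag_mb / (fetch_mb_s - produce_mb_s)


DAY = 24 * 3600
produce = 50.0  # MB/s, assumed

# Upload traffic is the same for 1 day and 1 year of remote retention;
# only the stored volume changes.
for retention in (DAY, 365 * DAY):
    print(retention // DAY, "days:",
          remote_upload_rate(produce), "MB/s uploaded,",
          remote_mb_stored(produce, retention), "MB stored")

# A consumer that was down for 3 hours and then fetches at 200 MB/s.
lag = produce * 3 * 3600
print("catch-up time (s):", catch_up_seconds(lag, 200.0, produce))
```

In this model the catch-up question reduces to whether the consumer's sustained fetch rate from remote storage exceeds the produce rate, which matches the caveat in my earlier mail that it isn't guaranteed, just as it isn't for Kafka today.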