Hi Ivan,

Thanks for sharing these insights!

Autoscaling is exactly one of my motivations for bringing this topic up.
I understand that auto-recovery is not perfect at the moment, but it is
an important component that maintains the core invariants of a BookKeeper
cluster, so I think we should keep improving it until we find a better
replacement.

I'm thinking we could store the "draining" state as a special entry in
the properties of
`BookieServiceInfo <https://github.com/apache/bookkeeper/blob/97818f5123999396e66f5246420d3c7e3d25f53d/bookkeeper-server/src/main/java/org/apache/bookkeeper/discover/BookieServiceInfo.java#L43>`,
and let the auditor check the properties of read-only bookies to see
whether a bookie needs to be drained and treated as unavailable.

The bookie state API
<https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/server/http/service/BookieStateService.java>
might also be enhanced to support updating and persisting the state of a
bookie dynamically. A new API might also be needed to check whether all
ledgers have been moved off a "draining" bookie.

Do you think these changes make sense?

Best regards,
Yang Yang

Ivan Kelly <iv...@apache.org> wrote on Mon, Sep 6, 2021 at 7:45 PM:

> Hi Yang,
>
> This is something we've been thinking about internally. It's
> especially important if we want to implement auto scaling for bookies.
>
> I'm not sure we need a "draining" state as such. Or at least, the
> draining state doesn't need to be at the same level as "read-only".
> "draining" is only interesting to the entity moving data off of the
> bookie, so the auditor could keep a record that it is draining bookie
> X, but the cluster as a whole doesn't need to care about it.
>
> From a logical point of view, decommissioning/scale down should be a
> matter of
> 1. Mark bookie as read only so that no new data is added to it
> 2. Wait for the bookie to hold no live data
>
> The most important thing, and a thing that we have really missed in
> the past is the ability to mark a running bookie as read-only.
> This should be trivial to implement, though the split between
> bookie-shell and bk-cli is a bit of a mess right now. I think there is
> a REST api endpoint, but it is non-persistent.
>
> Once the bookie is read-only, we have the following options for
> getting live data off of the bookie.
> 1. Wait for pulsar retention period to pass (currently available, but
>    can take a long time).
> 2. Use tiered storage to move older data off. This is currently
>    implemented as a pulsar feature, but I think it would make sense to
>    move it down to the bookie layer.
> 3. Use auditor/auto recovery to move the data.
>
> Personally I'm not a fan of the auditor/auto recovery stuff currently
> in bookkeeper. Any time we've relied on it in the past it has
> blown up on us or moved too slowly to be very useful. Part of the
> problem is that it conflates data integrity checking with bookie
> decommissioning. Data integrity is concerned with ensuring a bookie
> has the data zookeeper says it has.
> Decommissioning is moving data off of a bookie. One should be
> naturally cheap, the other expensive. With autorecovery, they both
> end up expensive.
>
> A problem with both tiered storage and autorecovery for
> decommissioning is that they need to move data and so induce load in
> the system. However, this load isn't well quantified, so they use
> manually set rate limiting, which doesn't respond to the rest of the
> load in the system. The first thing we need to do, and we are actively
> working on this, is to generate accurate utilization and saturation
> metrics for bookies. Once we have these metrics, the entity copying
> the data can do so much faster without impacting production
> traffic.
>
> Cheers,
> Ivan
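
To make the proposal a bit more concrete, here is a minimal sketch of the two checks discussed in this thread: whether a read-only bookie is flagged for draining, and whether a draining bookie no longer holds any live ledgers. Everything in it is an assumption for illustration only: `BookieView`, the `"draining"` property key, and the `liveLedgers` counter are hypothetical stand-ins, not existing BookKeeper classes, constants, or APIs.

```java
import java.util.HashMap;
import java.util.Map;

public class DrainSketch {
    /** Hypothetical property key; not an existing BookKeeper constant. */
    public static final String DRAINING_KEY = "draining";

    /** Hypothetical stand-in for what the auditor knows about one bookie. */
    public static final class BookieView {
        public boolean readOnly;
        public int liveLedgers;
        // Models the free-form properties map published with the bookie's service info.
        public final Map<String, String> properties = new HashMap<>();

        public BookieView(int liveLedgers) {
            this.liveLedgers = liveLedgers;
        }
    }

    /** A bookie is drained only if it is read-only AND carries the draining flag. */
    public static boolean shouldDrain(BookieView bookie) {
        // Boolean.parseBoolean(null) is false, so a missing key means "not draining".
        return bookie.readOnly
                && Boolean.parseBoolean(bookie.properties.get(DRAINING_KEY));
    }

    /** Draining is complete once the flagged bookie holds no live ledgers. */
    public static boolean isDrained(BookieView bookie) {
        return shouldDrain(bookie) && bookie.liveLedgers == 0;
    }

    public static void main(String[] args) {
        BookieView bookie = new BookieView(2);

        // Step 1: mark the bookie read-only and flag it for draining.
        bookie.readOnly = true;
        bookie.properties.put(DRAINING_KEY, "true");
        System.out.println("should drain: " + shouldDrain(bookie));

        // Step 2: the ledgers get moved off (by whatever mechanism), then we re-check.
        bookie.liveLedgers = 0;
        System.out.println("drained: " + isDrained(bookie));
    }
}
```

The sketch deliberately says nothing about how the data is moved. It only shows that the flag can live next to the read-only state without becoming a cluster-wide state of its own.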