Hi Ivan,

Thanks for sharing these insights!

Autoscaling is exactly one of my motivations for bringing this topic up. I
understand that auto-recovery is not perfect at the moment, but it's an
important component that maintains the core invariants of a BookKeeper
cluster, so I think we should keep improving it until we find a better
replacement.

I'm thinking we could put the "draining" state as a special member in
the properties of `BookieServiceInfo`
<https://github.com/apache/bookkeeper/blob/97818f5123999396e66f5246420d3c7e3d25f53d/bookkeeper-server/src/main/java/org/apache/bookkeeper/discover/BookieServiceInfo.java#L43>,
and let the auditor check the properties of read-only bookies to see whether
a bookie needs to be drained and treated as unavailable. The bookie state API
<https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/server/http/service/BookieStateService.java>
might also be enhanced to support updating and persisting the state of a
bookie dynamically. A new API might also be needed to check whether all
ledgers have been moved off a "draining" bookie. Do you think these changes
make sense?
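
A minimal sketch of the two checks described above, assuming the state is
stored under a hypothetical "draining" key. Plain maps stand in for the
bookie properties and for ledger metadata; the class, the method names and
the key are illustrative only and are not existing BookKeeper APIs.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class DrainingCheckSketch {
    // Hypothetical property key; not an existing BookKeeper constant.
    static final String DRAINING_KEY = "draining";

    // True if the bookie's properties mark it as draining.
    static boolean isDraining(Map<String, String> properties) {
        return Boolean.parseBoolean(properties.getOrDefault(DRAINING_KEY, "false"));
    }

    // Auditor-side check: which read-only bookies should be treated as
    // unavailable and have their data moved off?
    // Input maps a bookie id to that bookie's properties.
    static Set<String> bookiesToDrain(Map<String, Map<String, String>> readOnlyBookies) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Map<String, String>> entry : readOnlyBookies.entrySet()) {
            if (isDraining(entry.getValue())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    // "Has everything been moved off?" check: true if no ledger still
    // lists the bookie. Input maps a ledger id to the bookies storing it.
    static boolean isDrained(String bookieId, Map<Long, Set<String>> ledgerBookies) {
        return ledgerBookies.values().stream()
                .noneMatch(bookies -> bookies.contains(bookieId));
    }
}
```

In a real implementation the inputs would come from the registration and
ledger metadata stores rather than in-memory maps.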

Best regards,
Yang Yang


On Mon, Sep 6, 2021 at 7:45 PM, Ivan Kelly <iv...@apache.org> wrote:

> Hi Yang,
>
> This is something we've been thinking about internally. It's
> especially important if we want to implement auto scaling for bookies.
>
> I'm not sure we need a "draining" state as such. Or at least, the
> draining state doesn't need to be at the same level as "read-only".
> "draining" is only interesting to the entity moving data off of the
> bookie, so the auditor could keep a record that it is draining bookie X,
> but the cluster as a whole doesn't need to care about it.
>
> From a logical point of view, decommissioning/scale down should be a
> matter of:
> 1. Mark the bookie as read-only so that no new data is added to it.
> 2. Wait for the bookie to hold no live data.
>
> The most important thing, and a thing that we have really missed in
> the past, is the ability to mark a running bookie as read-only.
> This should be trivial to implement, though the split between
> bookie-shell and bk-cli is a bit of a mess right now. I think there is
> a REST API endpoint, but it is non-persistent.
>
> Once the bookie is read-only, we have the following options for
> getting live data off of the bookie.
> 1. Wait for the Pulsar retention period to pass (currently available, but
> can take a long time).
> 2. Use tiered storage to move older data off. This is currently
> implemented as a Pulsar feature, but I think it would make sense to
> move it down to the bookie layer.
> 3. Use auditor/auto recovery to move the data.
>
> Personally I'm not a fan of the auditor/auto recovery stuff currently
> in BookKeeper. Any time we've relied on it in the past it has
> blown up on us or moved too slowly to be very useful. Part of the
> problem is that it conflates data integrity checking with bookie
> decommissioning. Data integrity is concerned with ensuring a bookie
> has the data ZooKeeper says it has.
> Decommissioning is moving data off of a bookie. One should be
> naturally cheap, the other expensive. With autorecovery, they both
> end up expensive.
>
> A problem with both tiered storage and autorecovery for
> decommissioning is that they need to move data and so induce load in
> the system. However, this load isn't well quantified, so they use manually
> set rate limiting, which doesn't respond to the rest of the load
> in the system. The first thing we need to do, and we are actively
> working on this, is to generate accurate utilization and saturation
> metrics for bookies. Once we have these metrics, the entity copying
> the data can do so much faster without impacting production
> traffic.
>
> Cheers,
> Ivan
>
