Hey Sriram,

I think there is one way to explain why the ability to move replicas
between disks can save network bandwidth. Let's say the load is
distributed to disks independently of the broker. Sooner or later, the
load imbalance will exceed a threshold and we will need to rebalance load
across disks. Now our question is whether our rebalancing algorithm will
be able to take advantage of locality by moving replicas between disks on
the same broker.

Say for a given disk, there is a 20% probability it is overloaded, a 20%
probability it is underloaded, and a 60% probability its load is around
the expected average load if the cluster is well balanced. Then for a
broker with 10 disks, we would expect 2 disks to need in-bound replica
movement, 2 disks to need out-bound replica movement, and 6 disks to need
no replica movement. Thus we would expect KIP-113 to be useful, since we
will be able to move replicas from the two overloaded disks to the two
underloaded disks on the same broker. Does this make sense?
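
As a rough back-of-the-envelope illustration (a sketch only, assuming the
20%/60%/20% split above and that disks are independent; the variable names
are mine), a small Python snippet suggests that a 10-disk broker will very
often have both an overloaded and an underloaded disk at the same time,
i.e. exactly the case where KIP-113 can rebalance locally:

    # Sketch only: probability that a broker has at least one overloaded
    # AND at least one underloaded disk, plus the expected disk counts.
    p_over, p_under, p_ok = 0.2, 0.2, 0.6
    num_disks = 10

    # Inclusion-exclusion over the "no overloaded disk" and
    # "no underloaded disk" events, assuming disks are independent.
    p_no_over = (1 - p_over) ** num_disks
    p_no_under = (1 - p_under) ** num_disks
    p_neither = p_ok ** num_disks
    p_both_present = 1 - p_no_over - p_no_under + p_neither

    print(round(p_both_present, 2))                 # ~0.79
    print(p_over * num_disks, p_under * num_disks)  # 2.0 2.0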

Thanks,
Dong

On Wed, Jun 7, 2017 at 2:12 PM, Dong Lin <lindon...@gmail.com> wrote:

> Hey Sriram,
>
> Thanks for raising these concerns. Let me answer these questions below:
>
> - The benefit of the additional complexity to move the data stored on a
> disk within the broker is to avoid network bandwidth usage. Creating a
> replica on another broker is less efficient than creating a replica on
> another disk in the same broker, if there is actually a lightly-loaded
> disk on the same broker.
>
> - In my opinion the rebalance algorithm would be this: 1) we balance the
> load across brokers using the same algorithm we are using today. 2) we
> balance load across disks on a given broker using a greedy algorithm,
> i.e. move replicas from the overloaded disks to the lightly loaded
> disks, as sketched below. The greedy algorithm would only consider the
> capacity and replica size. We can improve it to consider throughput in
> the future.
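>
> As an illustration only (a hypothetical sketch in Python; the data
> layout, names, and threshold here are made up, this is not the actual
> implementation), the greedy intra-broker step could look roughly like
> this:
>
>     def greedy_intra_broker_plan(disks, threshold=0.10):
>         """disks: {disk: {"capacity": bytes, "replicas": {name: size}}}
>         Returns a list of (replica, src_disk, dst_disk) moves."""
>         def usage(d):
>             return sum(disks[d]["replicas"].values()) / disks[d]["capacity"]
>         avg = sum(usage(d) for d in disks) / len(disks)
>         overloaded = [d for d in disks if usage(d) > avg + threshold]
>         underloaded = [d for d in disks if usage(d) < avg - threshold]
>         moves = []
>         for src in overloaded:
>             # Move replicas, largest first, until src is near the average.
>             for name, size in sorted(disks[src]["replicas"].items(),
>                                      key=lambda kv: kv[1], reverse=True):
>                 if usage(src) <= avg + threshold or not underloaded:
>                     break
>                 dst = min(underloaded, key=usage)
>                 # Only move if the destination stays below avg + threshold.
>                 new_dst = usage(dst) + size / disks[dst]["capacity"]
>                 if new_dst <= avg + threshold:
>                     disks[dst]["replicas"][name] = disks[src]["replicas"].pop(name)
>                     moves.append((name, src, dst))
>         return moves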
>
> - With 30 brokers, each having 10 disks, using the rebalancing
> algorithm, the chances of choosing disks within the broker can be high.
> There will always be load imbalance across disks of the same broker, for
> the same reason that there will always be load imbalance across brokers.
> The algorithm specified above will take advantage of locality, i.e.
> first balance load across disks of the same broker, and only balance
> across brokers if some brokers are much more loaded than others.
>
> I think it is useful to note that the load imbalance across disks of the
> same broker is independent of the load imbalance across brokers. Both
> are guaranteed to happen in any Kafka cluster for the same reason, i.e.
> variation in partition size. Say broker 1 has two disks that are 80%
> loaded and 20% loaded, and broker 2 has two disks that are also 80%
> loaded and 20% loaded. We can balance them without inter-broker traffic
> with KIP-113. This is why I think KIP-113 can be very useful.
>
> Do these explanations sound reasonable?
>
> Thanks,
> Dong
>
>
> On Wed, Jun 7, 2017 at 1:33 PM, Sriram Subramanian <r...@confluent.io>
> wrote:
>
>> Hey Dong,
>>
>> Thanks for the explanation. I don't think anyone is denying that we
>> should rebalance at the disk level. I think it is important to restore
>> the disk and not wait for disk replacement. Another benefit of doing
>> that is that you don't need to opt for hot-swap racks, which can save
>> cost.
>>
>> The question here is what do you save by trying to add complexity to
>> move the data stored on a disk within the broker? Why would you not
>> simply create another replica on the disk that results in a balanced
>> load across brokers and have it catch up? We are missing a few things
>> here -
>> 1. What would your data balancing algorithm be? Would it include just
>> capacity, or would it also consider throughput on disk to decide on the
>> final location of a partition?
>> 2. With 30 brokers, each having 10 disks, using the rebalancing
>> algorithm, the chances of choosing disks within the broker are going to
>> be low. This probability further decreases with more brokers and disks.
>> Given that, why are we trying to save network cost? How much would that
>> saving be if you go that route?
>>
>> These questions are hard to answer without verifying empirically. My
>> suggestion is to avoid doing premature optimization that brings added
>> complexity to the code, and to treat inter- and intra-broker movements
>> of partitions the same. Deploy the code, use it, and see if it is an
>> actual problem and whether you get great savings by avoiding the
>> network route to move partitions within the same broker. If so, add
>> this optimization.
>>
>> On Wed, Jun 7, 2017 at 1:03 PM, Dong Lin <lindon...@gmail.com> wrote:
>>
>> > Hey Jay, Sriram,
>> >
>> > Great point. If I understand you right, you are suggesting that we can
>> > simply use RAID-0 so that the load can be evenly distributed across
>> > disks. And even though a disk failure will bring down the entire
>> > broker, the reduced availability as compared to using KIP-112 and
>> > KIP-113 may be negligible. And it may be better to just accept the
>> > slightly reduced availability instead of introducing the complexity
>> > from KIP-112 and KIP-113.
>> >
>> > Let's assume the following:
>> >
>> > - There are 30 brokers in a cluster and each broker has 10 disks
>> > - The replication factor is 3 and min.isr = 2.
>> > - The annual disk failure rate is 2% according to this
>> > <https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/> blog.
>> > - It takes 3 days to replace a disk.
>> >
>> > Here is my calculation for the probability of data loss due to disk
>> > failure:
>> > probability that a given disk fails in a given year: 2%
>> > probability that a given disk is offline on a given day: 2% / 365 * 3
>> > probability that a given broker is offline on a given day due to disk
>> > failure: 2% / 365 * 3 * 10
>> > probability that any broker is offline on a given day due to disk
>> > failure: 2% / 365 * 3 * 10 * 30 = 5%
>> > probability that any three brokers are offline on a given day due to
>> > disk failure: 5% * 5% * 5% = 0.0125%
>> > probability of data loss due to disk failure: 0.0125%
>> >
>> > Here is my calculation for the probability of service unavailability
>> > due to disk failure:
>> > probability that a given disk fails in a given year: 2%
>> > probability that a given disk is offline on a given day: 2% / 365 * 3
>> > probability that a given broker is offline on a given day due to disk
>> > failure: 2% / 365 * 3 * 10
>> > probability that any broker is offline on a given day due to disk
>> > failure: 2% / 365 * 3 * 10 * 30 = 5%
>> > probability that any two brokers are offline on a given day due to
>> > disk failure: 5% * 5% = 0.25%
>> > probability of unavailability due to disk failure: 0.25%
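>> >
>> > For reference, here is the same back-of-the-envelope arithmetic as a
>> > small Python sketch (the broker-level figures are rough approximations,
>> > as above, not exact probabilities):
>> >
>> >     annual_disk_failure_rate = 0.02
>> >     days_to_replace = 3
>> >     disks_per_broker = 10
>> >     brokers = 30
>> >
>> >     p_disk_offline = annual_disk_failure_rate / 365 * days_to_replace
>> >     p_broker_offline = p_disk_offline * disks_per_broker   # ~0.16%
>> >     p_any_broker_offline = p_broker_offline * brokers      # ~5%
>> >     p_data_loss = p_any_broker_offline ** 3                # ~0.0125%
>> >     p_unavailability = p_any_broker_offline ** 2           # ~0.25%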
>> >
>> > Note that the unavailability due to disk failure would be unacceptably
>> > high in this case. And the probability of data loss due to disk
>> > failure would be higher than 0.01%. Neither is acceptable if Kafka is
>> > intended to achieve four nines of availability.
>> >
>> > Thanks,
>> > Dong
>> >
>> >
>> > On Tue, Jun 6, 2017 at 11:26 PM, Jay Kreps <j...@confluent.io> wrote:
>> >
>> > > I think Ram's point is that in-place failure handling is pretty
>> > > complicated, and since this is meant to be a cost-saving feature, we
>> > > should construct an argument for it grounded in data.
>> > >
>> > > Assume an annual failure rate of 1% (reasonable, but data is
>> > > available online), and assume it takes 3 days to get the drive
>> > > replaced. Say you have 10 drives per server. Then the expected
>> > > downtime for each server is roughly 1% * 3 days * 10 = 0.3 days/year
>> > > (this is slightly off since I'm ignoring the case of multiple
>> > > failures, but I don't think that changes it much). So the savings
>> > > from this feature is 0.3/365 = 0.08%. Say you have 1000 servers and
>> > > they cost $3000/year fully loaded including power, the cost of the
>> > > hw amortized over its life, etc. Then this feature saves you $3000
>> > > on your total server cost of $3m, which seems not very worthwhile
>> > > compared to other optimizations...?
>> > >
>> > > Anyhow, not sure the arithmetic is right there, but I think that is
>> > > the type of argument that would be helpful for thinking about the
>> > > tradeoff in complexity.
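>> > >
>> > > (A quick sketch of that arithmetic in Python, using the same assumed
>> > > numbers as above; this is an estimate only, not measured data:)
>> > >
>> > >     annual_failure_rate = 0.01
>> > >     days_to_replace = 3
>> > >     drives_per_server = 10
>> > >     servers = 1000
>> > >     cost_per_server_per_year = 3000            # fully loaded, $/year
>> > >
>> > >     downtime_days = (annual_failure_rate * days_to_replace
>> > >                      * drives_per_server)      # 0.3 days/year
>> > >     downtime_fraction = downtime_days / 365    # ~0.08%
>> > >     total_cost = servers * cost_per_server_per_year   # $3M
>> > >     savings = downtime_fraction * total_cost   # ~$2,500/year, i.e.
>> > >                                                # the $3000 ballpark above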
>> > >
>> > > -Jay
>> > >
>> > >
>> > >
>> > > On Tue, Jun 6, 2017 at 7:09 PM, Dong Lin <lindon...@gmail.com> wrote:
>> > >
>> > > > Hey Sriram,
>> > > >
>> > > > Thanks for taking time to review the KIP. Please see below my
>> > > > answers to your questions:
>> > > >
>> > > > >1. Could you pick a hardware/Kafka configuration and go over what
>> > > > >is the average disk/partition repair/restore time that we are
>> > > > >targeting for a typical JBOD setup?
>> > > >
>> > > > We currently don't have this data. I think the disk/partition
>> > > > repair/restore time depends on the availability of hardware, the
>> > > > response time of the site-reliability engineers, the amount of data
>> > > > on the bad disk, etc. These vary between companies, and even between
>> > > > clusters within the same company, and it is probably hard to
>> > > > determine what the average situation is.
>> > > >
>> > > > I am not very sure why we need this. Can you explain a bit why this
>> > > > data is useful to evaluate the motivation and design of this KIP?
>> > > >
>> > > > >2. How often do we believe disks are going to fail (in your example
>> > > > >configuration) and what do we gain by avoiding the network overhead
>> > > > >and doing all the work of moving the replica within the broker to
>> > > > >another disk instead of balancing it globally?
>> > > >
>> > > > I think the chance of disk failure depends mainly on the disk
>> > > > itself rather than the broker configuration. I don't have this data
>> > > > now. I will ask our SREs whether they know the mean time to failure
>> > > > for our disks. What I was told by the SREs is that disk failure is
>> > > > the most common type of hardware failure.
>> > > >
>> > > > When there is a disk failure, I think it is reasonable to move the
>> > > > replica to another broker instead of another disk on the same
>> > > > broker. The reason we want to move replicas within a broker is
>> > > > mainly to optimize Kafka cluster performance when we balance load
>> > > > across disks.
>> > > >
>> > > > In comparison to balancing replicas globally, the benefits of
>> > > > moving replicas within a broker are:
>> > > >
>> > > > 1) the movement is faster since it doesn't go through a socket or
>> > > > rely on the available network bandwidth;
>> > > > 2) there is much less impact on the replication traffic between
>> > > > brokers, since it does not take up bandwidth between brokers.
>> > > > Depending on the pattern of traffic, we may need to balance load
>> > > > across disks frequently, and it is necessary to prevent this
>> > > > operation from slowing down the existing operations (e.g. produce,
>> > > > consume, replication) in the Kafka cluster;
>> > > > 3) it gives us the opportunity to do automatic rebalancing between
>> > > > disks on the same broker.
>> > > >
>> > > >
>> > > > >3. Even if we had to move the replica within the broker, why
>> > > > >cannot we just treat it as another replica and have it go through
>> > > > >the same replication code path that we have today? The downside
>> > > > >here is obviously that you need to catch up from the leader but it
>> > > > >is completely free! What do we think is the impact of the network
>> > > > >overhead in this case?
>> > > >
>> > > > Good point. My initial proposal actually used the existing
>> > > > ReplicaFetcherThread (i.e. the existing code path) to move replicas
>> > > > between disks. However, I switched to using a separate thread pool
>> > > > after discussion with Jun and Becket.
>> > > >
>> > > > The main argument for using a separate thread pool is actually to
>> > > > keep the design simple and easy to reason about. There are a number
>> > > > of differences between inter-broker replication and intra-broker
>> > > > replication which make it cleaner to do them in separate code paths.
>> > > > I will list them below:
>> > > >
>> > > > - The throttling mechanism for inter-broker replication traffic and
>> > > > intra-broker replication traffic is different. For example, we may
>> > > > want to specify a per-topic quota for inter-broker replication
>> > > > traffic because we may want some topics to be moved faster than
>> > > > other topics. But we don't care about the priority of topics for
>> > > > intra-broker movement. So the current proposal only allows the user
>> > > > to specify a per-broker quota for intra-broker replication traffic.
>> > > >
>> > > > - The quota value for inter-broker replication traffic and
>> > > > intra-broker replication traffic is different. The available
>> > > > bandwidth for intra-broker replication can probably be much higher
>> > > > than the bandwidth for inter-broker replication.
>> > > >
>> > > > - The ReplicaFetcherThread is per broker. Intuitively, the number
>> > > > of threads doing intra-broker data movement should be related to
>> > > > the number of disks in the broker, not the number of brokers in the
>> > > > cluster.
>> > > >
>> > > > - The leader replica has no ReplicaFetcherThread to start with. It
>> > > > seems weird to start one just for intra-broker replication.
>> > > >
>> > > > Because of these differences, we think it is simpler to use a
>> > > > separate thread pool and code path so that we can configure and
>> > > > throttle them separately.
>> > > >
>> > > >
>> > > > >4. What are the chances that we will be able to identify another
>> > > > >disk to balance within the broker instead of another disk on
>> > > > >another broker? If we have 100's of machines, the probability of
>> > > > >finding a better balance by choosing another broker is much higher
>> > > > >than balancing within the broker. Could you add some info on how we
>> > > > >are determining this?
>> > > >
>> > > > It is possible that we can find available space on a remote broker.
>> > > > The benefit of allowing intra-broker replication is that, when there
>> > > > is available space on both the current broker and a remote broker,
>> > > > the rebalance can be completed faster with much less impact on the
>> > > > inter-broker replication or the user traffic. It is about taking
>> > > > advantage of locality when balancing the load.
>> > > >
>> > > > >5. Finally, in a cloud setup where more users are going to leverage
>> > > > >a shared filesystem (example, EBS in AWS), all this change is not of
>> > > > >much gain since you don't need to balance between the volumes within
>> > > > >the same broker.
>> > > >
>> > > > You are right. KIP-113 is useful only if the user uses JBOD. If the
>> > > > user uses an extra storage layer of replication, such as RAID-10 or
>> > > > EBS, they don't need KIP-112 or KIP-113. Note that the user will
>> > > > replicate data more times than the replication factor of the Kafka
>> > > > topic if an extra storage layer of replication is used.
>> > > >
>> > >
>> >
>>
>
>
