Hi, Dong,

For RAID5, I am not sure the rebuild cost is a big concern. If a disk
fails, typically an admin has to bring down the broker, replace the failed
disk with a new one, trigger the RAID rebuild, and bring up the broker.
This way, there is no performance impact at runtime due to rebuild. The
benefit is that a broker doesn't fail in a hard way when there is a disk
failure and can be brought down in a controlled way for maintenance. While
the broker is running with a failed disk, reads may be more expensive since
they have to be computed from the parity. However, if most reads are from
page cache, this may not be a big issue either. So, it would be useful to
do some tests on RAID5 before we completely rule it out.

Regarding whether to remove an offline replica from the fetcher thread
immediately: what do we do when a failed replica is a leader? Do we do
nothing, or do we mark the replica as no longer the leader immediately?
Intuitively, it seems better if the broker acts consistently on a failed
replica whether it's a leader or a follower. As for ISR churn, I was just
pointing out that if we don't send a StopReplicaRequest to a broker being
shut down in a controlled way, the leader will shrink the ISR, expand it,
and shrink it again after the timeout.
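
To make the consistency point concrete, here is a rough Java sketch of the
kind of handling I have in mind (ReplicaManager and all the method names
below are made-up placeholders, not the actual KIP interfaces):

    import org.apache.kafka.common.TopicPartition;
    import java.util.Set;

    // Hypothetical broker-side interface, only to illustrate the idea.
    interface ReplicaManager {
        Set<TopicPartition> partitionsOnDir(String logDir);
        boolean isLeader(TopicPartition tp);
        void stopLeading(TopicPartition tp);             // leader: stop serving the partition
        void removeFromFetcherThread(TopicPartition tp); // follower: stop fetching
        void markOffline(TopicPartition tp);
    }

    class LogDirFailureHandler {
        // Treat leaders and followers on the failed directory the same way:
        // take them offline immediately instead of letting them fail noisily.
        static void onLogDirFailure(ReplicaManager rm, String failedDir) {
            for (TopicPartition tp : rm.partitionsOnDir(failedDir)) {
                if (rm.isLeader(tp)) {
                    rm.stopLeading(tp);
                } else {
                    rm.removeFromFetcherThread(tp);
                }
                rm.markOffline(tp);
            }
        }
    }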

The KIP still seems to reference
"/broker/topics/[topic]/partitions/[partitionId]/controller_managed_state".

Thanks,

Jun

On Sat, Feb 25, 2017 at 7:49 PM, Dong Lin <lindon...@gmail.com> wrote:

> Hey Jun,
>
> Thanks for the suggestion. I think it is a good idea to not put the created
> flag in ZK and simply specify isNewReplica=true in the LeaderAndIsrRequest
> if the replica was in the NewReplica state. It will only fail the replica
> creation in the scenario where the controller fails after a
> topic-creation/partition-reassignment/partition-number-change but before it
> actually sends out the LeaderAndIsrRequest while there is an ongoing disk
> failure, which should be pretty rare and acceptable. This should simplify
> the design of this KIP.
>
> Regarding RAID-5, I think the concern with RAID-5/6 is not just about
> performance when there is no failure. For example, RAID-5 can tolerate at
> most one disk failure, and it takes time to rebuild the array after a disk
> failure. RAID-5 implementations are susceptible to system failures because
> of trends in array rebuild time and the chance of another drive failing
> during the rebuild. There is no such performance degradation for JBOD, and
> JBOD can survive multiple log directory failures without reducing the
> performance of the good log directories. Would this be a reasonable reason
> for using JBOD instead of RAID-5/6?
>
> Previously we discussed whether the broker should remove an offline replica
> from the replica fetcher thread. I still think it should do so instead of
> printing a lot of errors in the log4j log. We can still let the controller
> send a StopReplicaRequest to the broker. I am not sure I understand why
> allowing the broker to remove an offline replica from the fetcher thread
> would increase churn in the ISR. Do you think this is a concern with this
> approach?
>
> I have updated the KIP to remove the created flag from ZK and change the
> field name to isNewReplica. Can you check whether there is any issue with
> the latest KIP? Thanks for your time!
>
> Regards,
> Dong
>
>
> On Sat, Feb 25, 2017 at 9:11 AM, Jun Rao <j...@confluent.io> wrote:
>
> > Hi, Dong,
> >
> > Thanks for the reply.
> >
> > Personally, I'd prefer not to write the created flag per replica in ZK.
> > Your suggestion of disabling replica creation if there is a bad log
> > directory on the broker could work. The only thing is that it may delay
> > the creation of new replicas. I was thinking that an alternative is to
> > extend LeaderAndIsrRequest by adding an isNewReplica field per replica.
> > That field will be set when a replica is transitioning from the NewReplica
> > state to the Online state. Then, when a broker receives a
> > LeaderAndIsrRequest, if a replica is marked as a new replica, it will be
> > created on a good log directory, if not already present. Otherwise, the
> > broker only creates the replica if all log directories are good and the
> > replica is not already present. This way, we don't delay the processing of
> > new replicas in the common case.
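> >
> > To spell that decision out, here is a rough Java sketch (just an
> > illustration; the method and parameter names are placeholders, not the
> > actual KIP or broker API):
> >
> >     // Rough sketch of the broker-side decision on a LeaderAndIsrRequest.
> >     class LeaderAndIsrCreatePolicy {
> >         static boolean shouldCreateReplica(boolean alreadyPresent,
> >                                            boolean isNewReplica,
> >                                            boolean allLogDirsGood) {
> >             if (alreadyPresent) return false; // replica already exists locally
> >             if (isNewReplica) return true;    // NewReplica -> Online: create on a good log dir
> >             // Existing replica that is missing locally: only create it when all
> >             // log dirs are good, since it may simply live on the failed one.
> >             return allLogDirsGood;
> >         }
> >     }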
> >
> > I am ok with not persisting the offline replicas in ZK and just
> discovering
> > them through the LeaderAndIsrRequest. It handles the cases when a broker
> > starts up with bad log directories better. So, the additional overhead of
> > rediscovering the offline replicas is justified.
> >
> >
> > Another high-level question. The proposal rejected RAID5/6 since it adds
> > additional I/Os. The main issue with RAID5 is that to write a block that
> > doesn't match the RAID stripe size, we first have to read the old data and
> > parity to compute the new parity, which increases the number of I/Os
> > (http://rickardnobel.se/raid-5-write-penalty/). I am wondering if you have
> > tested RAID5's performance by creating a file system whose block size
> > matches the RAID stripe size
> > (https://www.percona.com/blog/2011/12/16/setting-up-xfs-the-simple-edition/).
> > This way, writing a block doesn't require a read first. A larger block
> > size may increase the amount of data written when the same block has to be
> > flushed to disk multiple times. However, this is probably ok in Kafka's
> > use case since we batch the I/O flush already. As you can see, we will be
> > adding some complexity to support JBOD in Kafka one way or another. If we
> > can tune the performance of RAID5 to match that of RAID10, perhaps using
> > RAID5 is a simpler solution.
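> >
> > To make the arithmetic behind the write penalty explicit, here is a toy
> > Java illustration (the disk count is just an example, not a
> > recommendation):
> >
> >     // Toy illustration of the RAID5 write penalty referenced above.
> >     class Raid5WritePenalty {
> >         public static void main(String[] args) {
> >             int disks = 4; // e.g. 3 data blocks + 1 parity block per stripe
> >
> >             // Sub-stripe (read-modify-write) update: read old data, read old
> >             // parity, write new data, write new parity = 4 I/Os per block.
> >             int smallWriteIos = 4;
> >
> >             // Full-stripe write: parity is computed from the new data alone,
> >             // so N physical writes carry N-1 blocks of payload and no reads.
> >             double fullStripeIosPerBlock = (double) disks / (disks - 1);
> >
> >             System.out.printf("sub-stripe write: %d I/Os per block%n", smallWriteIos);
> >             System.out.printf("full-stripe write: %.2f I/Os per block%n", fullStripeIosPerBlock);
> >         }
> >     }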
> >
> > Thanks,
> >
> > Jun
> >
> >
> > On Fri, Feb 24, 2017 at 10:17 AM, Dong Lin <lindon...@gmail.com> wrote:
> >
> > > Hey Jun,
> > >
> > > I don't think we should allow failed replicas to be re-created on the good
> > > disks. Say there are 2 disks and each of them is 51% loaded. If either disk
> > > fails and we allow its replicas to be re-created on the other disk, that
> > > disk would need 102% of its capacity, so both disks will fail.
> > > Alternatively, we can disable replica creation if there is a bad disk on a
> > > broker. I personally think it is worth the additional complexity in the
> > > broker to store created replicas in ZK, so that we allow new replicas to be
> > > created on the broker even when there is a bad log directory. This approach
> > > won't add complexity in the controller. But I am fine with disabling
> > > replica creation when there is a bad log directory if that is the only
> > > blocking issue for this KIP.
> > >
> > > Whether we store created flags is independent of whether/how we store
> > > offline replicas. Per our previous discussion, do you think it is OK not to
> > > store offline replicas in ZK and to propagate the offline replicas from the
> > > broker to the controller via the LeaderAndIsrRequest?
> > >
> > > Thanks,
> > > Dong
> > >
> >
>
