Hi, Eno,

Thanks for the pointers. Doesn't RAID-10 have a similar issue during rebuild? In both cases, all the data on the existing disks has to be read during the rebuild? RAID-10 still seems to be widely used.
Jun

On Mon, Feb 27, 2017 at 1:38 PM, Eno Thereska <eno.there...@gmail.com> wrote:

> Unfortunately RAID-5/6 is not typically advised anymore due to failure issues, as Dong mentions, e.g.: http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/
>
> Eno
>
> On 27 Feb 2017, at 21:16, Jun Rao <j...@confluent.io> wrote:
>
> > Hi, Dong,
> >
> > For RAID-5, I am not sure the rebuild cost is a big concern. If a disk fails, typically an admin has to bring down the broker, replace the failed disk with a new one, trigger the RAID rebuild, and bring up the broker. This way, there is no performance impact at runtime due to the rebuild. The benefit is that a broker doesn't fail in a hard way when there is a disk failure and can be brought down in a controlled way for maintenance. While the broker is running with a failed disk, reads may be more expensive since they have to be computed from the parity. However, if most reads are served from the page cache, this may not be a big issue either. So, it would be useful to do some tests on RAID-5 before we completely rule it out.
> >
> > Regarding whether to remove an offline replica from the fetcher thread immediately: what do we do when a failed replica is a leader? Do we do nothing, or mark the replica as not the leader immediately? Intuitively, it seems better if the broker acts consistently on a failed replica whether it's a leader or a follower. For ISR churn, I was just pointing out that if we don't send StopReplicaRequest to a broker to be shut down in a controlled way, then the leader will shrink the ISR, expand it, and shrink it again after the timeout.
> >
> > The KIP seems to still reference "/broker/topics/[topic]/partitions/[partitionId]/controller_managed_state".
> >
> > Thanks,
> >
> > Jun
> >
> > On Sat, Feb 25, 2017 at 7:49 PM, Dong Lin <lindon...@gmail.com> wrote:
> >
> >> Hey Jun,
> >>
> >> Thanks for the suggestion. I think it is a good idea not to put the created flag in ZK and to simply specify isNewReplica=true in the LeaderAndIsrRequest if the replica was in the NewReplica state. It will only fail the replica creation in the scenario that the controller fails after topic-creation/partition-reassignment/partition-number-change but before it actually sends out the LeaderAndIsrRequest while there is an ongoing disk failure, which should be pretty rare and acceptable. This should simplify the design of this KIP.
> >>
> >> Regarding RAID-5, I think the concern with RAID-5/6 is not just about performance when there is no failure. For example, RAID-5 can tolerate at most one disk failure, and it takes time to rebuild the disk after that failure. RAID-5 implementations are susceptible to system failures because of trends regarding array rebuild time and the chance of drive failure during rebuild. There is no such performance degradation for JBOD, and JBOD can survive multiple log directory failures without reducing the performance of the good log directories. Would this be a reasonable reason for using JBOD instead of RAID-5/6?
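
(For concreteness, a back-of-the-envelope version of the rebuild-time concern, using assumed numbers rather than measurements from any particular array:

    10 TB drive / ~100 MB/s sequential rebuild rate
    = 10^13 B / 10^8 B/s = 10^5 s, i.e. roughly 28 hours

A second drive failure or an unrecoverable read error anywhere in that window can kill a RAID-5 rebuild, which is essentially the argument in the ZDNet article linked above.)
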
> >> Previously we discussed whether the broker should remove an offline replica from the replica fetcher thread. I still think it should do so instead of printing a lot of errors in the log4j log. We can still let the controller send StopReplicaRequest to the broker. I am not sure I understand why allowing the broker to remove an offline replica from the fetcher thread would increase churn in the ISR. Do you think this is a concern with this approach?
> >>
> >> I have updated the KIP to remove the created flag from ZK and to change the field name to isNewReplica. Can you check if there is any issue with the latest KIP? Thanks for your time!
> >>
> >> Regards,
> >> Dong
> >>
> >> On Sat, Feb 25, 2017 at 9:11 AM, Jun Rao <j...@confluent.io> wrote:
> >>
> >>> Hi, Dong,
> >>>
> >>> Thanks for the reply.
> >>>
> >>> Personally, I'd prefer not to write the created flag per replica in ZK. Your suggestion of disabling replica creation if there is a bad log directory on the broker could work. The only thing is that it may delay the creation of new replicas. I was thinking that an alternative is to extend LeaderAndIsrRequest by adding an isNewReplica field per replica. That field will be set when a replica is transitioning from the NewReplica state to the Online state. Then, when a broker receives a LeaderAndIsrRequest, if a replica is marked as a new replica, it will be created on a good log directory, if not already present. Otherwise, it only creates the replica if all log directories are good and the replica is not already present. This way, we don't delay the processing of new replicas in the common case.
> >>>
> >>> I am ok with not persisting the offline replicas in ZK and just discovering them through the LeaderAndIsrRequest. It handles the case where a broker starts up with bad log directories better. So, the additional overhead of rediscovering the offline replicas is justified.
> >>>
> >>> Another high-level question: the proposal rejected RAID-5/6 since it adds additional I/Os. The main issue with RAID-5 is that to write a block that doesn't match the RAID stripe size, we have to first read the old parity to compute the new one, which increases the number of I/Os (http://rickardnobel.se/raid-5-write-penalty/). I am wondering if you have tested RAID-5's performance by creating a file system whose block size matches the RAID stripe size (https://www.percona.com/blog/2011/12/16/setting-up-xfs-the-simple-edition/). This way, writing a block doesn't require a read first. A large block size may increase the amount of data written when the same block has to be written to disk multiple times. However, this is probably ok in Kafka's use case since we batch the I/O flush already. As you can see, we will be adding some complexity to support JBOD in Kafka one way or another. If we can tune the performance of RAID-5 to match that of RAID-10, perhaps using RAID-5 is a simpler solution.
> >>>
> >>> Thanks,
> >>>
> >>> Jun
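
For reference, a rough sketch in Java of the broker-side decision described above when handling a LeaderAndIsrRequest. This only reflects my reading of the proposal; the class and method names are made up for illustration and are not actual Kafka code.

// Illustrative sketch only: decides whether a replica listed in a
// LeaderAndIsrRequest should be created locally, per the isNewReplica proposal.
final class ReplicaCreationSketch {

    /**
     * @param replicaAlreadyPresent the replica already exists on one of the broker's log dirs
     * @param isNewReplica          set by the controller when the replica moves from the
     *                              NewReplica state to the Online state
     * @param anyLogDirOffline      at least one log directory on this broker has failed
     */
    static boolean shouldCreateReplica(boolean replicaAlreadyPresent,
                                       boolean isNewReplica,
                                       boolean anyLogDirOffline) {
        if (replicaAlreadyPresent) {
            return false;       // already on disk, nothing to create
        }
        if (isNewReplica) {
            return true;        // brand-new replica: place it on any good log directory
        }
        // An existing replica that is not found locally may live on a failed log
        // directory, so only re-create it when every log directory is healthy.
        return !anyLogDirOffline;
    }

    private ReplicaCreationSketch() { }
}

The point is simply that the common case (a genuinely new replica) is never delayed by a failed log directory, while an existing replica that cannot be found locally is never silently re-created on the surviving directories.
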
> >>> On Fri, Feb 24, 2017 at 10:17 AM, Dong Lin <lindon...@gmail.com> wrote:
> >>>
> >>>> Hey Jun,
> >>>>
> >>>> I don't think we should allow failed replicas to be re-created on the good disks. Say there are 2 disks and each of them is 51% loaded. If either disk fails and we allow its replicas to be re-created on the other disk, that disk will be overloaded and fail as well. Alternatively, we can disable replica creation if there is a bad disk on a broker. I personally think it is worth the additional complexity in the broker to store created replicas in ZK, so that we allow new replicas to be created on the broker even when there is a bad log directory. This approach won't add complexity in the controller. But I am fine with disabling replica creation when there is a bad log directory, if that is the only blocking issue for this KIP.
> >>>>
> >>>> Whether we store created flags is independent of whether/how we store offline replicas. Per our previous discussion, do you think it is OK not to store offline replicas in ZK and to propagate the offline replicas from the broker to the controller via LeaderAndIsrRequest?
> >>>>
> >>>> Thanks,
> >>>> Dong