Unfortunately, RAID-5/6 is typically no longer advised due to failure issues,
as Dong mentions, e.g.:
http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/ 
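
(If I recall the argument there correctly: with an unrecoverable read error rate
of roughly 1 in 10^14 bits, i.e. about one bad sector per ~12 TB read,
rebuilding an array of multi-TB drives means reading enough data that hitting
at least one such error during the rebuild becomes likely, and RAID-6 only buys
one extra failure before you are in the same situation.)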

Eno


> On 27 Feb 2017, at 21:16, Jun Rao <j...@confluent.io> wrote:
> 
> Hi, Dong,
> 
> For RAID5, I am not sure the rebuild cost is a big concern. If a disk
> fails, typically an admin has to bring down the broker, replace the failed
> disk with a new one, trigger the RAID rebuild, and bring up the broker.
> This way, there is no performance impact at runtime due to rebuild. The
> benefit is that a broker doesn't fail in a hard way when there is a disk
> failure and can be brought down in a controlled way for maintenance. While
> the broker is running with a failed disk, reads may be more expensive since
> they have to be computed from the parity. However, if most reads are from
> page cache, this may not be a big issue either. So, it would be useful to
> do some tests on RAID5 before we completely rule it out.
> 
> Regarding whether to remove an offline replica from the fetcher thread
> immediately: what do we do when a failed replica is a leader? Do we do
> nothing, or mark the replica as no longer the leader immediately?
> Intuitively, it seems better if the broker acts consistently on a failed
> replica whether it's a leader or a follower. For ISR churn, I was just
> pointing out that if we don't send a StopReplicaRequest to a broker that is
> being shut down in a controlled way, then the leader will shrink the ISR,
> expand it, and shrink it again after the timeout.
> 
> The KIP seems to still reference "
> /broker/topics/[topic]/partitions/[partitionId]/controller_managed_state".
> 
> Thanks,
> 
> Jun
> 
> On Sat, Feb 25, 2017 at 7:49 PM, Dong Lin <lindon...@gmail.com> wrote:
> 
>> Hey Jun,
>> 
>> Thanks for the suggestion. I think it is a good idea not to put the
>> created flag in ZK and to simply specify isNewReplica=true in the
>> LeaderAndIsrRequest if the replica was in the NewReplica state. It will
>> only fail the replica creation in the scenario where the controller fails
>> after topic-creation/partition-reassignment/partition-number-change but
>> before it actually sends out the LeaderAndIsrRequest while there is an
>> ongoing disk failure, which should be pretty rare and acceptable. This
>> should simplify the design of this KIP.
>> 
>> Regarding RAID-5, I think the concern with RAID-5/6 is not just about
>> performance when there is no failure. For example, RAID-5 can tolerate at
>> most one disk failure, and it takes time to rebuild the array after a disk
>> fails. RAID-5 implementations are susceptible to system failures because
>> of trends in array rebuild time and the chance of another drive failing
>> during the rebuild. There is no such performance degradation for JBOD, and
>> JBOD can tolerate multiple log directory failures without reducing the
>> performance of the good log directories. Would this be a reasonable reason
>> for using JBOD instead of RAID-5/6?
>> 
>> Previously we discussed whether the broker should remove an offline
>> replica from the replica fetcher thread. I still think it should do so
>> instead of printing a lot of errors in the log4j log. We can still let the
>> controller send a StopReplicaRequest to the broker. I am not sure I
>> understand why allowing the broker to remove an offline replica from the
>> fetcher thread would increase ISR churn. Do you think this is a concern
>> with this approach?
>> 
>> I have updated the KIP to remove the created flag from ZK and change the
>> field name to isNewReplica. Can you check whether there is any issue with
>> the latest KIP? Thanks for your time!
>> 
>> Regards,
>> Dong
>> 
>> 
>> On Sat, Feb 25, 2017 at 9:11 AM, Jun Rao <j...@confluent.io> wrote:
>> 
>>> Hi, Dong,
>>> 
>>> Thanks for the reply.
>>> 
>>> Personally, I'd prefer not to write the created flag per replica in ZK.
>>> Your suggestion of disabling replica creation if there is a bad log
>>> directory on the broker could work. The only thing is that it may delay
>>> the creation of new replicas. I was thinking that an alternative is to
>>> extend LeaderAndIsrRequest by adding an isNewReplica field per replica.
>>> That field will be set when a replica is transitioning from the
>>> NewReplica state to the Online state. Then, when a broker receives a
>>> LeaderAndIsrRequest, if a replica is marked as a new replica, it will be
>>> created on a good log directory, if not already present. Otherwise, it
>>> only creates the replica if all log directories are good and the replica
>>> is not already present. This way, we don't delay the processing of new
>>> replicas in the common case.
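>>> 
>>> In rough Java, the check I have in mind is something like the following
>>> (just a sketch to illustrate the intent; the class and method names are
>>> made up and not from the KIP):
>>> 
>>>   // Sketch only: should the broker create this replica locally when
>>>   // handling a LeaderAndIsrRequest?
>>>   public class ReplicaCreationCheck {
>>>       static boolean shouldCreate(boolean isNewReplica,
>>>                                   boolean alreadyPresent,
>>>                                   boolean anyLogDirOffline) {
>>>           if (alreadyPresent) return false; // replica already exists locally
>>>           if (isNewReplica) return true;    // create it on any good log dir
>>>           // Not marked new and missing locally: it may live on the bad
>>>           // disk, so only (re)create it when all log dirs are healthy.
>>>           return !anyLogDirOffline;
>>>       }
>>>   }
>>> 
>>> With that check, a brand-new replica still lands on a good log directory
>>> even when another directory is offline, while a replica that is merely
>>> missing locally is not silently re-created on the remaining disks.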
>>> 
>>> I am ok with not persisting the offline replicas in ZK and just
>>> discovering them through the LeaderAndIsrRequest. It handles the cases
>>> when a broker starts up with bad log directories better. So, the
>>> additional overhead of rediscovering the offline replicas is justified.
>>> 
>>> 
>>> Another high level question. The proposal rejected RAID5/6 since it adds
>>> additional I/Os. The main issue with RAID5 is that to write a block that
>>> doesn't match the RAID stripe size, we have to first read the old parity
>>> to compute the new one, which increases the number of I/Os
>>> (http://rickardnobel.se/raid-5-write-penalty/). I am wondering if you
>>> have tested RAID5's performance by creating a file system whose block
>>> size matches the RAID stripe size
>>> (https://www.percona.com/blog/2011/12/16/setting-up-xfs-the-simple-edition/).
>>> This way, writing a block doesn't require a read first. A large block
>>> size may increase the amount of data writes, when the same block has to
>>> be written to disk multiple times. However, this is probably ok in
>>> Kafka's use case since we batch the I/O flush already. As you can see, we
>>> will be adding some complexity to support JBOD in Kafka one way or
>>> another. If we can tune the performance of RAID5 to match that of RAID10,
>>> perhaps using RAID5 is a simpler solution.
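>>> 
>>> To make the write penalty concrete (the numbers below are just an
>>> example): with four data disks plus parity and a 64 KB chunk, a full
>>> stripe holds 4 x 64 KB = 256 KB of data. A write smaller than the stripe
>>> costs four I/Os (read old data, read old parity, write new data, write
>>> new parity), whereas a full-stripe write can compute the parity from the
>>> new data alone and needs no reads at all, just the five chunk writes.
>>> That is why aligning the file system with the stripe size matters here.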
>>> 
>>> Thanks,
>>> 
>>> Jun
>>> 
>>> 
>>> On Fri, Feb 24, 2017 at 10:17 AM, Dong Lin <lindon...@gmail.com> wrote:
>>> 
>>>> Hey Jun,
>>>> 
>>>> I don't think we should allow failed replicas to be re-created on the
>>>> good disks. Say there are 2 disks and each of them is 51% loaded. If
>>>> either disk fails and we allow its replicas to be re-created on the
>>>> other disk, that disk becomes overloaded and will fail as well.
>>>> Alternatively we can disable replica creation if there is a bad disk on
>>>> a broker. I personally think it is worth the additional complexity in
>>>> the broker to store created replicas in ZK so that we allow new replicas
>>>> to be created on the broker even when there is a bad log directory. This
>>>> approach won't add complexity in the controller. But I am fine with
>>>> disabling replica creation when there is a bad log directory if it is
>>>> the only blocking issue for this KIP.
>>>> 
>>>> Whether we store created flags is independent of whether/how we store
>>>> offline replicas. Per our previous discussion, do you think it is OK not
>>>> to store offline replicas in ZK and to propagate the offline replicas
>>>> from the broker to the controller via LeaderAndIsrRequest?
>>>> 
>>>> Thanks,
>>>> Dong
>>>> 
>>> 
>> 
