Hi, Eno,

Thanks for the pointers. Doesn't RAID-10 have a similar issue during rebuild? In both cases, all the data on the existing disks has to be read during the rebuild? RAID-10 still seems to be widely used.
Jun

On Mon, Feb 27, 2017 at 1:38 PM, Eno Thereska <eno.there...@gmail.com> wrote:

> Unfortunately RAID-5/6 is not typically advised anymore due to failure issues, as Dong mentions, e.g.: http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/
>
> Eno
>
> On 27 Feb 2017, at 21:16, Jun Rao <j...@confluent.io> wrote:
>
> > Hi, Dong,
> >
> > For RAID-5, I am not sure the rebuild cost is a big concern. If a disk fails, typically an admin has to bring down the broker, replace the failed disk with a new one, trigger the RAID rebuild, and bring up the broker. This way, there is no performance impact at runtime due to the rebuild. The benefit is that a broker doesn't fail in a hard way when there is a disk failure and can be brought down in a controlled way for maintenance. While the broker is running with a failed disk, reads may be more expensive since they have to be computed from the parity. However, if most reads are served from the page cache, this may not be a big issue either. So, it would be useful to do some tests on RAID-5 before we completely rule it out.
> >
> > Regarding whether to remove an offline replica from the fetcher thread immediately: what do we do when a failed replica is a leader? Do we do nothing, or mark the replica as not the leader immediately? Intuitively, it seems better if the broker acts consistently on a failed replica whether it's a leader or a follower. For ISR churn, I was just pointing out that if we don't send StopReplicaRequest to a broker to be shut down in a controlled way, then the leader will shrink the ISR, expand it, and shrink it again after the timeout.
> >
> > The KIP seems to still reference "/broker/topics/[topic]/partitions/[partitionId]/controller_managed_state".
> >
> > Thanks,
> >
> > Jun
> >
> > On Sat, Feb 25, 2017 at 7:49 PM, Dong Lin <lindon...@gmail.com> wrote:
> >
> >> Hey Jun,
> >>
> >> Thanks for the suggestion. I think it is a good idea not to put the created flag in ZK and to simply specify isNewReplica=true in the LeaderAndIsrRequest if the replica was in the NewReplica state. It will only fail the replica creation in the scenario that the controller fails after topic-creation/partition-reassignment/partition-number-change but before it actually sends out the LeaderAndIsrRequest while there is an ongoing disk failure, which should be pretty rare and acceptable. This should simplify the design of this KIP.
> >>
> >> Regarding RAID-5, I think the concern with RAID-5/6 is not just about performance when there is no failure. For example, RAID-5 can tolerate at most one disk failure, and it takes time to rebuild the disk after that failure. RAID-5 implementations are susceptible to system failures because of trends regarding array rebuild time and the chance of drive failure during rebuild. There is no such performance degradation for JBOD, and JBOD can survive multiple log directory failures without reducing the performance of the good log directories. Would this be a reasonable reason for using JBOD instead of RAID-5/6?
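
(For concreteness, a back-of-the-envelope version of the rebuild-time concern, using assumed numbers rather than measurements from any particular array:

    10 TB drive / ~100 MB/s sequential rebuild rate
    = 10^13 B / 10^8 B/s = 10^5 s, i.e. roughly 28 hours

A second drive failure or an unrecoverable read error anywhere in that window can kill a RAID-5 rebuild, which is essentially the argument in the ZDNet article linked above.)
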
> >> Previously we discussed whether the broker should remove an offline replica from the replica fetcher thread. I still think it should do so instead of printing a lot of errors in the log4j log. We can still let the controller send StopReplicaRequest to the broker. I am not sure I understand why allowing the broker to remove an offline replica from the fetcher thread would increase churn in the ISR. Do you think this is a concern with this approach?
> >>
> >> I have updated the KIP to remove the created flag from ZK and to change the field name to isNewReplica. Can you check if there is any issue with the latest KIP? Thanks for your time!
> >>
> >> Regards,
> >> Dong
> >>
> >> On Sat, Feb 25, 2017 at 9:11 AM, Jun Rao <j...@confluent.io> wrote:
> >>
> >>> Hi, Dong,
> >>>
> >>> Thanks for the reply.
> >>>
> >>> Personally, I'd prefer not to write the created flag per replica in ZK. Your suggestion of disabling replica creation if there is a bad log directory on the broker could work. The only thing is that it may delay the creation of new replicas. I was thinking that an alternative is to extend LeaderAndIsrRequest by adding an isNewReplica field per replica. That field will be set when a replica is transitioning from the NewReplica state to the Online state. Then, when a broker receives a LeaderAndIsrRequest, if a replica is marked as a new replica, it will be created on a good log directory, if not already present. Otherwise, it only creates the replica if all log directories are good and the replica is not already present. This way, we don't delay the processing of new replicas in the common case.
> >>>
> >>> I am ok with not persisting the offline replicas in ZK and just discovering them through the LeaderAndIsrRequest. It handles the case where a broker starts up with bad log directories better. So, the additional overhead of rediscovering the offline replicas is justified.
> >>>
> >>> Another high-level question: the proposal rejected RAID-5/6 since it adds additional I/Os. The main issue with RAID-5 is that to write a block that doesn't match the RAID stripe size, we have to first read the old parity to compute the new one, which increases the number of I/Os (http://rickardnobel.se/raid-5-write-penalty/). I am wondering if you have tested RAID-5's performance by creating a file system whose block size matches the RAID stripe size (https://www.percona.com/blog/2011/12/16/setting-up-xfs-the-simple-edition/). This way, writing a block doesn't require a read first. A large block size may increase the amount of data written when the same block has to be written to disk multiple times. However, this is probably ok in Kafka's use case since we batch the I/O flush already. As you can see, we will be adding some complexity to support JBOD in Kafka one way or another. If we can tune the performance of RAID-5 to match that of RAID-10, perhaps using RAID-5 is a simpler solution.
> >>>
> >>> Thanks,
> >>>
> >>> Jun
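
For reference, a rough sketch in Java of the broker-side decision described above when handling a LeaderAndIsrRequest. This only reflects my reading of the proposal; the class and method names are made up for illustration and are not actual Kafka code.

// Illustrative sketch only: decides whether a replica listed in a
// LeaderAndIsrRequest should be created locally, per the isNewReplica proposal.
final class ReplicaCreationSketch {

    /**
     * @param replicaAlreadyPresent the replica already exists on one of the broker's log dirs
     * @param isNewReplica          set by the controller when the replica moves from the
     *                              NewReplica state to the Online state
     * @param anyLogDirOffline      at least one log directory on this broker has failed
     */
    static boolean shouldCreateReplica(boolean replicaAlreadyPresent,
                                       boolean isNewReplica,
                                       boolean anyLogDirOffline) {
        if (replicaAlreadyPresent) {
            return false;       // already on disk, nothing to create
        }
        if (isNewReplica) {
            return true;        // brand-new replica: place it on any good log directory
        }
        // An existing replica that is not found locally may live on a failed log
        // directory, so only re-create it when every log directory is healthy.
        return !anyLogDirOffline;
    }

    private ReplicaCreationSketch() { }
}

The point is simply that the common case (a genuinely new replica) is never delayed by a failed log directory, while an existing replica that cannot be found locally is never silently re-created on the surviving directories.
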
> >>> On Fri, Feb 24, 2017 at 10:17 AM, Dong Lin <lindon...@gmail.com> wrote:
> >>>
> >>>> Hey Jun,
> >>>>
> >>>> I don't think we should allow failed replicas to be re-created on the good disks. Say there are 2 disks and each of them is 51% loaded. If either disk fails and we allow its replicas to be re-created on the other disk, that disk will be overloaded and fail as well. Alternatively, we can disable replica creation if there is a bad disk on a broker. I personally think it is worth the additional complexity in the broker to store created replicas in ZK, so that we allow new replicas to be created on the broker even when there is a bad log directory. This approach won't add complexity in the controller. But I am fine with disabling replica creation when there is a bad log directory, if that is the only blocking issue for this KIP.
> >>>>
> >>>> Whether we store created flags is independent of whether/how we store offline replicas. Per our previous discussion, do you think it is OK not to store offline replicas in ZK and to propagate the offline replicas from the broker to the controller via LeaderAndIsrRequest?
> >>>>
> >>>> Thanks,
> >>>> Dong