Hi Dong, Your comments on KIP-179 prompted me to look at KIP-113, and I have a question:

AFAICS the DescribeDirsResponse (via ReplicaInfo) can be used to get the size of a partition on a disk, but I don't see a mechanism for knowing the total capacity of a disk (and/or the free capacity of a disk). That would be very useful information to have to help figure out that certain assignments are impossible, for instance. Is there a reason you've left this out? Cheers, Tom On 4 August 2017 at 18:47, Dong Lin <lindon...@gmail.com> wrote: > Hey Ismael, > > Thanks for the comments! Here are my answers: > > 1. Yes it has been considered. Here are the reasons why we don't do it > through controller. > > - There can be use-cases where we only want to rebalance the load of log > directories on a given broker. It seems unnecessary to go through > controller in this case. > > - If controller is responsible for sending ChangeReplicaDirRequest, and if > the user-specified log directory is either invalid or offline, then > controller probably needs a way to tell user that the partition > reassignment has failed. We currently don't have a way to do this since > kafka-reassign-partition.sh simply creates the reassignment znode without > waiting for response. I am not sure that is a good solution to this. > > - If controller is responsible for sending ChangeReplicaDirRequest, the > controller logic would be more complicated because controller needs to > first send ChangeReplicaRequest so that the broker memorize the partition > -> log directory mapping, send LeaderAndIsrRequest, and keep sending > ChangeReplicaDirRequest (just in case broker restarted) until replica is > created. Note that the last step needs repeat and timeout as the proposed > in the KIP-113. > > Overall I think this adds quite a bit complexity to controller and we > probably want to do this only if there is strong clear of doing so. > Currently in KIP-113 the kafka-reassign-partitions.sh is responsible for > sending ChangeReplicaDirRequest with repeat and provides error to user if > it either fails or timeout. It seems to be much simpler and user shouldn't > care whether it is done through controller. > > And thanks for the suggestion. I will add this to the Rejected Alternative > Section in the KIP-113. > > 2) I think user needs to be able to specify different log directories for > the replicas of the same partition in order to rebalance load across log > directories of all brokers. I am not sure I understand the question. Can > you explain a bit more why "that the log directory has to be the same for > all replicas of a given partition"? > > 3) Good point. I think the alterReplicaDir is a better than > changeReplicaDir for the reason you provided. I will also update names of > the request/response as well in the KIP. > > > Thanks, > Dong > > On Fri, Aug 4, 2017 at 9:49 AM, Ismael Juma <ism...@juma.me.uk> wrote: > > > Thanks Dong. I have a few initial questions, sorry if I it has been > > discussed and I missed it. > > > > 1. The KIP suggests that the reassignment tool is responsible for sending > > the ChangeReplicaDirRequests to the relevant brokers. I had imagined that > > this would be done by the Controller, like the rest of the reassignment > > process. Was this considered? If so, it would be good to include the > > details of why it was rejected in the "Rejected Alternatives" section. > > > > 2. The reassignment JSON format was extended so that one can choose the > log > > directory for a partition. This means that the log directory has to be > the > > same for all replicas of a given partition. The alternative would be for > > the log dir to be assignable for each replica. Similar to the other > > question, it would be good to have a section in "Rejected Alternatives" > for > > this approach. It's generally very helpful to have more information on > the > > rationale for the design choices that were made and rejected. > > > > 3. Should changeReplicaDir be alterReplicaDir? We have used `alter` for > > other methods. > > > > Thanks, > > Ismael > > > > > > > > > > On Fri, Aug 4, 2017 at 5:37 AM, Dong Lin <lindon...@gmail.com> wrote: > > > > > Hi all, > > > > > > I realized that we need new API in AdminClient in order to use the new > > > request/response added in KIP-113. Since this is required by KIP-113, I > > > choose to add the new interface in this KIP instead of creating a new > > KIP. > > > > > > The documentation of the new API in AdminClient can be found here > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP- > > > 113%3A+Support+replicas+movement+between+log+directories#KIP-113: > > > Supportreplicasmovementbetweenlogdirectories-AdminClient>. > > > Can you please review and comment if you have any concern? > > > > > > Thanks! > > > Dong > > > > > > > > > > > > On Wed, Jul 12, 2017 at 10:44 AM, Dong Lin <lindon...@gmail.com> > wrote: > > > > > > > The protocol change has been updated in KIP-113 > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP- > > > 113%3A+Support+replicas+movement+between+log+directories> > > > > . > > > > > > > > On Wed, Jul 12, 2017 at 10:44 AM, Dong Lin <lindon...@gmail.com> > > wrote: > > > > > > > >> Hi all, > > > >> > > > >> I have made a minor change to the DescribeDirsRequest so that user > can > > > >> choose to query the status for a specific list of partitions. This > is > > a > > > bit > > > >> more fine-granular than the previous format that allows user to > query > > > the > > > >> status for a specific list of topics. I realized that querying the > > > status > > > >> of selected partitions can be useful to check the whether the > > > reassignment > > > >> of the replicas to the specific log directories has been completed. > > > >> > > > >> I will assume this minor change is OK if there is no concern with it > > in > > > >> the community :) > > > >> > > > >> Thanks, > > > >> Dong > > > >> > > > >> > > > >> > > > >> On Mon, Jun 12, 2017 at 10:46 AM, Dong Lin <lindon...@gmail.com> > > wrote: > > > >> > > > >>> Hey Colin, > > > >>> > > > >>> Thanks for the suggestion. We have actually considered this and > list > > > >>> this as the first future work in KIP-112 > > > >>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP- > > > 112%3A+Handle+disk+failure+for+JBOD>. > > > >>> The two advantages that you mentioned are exactly the motivation > for > > > this > > > >>> feature. Also as you have mentioned, this involves the tradeoff > > between > > > >>> disk performance and availability -- the more you distribute topic > > > across > > > >>> disks, the more topics will be offline due to a single disk > failure. > > > >>> > > > >>> Despite its complexity, it is not clear to me that the reduced > > > rebalance > > > >>> overhead is worth the reduction in availability. I am optimistic > that > > > the > > > >>> rebalance overhead will not be that a big problem since we are not > > too > > > >>> bothered by cross-broker rebalance as of now. > > > >>> > > > >>> Thanks, > > > >>> Dong > > > >>> > > > >>> On Mon, Jun 12, 2017 at 10:36 AM, Colin McCabe <cmcc...@apache.org > > > > > >>> wrote: > > > >>> > > > >>>> Has anyone considered a scheme for sharding topic data across > > multiple > > > >>>> disks? > > > >>>> > > > >>>> For example, if you sharded topics across 3 disks, and you had 10 > > > disks, > > > >>>> you could pick a different set of 3 disks for each topic. If you > > > >>>> distribute them randomly then you have 10 choose 3 = 120 different > > > >>>> combinations. You would probably never need rebalancing if you > had > > a > > > >>>> reasonable distribution of topic sizes (could probably prove this > > > with a > > > >>>> Monte Carlo or something). > > > >>>> > > > >>>> The disadvantage is that if one of the 3 disks fails, then you > have > > to > > > >>>> take the topic offline. But if we assume independent disk failure > > > >>>> probabilities, probability of failure with RAID 0 is: 1 - > > > >>>> Psuccess^(num_disks) whereas the probability of failure with this > > > scheme > > > >>>> is 1 - Psuccess ^ 3. > > > >>>> > > > >>>> This addresses the biggest downsides of JBOD now: > > > >>>> * limiting a topic to the size of a single disk limits scalability > > > >>>> * the topic movement process is tricky to get right and involves > > > "racing > > > >>>> against producers" and wasted double I/Os > > > >>>> > > > >>>> Of course, one other question is how frequently we add new disk > > drives > > > >>>> to an existing broker. In this case, you might reasonably want > disk > > > >>>> rebalancing to avoid overloading the new disk(s) with writes. > > > >>>> > > > >>>> cheers, > > > >>>> Colin > > > >>>> > > > >>>> > > > >>>> On Fri, Jun 9, 2017, at 18:46, Jun Rao wrote: > > > >>>> > Just a few comments on this. > > > >>>> > > > > >>>> > 1. One of the issues with using RAID 0 is that a single disk > > failure > > > >>>> > causes > > > >>>> > a hard failure of the broker. Hard failure increases the > > > >>>> unavailability > > > >>>> > window for all the partitions on the failed broker, which > includes > > > the > > > >>>> > failure detection time (tied to ZK session timeout right now) > and > > > >>>> leader > > > >>>> > election time by the controller. If we support JBOD natively, > > when a > > > >>>> > single > > > >>>> > disk fails, only partitions on the failed disk will experience a > > > hard > > > >>>> > failure. The availability for partitions on the rest of the > disks > > > are > > > >>>> not > > > >>>> > affected. > > > >>>> > > > > >>>> > 2. For running things on the Cloud such as AWS. Currently, each > > EBS > > > >>>> > volume > > > >>>> > has a throughout limit of about 300MB/sec. If you get an > enhanced > > > EC2 > > > >>>> > instance, you can get 20Gb/sec network. To saturate the network, > > you > > > >>>> may > > > >>>> > need about 7 EBS volumes. So, being able to support JBOD in the > > > Cloud > > > >>>> is > > > >>>> > still potentially useful. > > > >>>> > > > > >>>> > 3. On the benefit of balancing data across disks within the same > > > >>>> broker. > > > >>>> > Data imbalance can happen across brokers as well as across disks > > > >>>> within > > > >>>> > the > > > >>>> > same broker. Balancing the data across disks within the broker > has > > > the > > > >>>> > benefit of saving network bandwidth as Dong mentioned. So, if > > intra > > > >>>> > broker > > > >>>> > load balancing is possible, it's probably better to avoid the > more > > > >>>> > expensive inter broker load balancing. One of the reasons for > disk > > > >>>> > imbalance right now is that partitions within a broker are > > assigned > > > to > > > >>>> > disks just based on the partition count. So, it does seem > possible > > > for > > > >>>> > disks to get imbalanced from time to time. If someone can share > > some > > > >>>> > stats > > > >>>> > for that in practice, that will be very helpful. > > > >>>> > > > > >>>> > Thanks, > > > >>>> > > > > >>>> > Jun > > > >>>> > > > > >>>> > > > > >>>> > On Wed, Jun 7, 2017 at 2:30 PM, Dong Lin <lindon...@gmail.com> > > > wrote: > > > >>>> > > > > >>>> > > Hey Sriram, > > > >>>> > > > > > >>>> > > I think there is one way to explain why the ability to move > > > replica > > > >>>> between > > > >>>> > > disks can save space. Let's say the load is distributed to > disks > > > >>>> > > independent of the broker. Sooner or later, the load imbalance > > > will > > > >>>> exceed > > > >>>> > > a threshold and we will need to rebalance load across disks. > Now > > > our > > > >>>> > > questions is whether our rebalancing algorithm will be able to > > > take > > > >>>> > > advantage of locality by moving replicas between disks on the > > same > > > >>>> broker. > > > >>>> > > > > > >>>> > > Say for a given disk, there is 20% probability it is > overloaded, > > > 20% > > > >>>> > > probability it is underloaded, and 60% probability its load is > > > >>>> around the > > > >>>> > > expected average load if the cluster is well balanced. Then > for > > a > > > >>>> broker of > > > >>>> > > 10 disks, we would 2 disks need to have in-bound replica > > movement, > > > >>>> 2 disks > > > >>>> > > need to have out-bound replica movement, and 6 disks do not > need > > > >>>> replica > > > >>>> > > movement. Thus we would expect KIP-113 to be useful since we > > will > > > >>>> be able > > > >>>> > > to move replica from the two over-loaded disks to the two > > > >>>> under-loaded > > > >>>> > > disks on the same broKER. Does this make sense? > > > >>>> > > > > > >>>> > > Thanks, > > > >>>> > > Dong > > > >>>> > > > > > >>>> > > > > > >>>> > > > > > >>>> > > > > > >>>> > > > > > >>>> > > > > > >>>> > > On Wed, Jun 7, 2017 at 2:12 PM, Dong Lin <lindon...@gmail.com > > > > > >>>> wrote: > > > >>>> > > > > > >>>> > > > Hey Sriram, > > > >>>> > > > > > > >>>> > > > Thanks for raising these concerns. Let me answer these > > questions > > > >>>> below: > > > >>>> > > > > > > >>>> > > > - The benefit of those additional complexity to move the > data > > > >>>> stored on a > > > >>>> > > > disk within the broker is to avoid network bandwidth usage. > > > >>>> Creating > > > >>>> > > > replica on another broker is less efficient than creating > > > replica > > > >>>> on > > > >>>> > > > another disk in the same broker IF there is actually > > > >>>> lightly-loaded disk > > > >>>> > > on > > > >>>> > > > the same broker. > > > >>>> > > > > > > >>>> > > > - In my opinion the rebalance algorithm would this: 1) we > > > balance > > > >>>> the > > > >>>> > > load > > > >>>> > > > across brokers using the same algorithm we are using today. > 2) > > > we > > > >>>> balance > > > >>>> > > > load across disk on a given broker using a greedy algorithm, > > > i.e. > > > >>>> move > > > >>>> > > > replica from the overloaded disk to lightly loaded disk. The > > > >>>> greedy > > > >>>> > > > algorithm would only consider the capacity and replica size. > > We > > > >>>> can > > > >>>> > > improve > > > >>>> > > > it to consider throughput in the future. > > > >>>> > > > > > > >>>> > > > - With 30 brokers with each having 10 disks, using the > > > rebalancing > > > >>>> > > algorithm, > > > >>>> > > > the chances of choosing disks within the broker can be high. > > > >>>> There will > > > >>>> > > > always be load imbalance across disks of the same broker for > > the > > > >>>> same > > > >>>> > > > reason that there will always be load imbalance across > > brokers. > > > >>>> The > > > >>>> > > > algorithm specified above will take advantage of the > locality, > > > >>>> i.e. first > > > >>>> > > > balance load across disks of the same broker, and only > balance > > > >>>> across > > > >>>> > > > brokers if some brokers are much more loaded than others. > > > >>>> > > > > > > >>>> > > > I think it is useful to note that the load imbalance across > > > disks > > > >>>> of the > > > >>>> > > > same broker is independent of the load imbalance across > > brokers. > > > >>>> Both are > > > >>>> > > > guaranteed to happen in any Kafka cluster for the same > reason, > > > >>>> i.e. > > > >>>> > > > variation in the partition size. Say broker 1 have two disks > > > that > > > >>>> are 80% > > > >>>> > > > loaded and 20% loaded. And broker 2 have two disks that are > > also > > > >>>> 80% > > > >>>> > > > loaded and 20%. We can balance them without inter-broker > > traffic > > > >>>> with > > > >>>> > > > KIP-113. This is why I think KIP-113 can be very useful. > > > >>>> > > > > > > >>>> > > > Do these explanation sound reasonable? > > > >>>> > > > > > > >>>> > > > Thanks, > > > >>>> > > > Dong > > > >>>> > > > > > > >>>> > > > > > > >>>> > > > On Wed, Jun 7, 2017 at 1:33 PM, Sriram Subramanian < > > > >>>> r...@confluent.io> > > > >>>> > > > wrote: > > > >>>> > > > > > > >>>> > > >> Hey Dong, > > > >>>> > > >> > > > >>>> > > >> Thanks for the explanation. I don't think anyone is denying > > > that > > > >>>> we > > > >>>> > > should > > > >>>> > > >> rebalance at the disk level. I think it is important to > > restore > > > >>>> the disk > > > >>>> > > >> and not wait for disk replacement. There are also other > > > benefits > > > >>>> of > > > >>>> > > doing > > > >>>> > > >> that which is that you don't need to opt for hot swap racks > > > that > > > >>>> can > > > >>>> > > save > > > >>>> > > >> cost. > > > >>>> > > >> > > > >>>> > > >> The question here is what do you save by trying to add > > > >>>> complexity to > > > >>>> > > move > > > >>>> > > >> the data stored on a disk within the broker? Why would you > > not > > > >>>> simply > > > >>>> > > >> create another replica on the disk that results in a > balanced > > > >>>> load > > > >>>> > > across > > > >>>> > > >> brokers and have it catch up. We are missing a few things > > here > > > - > > > >>>> > > >> 1. What would your data balancing algorithm be? Would it > > > include > > > >>>> just > > > >>>> > > >> capacity or will it also consider throughput on disk to > > decide > > > >>>> on the > > > >>>> > > >> final > > > >>>> > > >> location of a partition? > > > >>>> > > >> 2. With 30 brokers with each having 10 disks, using the > > > >>>> rebalancing > > > >>>> > > >> algorithm, the chances of choosing disks within the broker > is > > > >>>> going to > > > >>>> > > be > > > >>>> > > >> low. This probability further decreases with more brokers > and > > > >>>> disks. > > > >>>> > > Given > > > >>>> > > >> that, why are we trying to save network cost? How much > would > > > >>>> that saving > > > >>>> > > >> be > > > >>>> > > >> if you go that route? > > > >>>> > > >> > > > >>>> > > >> These questions are hard to answer without having to verify > > > >>>> empirically. > > > >>>> > > >> My > > > >>>> > > >> suggestion is to avoid doing pre matured optimization that > > > >>>> brings in the > > > >>>> > > >> added complexity to the code and treat inter and intra > broker > > > >>>> movements > > > >>>> > > of > > > >>>> > > >> partition the same. Deploy the code, use it and see if it > is > > an > > > >>>> actual > > > >>>> > > >> problem and you get great savings by avoiding the network > > route > > > >>>> to move > > > >>>> > > >> partitions within the same broker. If so, add this > > > optimization. > > > >>>> > > >> > > > >>>> > > >> On Wed, Jun 7, 2017 at 1:03 PM, Dong Lin < > > lindon...@gmail.com> > > > >>>> wrote: > > > >>>> > > >> > > > >>>> > > >> > Hey Jay, Sriram, > > > >>>> > > >> > > > > >>>> > > >> > Great point. If I understand you right, you are > suggesting > > > >>>> that we can > > > >>>> > > >> > simply use RAID-0 so that the load can be evenly > > distributed > > > >>>> across > > > >>>> > > >> disks. > > > >>>> > > >> > And even though a disk failure will bring down the enter > > > >>>> broker, the > > > >>>> > > >> > reduced availability as compared to using KIP-112 and > > KIP-113 > > > >>>> will may > > > >>>> > > >> be > > > >>>> > > >> > negligible. And it may be better to just accept the > > slightly > > > >>>> reduced > > > >>>> > > >> > availability instead of introducing the complexity from > > > >>>> KIP-112 and > > > >>>> > > >> > KIP-113. > > > >>>> > > >> > > > > >>>> > > >> > Let's assume the following: > > > >>>> > > >> > > > > >>>> > > >> > - There are 30 brokers in a cluster and each broker has > 10 > > > >>>> disks > > > >>>> > > >> > - The replication factor is 3 and min.isr = 2. > > > >>>> > > >> > - The probability of annual disk failure rate is 2% > > according > > > >>>> to this > > > >>>> > > >> > <https://www.backblaze.com/blo > > g/hard-drive-failure-rates-q1- > > > >>>> 2017/> > > > >>>> > > >> blog. > > > >>>> > > >> > - It takes 3 days to replace a disk. > > > >>>> > > >> > > > > >>>> > > >> > Here is my calculation for probability of data loss due > to > > > disk > > > >>>> > > failure: > > > >>>> > > >> > probability of a given disk fails in a given year: 2% > > > >>>> > > >> > probability of a given disk stays offline for one day in > a > > > >>>> given day: > > > >>>> > > >> 2% / > > > >>>> > > >> > 365 * 3 > > > >>>> > > >> > probability of a given broker stays offline for one day > in > > a > > > >>>> given day > > > >>>> > > >> due > > > >>>> > > >> > to disk failure: 2% / 365 * 3 * 10 > > > >>>> > > >> > probability of any broker stays offline for one day in a > > > given > > > >>>> day due > > > >>>> > > >> to > > > >>>> > > >> > disk failure: 2% / 365 * 3 * 10 * 30 = 5% > > > >>>> > > >> > probability of any three broker stays offline for one day > > in > > > a > > > >>>> given > > > >>>> > > day > > > >>>> > > >> > due to disk failure: 5% * 5% * 5% = 0.0125% > > > >>>> > > >> > probability of data loss due to disk failure: 0.0125% > > > >>>> > > >> > > > > >>>> > > >> > Here is my calculation for probability of service > > > >>>> unavailability due > > > >>>> > > to > > > >>>> > > >> > disk failure: > > > >>>> > > >> > probability of a given disk fails in a given year: 2% > > > >>>> > > >> > probability of a given disk stays offline for one day in > a > > > >>>> given day: > > > >>>> > > >> 2% / > > > >>>> > > >> > 365 * 3 > > > >>>> > > >> > probability of a given broker stays offline for one day > in > > a > > > >>>> given day > > > >>>> > > >> due > > > >>>> > > >> > to disk failure: 2% / 365 * 3 * 10 > > > >>>> > > >> > probability of any broker stays offline for one day in a > > > given > > > >>>> day due > > > >>>> > > >> to > > > >>>> > > >> > disk failure: 2% / 365 * 3 * 10 * 30 = 5% > > > >>>> > > >> > probability of any two broker stays offline for one day > in > > a > > > >>>> given day > > > >>>> > > >> due > > > >>>> > > >> > to disk failure: 5% * 5% * 5% = 0.25% > > > >>>> > > >> > probability of unavailability due to disk failure: 0.25% > > > >>>> > > >> > > > > >>>> > > >> > Note that the unavailability due to disk failure will be > > > >>>> unacceptably > > > >>>> > > >> high > > > >>>> > > >> > in this case. And the probability of data loss due to > disk > > > >>>> failure > > > >>>> > > will > > > >>>> > > >> be > > > >>>> > > >> > higher than 0.01%. Neither is acceptable if Kafka is > > intended > > > >>>> to > > > >>>> > > achieve > > > >>>> > > >> > four nigh availability. > > > >>>> > > >> > > > > >>>> > > >> > Thanks, > > > >>>> > > >> > Dong > > > >>>> > > >> > > > > >>>> > > >> > > > > >>>> > > >> > On Tue, Jun 6, 2017 at 11:26 PM, Jay Kreps < > > j...@confluent.io > > > > > > > >>>> wrote: > > > >>>> > > >> > > > > >>>> > > >> > > I think Ram's point is that in place failure is pretty > > > >>>> complicated, > > > >>>> > > >> and > > > >>>> > > >> > > this is meant to be a cost saving feature, we should > > > >>>> construct an > > > >>>> > > >> > argument > > > >>>> > > >> > > for it grounded in data. > > > >>>> > > >> > > > > > >>>> > > >> > > Assume an annual failure rate of 1% (reasonable, but > data > > > is > > > >>>> > > available > > > >>>> > > >> > > online), and assume it takes 3 days to get the drive > > > >>>> replaced. Say > > > >>>> > > you > > > >>>> > > >> > have > > > >>>> > > >> > > 10 drives per server. Then the expected downtime for > each > > > >>>> server is > > > >>>> > > >> > roughly > > > >>>> > > >> > > 1% * 3 days * 10 = 0.3 days/year (this is slightly off > > > since > > > >>>> I'm > > > >>>> > > >> ignoring > > > >>>> > > >> > > the case of multiple failures, but I don't know that > > > changes > > > >>>> it > > > >>>> > > >> much). So > > > >>>> > > >> > > the savings from this feature is 0.3/365 = 0.08%. Say > you > > > >>>> have 1000 > > > >>>> > > >> > servers > > > >>>> > > >> > > and they cost $3000/year fully loaded including power, > > the > > > >>>> cost of > > > >>>> > > >> the hw > > > >>>> > > >> > > amortized over it's life, etc. Then this feature saves > > you > > > >>>> $3000 on > > > >>>> > > >> your > > > >>>> > > >> > > total server cost of $3m which seems not very > worthwhile > > > >>>> compared to > > > >>>> > > >> > other > > > >>>> > > >> > > optimizations...? > > > >>>> > > >> > > > > > >>>> > > >> > > Anyhow, not sure the arithmetic is right there, but i > > think > > > >>>> that is > > > >>>> > > >> the > > > >>>> > > >> > > type of argument that would be helpful to think about > the > > > >>>> tradeoff > > > >>>> > > in > > > >>>> > > >> > > complexity. > > > >>>> > > >> > > > > > >>>> > > >> > > -Jay > > > >>>> > > >> > > > > > >>>> > > >> > > > > > >>>> > > >> > > > > > >>>> > > >> > > On Tue, Jun 6, 2017 at 7:09 PM, Dong Lin < > > > >>>> lindon...@gmail.com> > > > >>>> > > wrote: > > > >>>> > > >> > > > > > >>>> > > >> > > > Hey Sriram, > > > >>>> > > >> > > > > > > >>>> > > >> > > > Thanks for taking time to review the KIP. Please see > > > below > > > >>>> my > > > >>>> > > >> answers > > > >>>> > > >> > to > > > >>>> > > >> > > > your questions: > > > >>>> > > >> > > > > > > >>>> > > >> > > > >1. Could you pick a hardware/Kafka configuration and > > go > > > >>>> over what > > > >>>> > > >> is > > > >>>> > > >> > the > > > >>>> > > >> > > > >average disk/partition repair/restore time that we > are > > > >>>> targeting > > > >>>> > > >> for a > > > >>>> > > >> > > > >typical JBOD setup? > > > >>>> > > >> > > > > > > >>>> > > >> > > > We currently don't have this data. I think the > > > >>>> disk/partition > > > >>>> > > >> > > repair/store > > > >>>> > > >> > > > time depends on availability of hardware, the > response > > > >>>> time of > > > >>>> > > >> > > > site-reliability engineer, the amount of data on the > > bad > > > >>>> disk etc. > > > >>>> > > >> > These > > > >>>> > > >> > > > vary between companies and even clusters within the > > same > > > >>>> company > > > >>>> > > >> and it > > > >>>> > > >> > > is > > > >>>> > > >> > > > probably hard to determine what is the average > > situation. > > > >>>> > > >> > > > > > > >>>> > > >> > > > I am not very sure why we need this. Can you explain > a > > > bit > > > >>>> why > > > >>>> > > this > > > >>>> > > >> > data > > > >>>> > > >> > > is > > > >>>> > > >> > > > useful to evaluate the motivation and design of this > > KIP? > > > >>>> > > >> > > > > > > >>>> > > >> > > > >2. How often do we believe disks are going to fail > (in > > > >>>> your > > > >>>> > > example > > > >>>> > > >> > > > >configuration) and what do we gain by avoiding the > > > network > > > >>>> > > overhead > > > >>>> > > >> > and > > > >>>> > > >> > > > >doing all the work of moving the replica within the > > > >>>> broker to > > > >>>> > > >> another > > > >>>> > > >> > > disk > > > >>>> > > >> > > > >instead of balancing it globally? > > > >>>> > > >> > > > > > > >>>> > > >> > > > I think the chance of disk failure depends mainly on > > the > > > >>>> disk > > > >>>> > > itself > > > >>>> > > >> > > rather > > > >>>> > > >> > > > than the broker configuration. I don't have this data > > > now. > > > >>>> I will > > > >>>> > > >> ask > > > >>>> > > >> > our > > > >>>> > > >> > > > SRE whether they know the mean-time-to-fail for our > > disk. > > > >>>> What I > > > >>>> > > was > > > >>>> > > >> > told > > > >>>> > > >> > > > by SRE is that disk failure is the most common type > of > > > >>>> hardware > > > >>>> > > >> > failure. > > > >>>> > > >> > > > > > > >>>> > > >> > > > When there is disk failure, I think it is reasonable > to > > > >>>> move > > > >>>> > > >> replica to > > > >>>> > > >> > > > another broker instead of another disk on the same > > > broker. > > > >>>> The > > > >>>> > > >> reason > > > >>>> > > >> > we > > > >>>> > > >> > > > want to move replica within broker is mainly to > > optimize > > > >>>> the Kafka > > > >>>> > > >> > > cluster > > > >>>> > > >> > > > performance when we balance load across disks. > > > >>>> > > >> > > > > > > >>>> > > >> > > > In comparison to balancing replicas globally, the > > benefit > > > >>>> of > > > >>>> > > moving > > > >>>> > > >> > > replica > > > >>>> > > >> > > > within broker is that: > > > >>>> > > >> > > > > > > >>>> > > >> > > > 1) the movement is faster since it doesn't go through > > > >>>> socket or > > > >>>> > > >> rely on > > > >>>> > > >> > > the > > > >>>> > > >> > > > available network bandwidth; > > > >>>> > > >> > > > 2) much less impact on the replication traffic > between > > > >>>> broker by > > > >>>> > > not > > > >>>> > > >> > > taking > > > >>>> > > >> > > > up bandwidth between brokers. Depending on the > pattern > > of > > > >>>> traffic, > > > >>>> > > >> we > > > >>>> > > >> > may > > > >>>> > > >> > > > need to balance load across disk frequently and it is > > > >>>> necessary to > > > >>>> > > >> > > prevent > > > >>>> > > >> > > > this operation from slowing down the existing > operation > > > >>>> (e.g. > > > >>>> > > >> produce, > > > >>>> > > >> > > > consume, replication) in the Kafka cluster. > > > >>>> > > >> > > > 3) It gives us opportunity to do automatic broker > > > rebalance > > > >>>> > > between > > > >>>> > > >> > disks > > > >>>> > > >> > > > on the same broker. > > > >>>> > > >> > > > > > > >>>> > > >> > > > > > > >>>> > > >> > > > >3. Even if we had to move the replica within the > > broker, > > > >>>> why > > > >>>> > > >> cannot we > > > >>>> > > >> > > > just > > > >>>> > > >> > > > >treat it as another replica and have it go through > the > > > >>>> same > > > >>>> > > >> > replication > > > >>>> > > >> > > > >code path that we have today? The downside here is > > > >>>> obviously that > > > >>>> > > >> you > > > >>>> > > >> > > need > > > >>>> > > >> > > > >to catchup from the leader but it is completely > free! > > > >>>> What do we > > > >>>> > > >> think > > > >>>> > > >> > > is > > > >>>> > > >> > > > >the impact of the network overhead in this case? > > > >>>> > > >> > > > > > > >>>> > > >> > > > Good point. My initial proposal actually used the > > > existing > > > >>>> > > >> > > > ReplicaFetcherThread (i.e. the existing code path) to > > > move > > > >>>> replica > > > >>>> > > >> > > between > > > >>>> > > >> > > > disks. However, I switched to use separate thread > pool > > > >>>> after > > > >>>> > > >> discussion > > > >>>> > > >> > > > with Jun and Becket. > > > >>>> > > >> > > > > > > >>>> > > >> > > > The main argument for using separate thread pool is > to > > > >>>> actually > > > >>>> > > keep > > > >>>> > > >> > the > > > >>>> > > >> > > > design simply and easy to reason about. There are a > > > number > > > >>>> of > > > >>>> > > >> > difference > > > >>>> > > >> > > > between inter-broker replication and intra-broker > > > >>>> replication > > > >>>> > > which > > > >>>> > > >> > makes > > > >>>> > > >> > > > it cleaner to do them in separate code path. I will > > list > > > >>>> them > > > >>>> > > below: > > > >>>> > > >> > > > > > > >>>> > > >> > > > - The throttling mechanism for inter-broker > replication > > > >>>> traffic > > > >>>> > > and > > > >>>> > > >> > > > intra-broker replication traffic is different. For > > > >>>> example, we may > > > >>>> > > >> want > > > >>>> > > >> > > to > > > >>>> > > >> > > > specify per-topic quota for inter-broker replication > > > >>>> traffic > > > >>>> > > >> because we > > > >>>> > > >> > > may > > > >>>> > > >> > > > want some topic to be moved faster than other topic. > > But > > > >>>> we don't > > > >>>> > > >> care > > > >>>> > > >> > > > about priority of topics for intra-broker movement. > So > > > the > > > >>>> current > > > >>>> > > >> > > proposal > > > >>>> > > >> > > > only allows user to specify per-broker quota for > > > >>>> inter-broker > > > >>>> > > >> > replication > > > >>>> > > >> > > > traffic. > > > >>>> > > >> > > > > > > >>>> > > >> > > > - The quota value for inter-broker replication > traffic > > > and > > > >>>> > > >> intra-broker > > > >>>> > > >> > > > replication traffic is different. The available > > bandwidth > > > >>>> for > > > >>>> > > >> > > inter-broker > > > >>>> > > >> > > > replication can probably be much higher than the > > > bandwidth > > > >>>> for > > > >>>> > > >> > > inter-broker > > > >>>> > > >> > > > replication. > > > >>>> > > >> > > > > > > >>>> > > >> > > > - The ReplicaFetchThread is per broker. Intuitively, > > the > > > >>>> number of > > > >>>> > > >> > > threads > > > >>>> > > >> > > > doing intra broker data movement should be related to > > the > > > >>>> number > > > >>>> > > of > > > >>>> > > >> > disks > > > >>>> > > >> > > > in the broker, not the number of brokers in the > > cluster. > > > >>>> > > >> > > > > > > >>>> > > >> > > > - The leader replica has no ReplicaFetchThread to > start > > > >>>> with. It > > > >>>> > > >> seems > > > >>>> > > >> > > > weird to > > > >>>> > > >> > > > start one just for intra-broker replication. > > > >>>> > > >> > > > > > > >>>> > > >> > > > Because of these difference, we think it is simpler > to > > > use > > > >>>> > > separate > > > >>>> > > >> > > thread > > > >>>> > > >> > > > pool and code path so that we can configure and > > throttle > > > >>>> them > > > >>>> > > >> > separately. > > > >>>> > > >> > > > > > > >>>> > > >> > > > > > > >>>> > > >> > > > >4. What are the chances that we will be able to > > identify > > > >>>> another > > > >>>> > > >> disk > > > >>>> > > >> > to > > > >>>> > > >> > > > >balance within the broker instead of another disk on > > > >>>> another > > > >>>> > > >> broker? > > > >>>> > > >> > If > > > >>>> > > >> > > we > > > >>>> > > >> > > > >have 100's of machines, the probability of finding a > > > >>>> better > > > >>>> > > >> balance by > > > >>>> > > >> > > > >choosing another broker is much higher than > balancing > > > >>>> within the > > > >>>> > > >> > broker. > > > >>>> > > >> > > > >Could you add some info on how we are determining > > this? > > > >>>> > > >> > > > > > > >>>> > > >> > > > It is possible that we can find available space on a > > > remote > > > >>>> > > broker. > > > >>>> > > >> The > > > >>>> > > >> > > > benefit of allowing intra-broker replication is that, > > > when > > > >>>> there > > > >>>> > > are > > > >>>> > > >> > > > available space in both the current broker and a > remote > > > >>>> broker, > > > >>>> > > the > > > >>>> > > >> > > > rebalance can be completed faster with much less > impact > > > on > > > >>>> the > > > >>>> > > >> > > inter-broker > > > >>>> > > >> > > > replication or the users traffic. It is about taking > > > >>>> advantage of > > > >>>> > > >> > > locality > > > >>>> > > >> > > > when balance the load. > > > >>>> > > >> > > > > > > >>>> > > >> > > > >5. Finally, in a cloud setup where more users are > > going > > > to > > > >>>> > > >> leverage a > > > >>>> > > >> > > > >shared filesystem (example, EBS in AWS), all this > > change > > > >>>> is not > > > >>>> > > of > > > >>>> > > >> > much > > > >>>> > > >> > > > >gain since you don't need to balance between the > > volumes > > > >>>> within > > > >>>> > > the > > > >>>> > > >> > same > > > >>>> > > >> > > > >broker. > > > >>>> > > >> > > > > > > >>>> > > >> > > > You are right. This KIP-113 is useful only if user > uses > > > >>>> JBOD. If > > > >>>> > > >> user > > > >>>> > > >> > > uses > > > >>>> > > >> > > > an extra storage layer of replication, such as > RAID-10 > > or > > > >>>> EBS, > > > >>>> > > they > > > >>>> > > >> > don't > > > >>>> > > >> > > > need KIP-112 or KIP-113. Note that user will > replicate > > > >>>> data more > > > >>>> > > >> times > > > >>>> > > >> > > than > > > >>>> > > >> > > > the replication factor of the Kafka topic if an extra > > > >>>> storage > > > >>>> > > layer > > > >>>> > > >> of > > > >>>> > > >> > > > replication is used. > > > >>>> > > >> > > > > > > >>>> > > >> > > > > > >>>> > > >> > > > > >>>> > > >> > > > >>>> > > > > > > >>>> > > > > > > >>>> > > > > > >>>> > > > >>> > > > >>> > > > >> > > > > > > > > > >