Hey Jun, I think there is a simpler design that doesn't need the "create" flag in LeaderAndIsrRequest and also removes the need for the controller to track/update which replicas are created. The idea is for each broker to persist its created replicas in a per-broker-per-topic znode. When a replica is created or deleted, the broker updates the znode accordingly. When the broker receives a LeaderAndIsrRequest, it derives the "create" flag from its cached copy of this znode data. When a broker starts, it does need to read a number of znodes proportional to the number of topics on its disks. The controller still needs to learn about offline replicas from the LeaderAndIsrResponse.
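For example (just an illustration of the idea, not a concrete proposal — the actual path name and JSON fields would still need to be specified in the KIP), such a znode could look like:

  /brokers/ids/[brokerId]/created-replicas/[topic]  (persistent, written only by this broker) =>
  {
    "version" : 1,
    "created" : [1, 2, 3]
  }

where "created" lists the partitions of [topic] for which this broker has created a local replica. The broker adds/removes a partition id when it creates/deletes the replica, and when handling a LeaderAndIsrRequest it treats any replica that is not in its cached list as not yet created.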
I think this is better than the current design. Do you have any concern with this design? Thanks, Dong On Thu, Feb 23, 2017 at 7:12 PM, Dong Lin <lindon...@gmail.com> wrote: > Hey Jun, > > Sure, here is my explanation. > > Design B would not work if it doesn't store created replicas in the ZK. > For example, say broker B is healthy when it is shut down. At this moment no > offline replica is written in ZK for this broker. Suppose a log directory is > damaged while the broker is offline; then when this broker starts, it won't know > which replicas are in the bad log directory. And it won't be able to > specify those offline replicas in /failed-log-directory either. > > Let's say design B stores created replicas in ZK. Then the next problem is > that, in the scenario that multiple log directories are damaged while the > broker is offline, when the broker starts, it won't be able to know the exact > list of offline replicas on each bad log directory. All it knows is the > offline replicas across all those bad log directories. Thus it is impossible > for the broker to specify offline replicas per log directory in this scenario. > > I agree with your observation that, if the admin replaces dir1 with a > good empty disk but leaves dir2 untouched, design A won't create the replica > whereas design B can create it. But I am not sure that is a case which > we want to optimize for. It seems reasonable for the admin to fix both log > directories in practice. If the admin fixes only one of the two log > directories, we can say it is a partial fix and Kafka won't re-create any > offline replicas on dir1 and dir2. Similar to the extra round of > LeaderAndIsrRequest in case of log directory failure, I think this is also a pretty > minor issue with design B. > > Thanks, > Dong > > > On Thu, Feb 23, 2017 at 6:46 PM, Jun Rao <j...@confluent.io> wrote: > >> Hi, Dong, >> >> My replies are inlined below. >> >> On Thu, Feb 23, 2017 at 4:47 PM, Dong Lin <lindon...@gmail.com> wrote: >> >> > Hey Jun, >> > >> > Thanks for your reply! Let me first comment on the things that you >> listed as >> > advantages of B over A. >> > >> > 1) No change in LeaderAndIsrRequest protocol. >> > >> > I agree with this. >> > >> > 2) Step 1. One less round of LeaderAndIsrRequest and no additional ZK >> > writes to record the created flag. >> > >> > I don't think this is true. There will be one round of >> LeaderAndIsrRequest >> > in both A and B. In design A the controller needs to write to ZK once to >> > record this replica as created. In design B the broker needs to write to >> > zookeeper once to record this replica as created. So there is the same >> number >> > of LeaderAndIsrRequests and ZK writes. >> > >> > The broker needs to record created replicas in design B so that when it >> > bootstraps with a failed log directory, the broker can derive the offline >> > replicas as the difference between created replicas and replicas found >> on >> > good log directories. >> > >> >> Design B actually doesn't write created replicas in ZK. When a broker >> starts up, all offline replicas are stored in the /failed-log-directory >> path in ZK. So if a replica is not there and is not in the live log >> directories either, it's never created. Does this work? >> >> >> >> > 3) Step 2. One less round of LeaderAndIsrRequest and no additional >> logic to >> > handle LeaderAndIsrResponse. 
>> > >> > While I agree there is one less round of LeaderAndIsrRequest in design >> B, I >> > don't think one additional LeaderAndIsrRequest to handle log directory >> > failure is a big deal given that it doesn't happen frequently. >> > >> > Also, while there is no additional logic to handle LeaderAndIsrResponse >> in >> > design B, I actually think this is something that controller should do >> > anyway. Say the broker stops responding to any requests without removing >> > itself from zookeeper, the only way for controller to realize this and >> > re-elect leader is to send request to this broker and handle response. >> The >> > is a problem that we don't do it as of now. >> > >> > 4) Step 6. Additional ZK reads proportional to # of failed log >> directories, >> > instead of # of partitions. >> > >> > If one znode is able to describe all topic partitions in a log >> directory, >> > then the existing znode /brokers/topics/[topic] should be able to >> describe >> > created replicas in addition to the assigned replicas for every >> partition >> > of the topic. In this case, design A requires no additional ZK reads >> > whereas design B ZK reads proportional to # of failed log directories. >> > >> > 5) Step 3. In design A, if a broker is restarted and the failed log >> > directory is unreadable, the broker doesn't know which replicas are on >> the >> > failed log directory. So, when the broker receives the LeadAndIsrRequest >> > with created = false, it's bit hard for the broker to decide whether it >> > should create the missing replica on other log directories. This is >> easier >> > in design B since the list of failed replicas are persisted in ZK. >> > >> > I don't understand why it is hard for broker to make decision in design >> A. >> > With design A, if a broker is started with a failed log directory and it >> > receives LeaderAndIsrRequest with created=false for a replica that can >> not >> > be found on any good log directory, broker will not create this >> replica. Is >> > there any drawback with this approach? >> > >> > >> > >> I was thinking about this case. Suppose two log directories dir1 and dir2 >> fail. The admin replaces dir1 with an empty new disk. The broker is >> restarted with dir1 alive, and dir2 still failing. Now, when receiving a >> LeaderAndIsrRequest including a replica that was previously in dir1, the >> broker won't be able to create that replica when it could. >> >> >> >> > Here is my summary of pros and cons of design B as compared to design A. >> > >> > pros: >> > >> > 1) No change to LeaderAndIsrRequest. >> > 2) One less round of LeaderAndIsrRequest in case of log directory >> failure. >> > >> > cons: >> > >> > 1) This is impossible for broker to figure out the log directory of >> offline >> > replicas for failed-log-directory/[directory] if multiple log >> directories >> > are unreadable when broker starts. >> > >> > >> Hmm, I am not sure that I get this point. If multiple log directories >> fail, >> design B stores each directory under /failed-log-directory, right? >> >> Thanks, >> >> Jun >> >> >> >> > 2) The znode size limit of failed-log-directory/[directory] essentially >> > limits the number of topic partitions that can exist on a log >> directory. It >> > becomes more of a problem when a broker is configured to use multiple >> log >> > directories each of which is a RAID-10 of large capacity. While this may >> > not be a problem in practice with additional requirement (e.g. 
don't use >> > more than one log directory if using RAID-10), ideally we want to avoid >> > such limit. >> > >> > 3) Extra ZK read of failed-log-directory/[directory] when broker starts >> > >> > >> > My main concern with the design B is the use of znode >> > /brokers/ids/[brokerId]/failed-log-directory/[directory]. I don't >> really >> > think other pros/cons of design B matter to us. Does my summary make >> sense? >> > >> > Thanks, >> > Dong >> > >> > >> > On Thu, Feb 23, 2017 at 2:20 PM, Jun Rao <j...@confluent.io> wrote: >> > >> > > Hi, Dong, >> > > >> > > Just so that we are on the same page. Let me spec out the alternative >> > > design a bit more and then compare. Let's call the current design A >> and >> > the >> > > alternative design B. >> > > >> > > Design B: >> > > >> > > New ZK path >> > > failed log directory path (persistent): This is created by a broker >> when >> > a >> > > log directory fails and is potentially removed when the broker is >> > > restarted. >> > > /brokers/ids/[brokerId]/failed-log-directory/directory1 => { json of >> the >> > > replicas in the log directory }. >> > > >> > > *1. Topic gets created* >> > > - Works the same as before. >> > > >> > > *2. A log directory stops working on a broker during runtime* >> > > >> > > - The controller watches the path /failed-log-directory for the new >> > znode. >> > > >> > > - The broker detects an offline log directory during runtime and marks >> > > affected replicas as offline in memory. >> > > >> > > - The broker writes the failed directory and all replicas in the >> failed >> > > directory under /failed-log-directory/directory1. >> > > >> > > - The controller reads /failed-log-directory/directory1 and stores in >> > > memory a list of failed replicas due to disk failures. >> > > >> > > - The controller moves those replicas due to disk failure to offline >> > state >> > > and triggers the state change in replica state machine. >> > > >> > > >> > > *3. Broker is restarted* >> > > >> > > - The broker reads /brokers/ids/[brokerId]/failed-log-directory, if >> any. >> > > >> > > - For each failed log directory it reads from ZK, if the log directory >> > > exists in log.dirs and is accessible now, or if the log directory no >> > longer >> > > exists in log.dirs, remove that log directory from >> failed-log-directory. >> > > Otherwise, the broker loads replicas in the failed log directory in >> > memory >> > > as offline. >> > > >> > > - The controller handles the failed log directory change event, if >> needed >> > > (same as #2). >> > > >> > > - The controller handles the broker registration event. >> > > >> > > >> > > *6. Controller failover* >> > > - Controller reads all child paths under /failed-log-directory to >> rebuild >> > > the list of failed replicas due to disk failures. Those replicas will >> be >> > > transitioned to the offline state during controller initialization. >> > > >> > > Comparing this with design A, I think the following are the things >> that >> > > design B simplifies. >> > > * No change in LeaderAndIsrRequest protocol. >> > > * Step 1. One less round of LeaderAndIsrRequest and no additional ZK >> > writes >> > > to record the created flag. >> > > * Step 2. One less round of LeaderAndIsrRequest and no additional >> logic >> > to >> > > handle LeaderAndIsrResponse. >> > > * Step 6. Additional ZK reads proportional to # of failed log >> > directories, >> > > instead of # of partitions. >> > > * Step 3. 
In design A, if a broker is restarted and the failed log >> > > directory is unreadable, the broker doesn't know which replicas are on >> > the >> > > failed log directory. So, when the broker receives the >> LeadAndIsrRequest >> > > with created = false, it's bit hard for the broker to decide whether >> it >> > > should create the missing replica on other log directories. This is >> > easier >> > > in design B since the list of failed replicas are persisted in ZK. >> > > >> > > Now, for some of the other things that you mentioned. >> > > >> > > * What happens if a log directory is renamed? >> > > I think this can be handled in the same way as non-existing log >> directory >> > > during broker restart. >> > > >> > > * What happens if replicas are moved manually across disks? >> > > Good point. Well, if all log directories are available, the failed log >> > > directory path will be cleared. In the rarer case that a log >> directory is >> > > still offline and one of the replicas registered in the failed log >> > > directory shows up in another available log directory, I am not quite >> > sure. >> > > Perhaps the simplest approach is to just error out and let the admin >> fix >> > > things manually? >> > > >> > > Thanks, >> > > >> > > Jun >> > > >> > > >> > > >> > > On Wed, Feb 22, 2017 at 3:39 PM, Dong Lin <lindon...@gmail.com> >> wrote: >> > > >> > > > Hey Jun, >> > > > >> > > > Thanks much for the explanation. I have some questions about 21 but >> > that >> > > is >> > > > less important than 20. 20 would require considerable change to the >> KIP >> > > and >> > > > probably requires weeks to discuss again. Thus I would like to be >> very >> > > sure >> > > > that we agree on the problems with the current design as you >> mentioned >> > > and >> > > > there is no foreseeable problem with the alternate design. >> > > > >> > > > Please see below I detail response. To summarize my points, I >> couldn't >> > > > figure out any non-trival drawback of the current design as >> compared to >> > > the >> > > > alternative design; and I couldn't figure out a good way to store >> > offline >> > > > replicas in the alternative design. Can you see if these points make >> > > sense? >> > > > Thanks in advance for your time!! >> > > > >> > > > >> > > > 1) The alternative design requires slightly more dependency on ZK. >> > While >> > > > both solutions store created replicas in the ZK, the alternative >> design >> > > > would also store offline replicas in ZK but the current design >> doesn't. >> > > > Thus >> > > > >> > > > 2) I am not sure that we should store offline replicas in znode >> > > > /brokers/ids/[brokerId]/failed-log-directory/[directory]. We >> probably >> > > > don't >> > > > want to expose log directory path in zookeeper based on the concept >> > that >> > > we >> > > > should only store logical information (e.g. topic, brokerId) in >> > > zookeeper's >> > > > path name. More specifically, we probably don't want to rename path >> in >> > > > zookeeper simply because user renamed a log director. And we >> probably >> > > don't >> > > > want to read/write these znode just because user manually moved >> > replicas >> > > > between log directories. >> > > > >> > > > 3) I couldn't find a good way to store offline replicas in ZK in the >> > > > alternative design. We can store this information one znode >> per-topic, >> > > > per-brokerId, or per-brokerId-topic. All these choices have their >> own >> > > > problems. 
If we store it in per-topic znode then multiple brokers >> may >> > > need >> > > > to read/write offline replicas in the same znode which is generally >> > bad. >> > > If >> > > > we store it per-brokerId then we effectively limit the maximum >> number >> > of >> > > > topic-partition that can be stored on a broker by the znode size >> limit. >> > > > This contradicts the idea to expand the single broker capacity by >> > > throwing >> > > > in more disks. If we store it per-brokerId-topic, then when >> controller >> > > > starts, it needs to read number of brokerId*topic znodes which may >> > double >> > > > the overall znode reads during controller startup. >> > > > >> > > > 4) The alternative design is less efficient than the current design >> in >> > > case >> > > > of log directory failure. The alternative design requires extra >> znode >> > > reads >> > > > in order to read offline replicas from zk while the current design >> > > requires >> > > > only one pair of LeaderAndIsrRequest and LeaderAndIsrResponse. The >> > extra >> > > > znode reads will be proportional to the number of topics on the >> broker >> > if >> > > > we store offline replicas per-brokerId-topic. >> > > > >> > > > 5) While I agree that the failure reporting should be done where the >> > > > failure is originated, I think the current design is consistent with >> > what >> > > > we are already doing. With the current design, broker will send >> > > > notification via zookeeper and controller will send >> LeaderAndIsrRequest >> > > to >> > > > broker. This is similar to how broker sends isr change notification >> and >> > > > controller read latest isr from broker. If we do want broker to >> report >> > > > failure directly to controller, we should probably have broker send >> RPC >> > > > directly to controller as it sends ControllerShutdownRequest. I can >> do >> > > this >> > > > as well. >> > > > >> > > > 6) I don't think the current design requires additional state >> > management >> > > in >> > > > each of the existing state handling such as topic creation or >> > controller >> > > > failover. All these existing logic should stay exactly the same >> except >> > > that >> > > > the controller should recognize offline replicas on the live broker >> > > instead >> > > > of assuming all replicas on live brokers are live. But this >> additional >> > > > change is required in both the current design and the alternate >> design. >> > > > Thus there should be no difference between current design and the >> > > alternate >> > > > design with respect to these existing state handling logic in >> > controller. >> > > > >> > > > 7) While I agree that the current design requires additional >> complexity >> > > in >> > > > the controller in order to handle LeaderAndIsrResponse and >> potentially >> > > > change partition and replica state to offline in the sate machines, >> I >> > > think >> > > > such logic is necessary in a well-designed controller either with >> the >> > > > alternate design or even without JBOD. Controller should be able to >> > > handle >> > > > error (e.g. ClusterAuthorizationException) in LeaderAndIsrResponse >> and >> > > > every responses in general. For example, if the controller hasn't >> > > received >> > > > LeaderAndIsrResponse after a given period if time, it probably means >> > the >> > > > broker has hang and the controller should consider this broker as >> > offline >> > > > and re-elect leader from other brokers. 
This would actually fix some >> > > > problem we have seen before at LinkedIn, where broker hangs due to >> > > > RAID-controller failure. In other words, I think it is a good idea >> for >> > > > controller to handle response. >> > > > >> > > > 8) I am not sure that the additional state management to handle >> > > > LeaderAndIsrResponse causes new types of synchronization. It is true >> > that >> > > > the logic is not handled ZK event handling >> > > > thread. But the existing ControllerShutdownRequest is also not >> handled >> > by >> > > > ZK event handling thread. The LeaderAndIsrReponse can be handled by >> the >> > > > same thread is that currently handling ControllerShutdownRequest so >> > that >> > > we >> > > > don't require new type of synchronization. Further, it should be >> rare >> > to >> > > > require additional synchronization to handle LeaderAndIsrReponse >> > because >> > > we >> > > > only need synchronization when there are offline replicas. >> > > > >> > > > >> > > > Thanks, >> > > > Dong >> > > > >> > > > On Wed, Feb 22, 2017 at 10:36 AM, Jun Rao <j...@confluent.io> wrote: >> > > > >> > > > > Hi, Dong, Jiangjie, >> > > > > >> > > > > 20. (1) I agree that ideally we'd like to use direct RPC for >> > > > > broker-to-broker communication instead of ZK. However, in the >> > > alternative >> > > > > design, the failed log directory path also serves as the >> persistent >> > > state >> > > > > for remembering the offline partitions. This is similar to the >> > > > > controller_managed_state path in your design. The difference is >> that >> > > the >> > > > > alternative design stores the state in fewer ZK paths, which helps >> > > reduce >> > > > > the controller failover time. (2) I agree that we want the >> controller >> > > to >> > > > be >> > > > > the single place to make decisions. However, intuitively, the >> failure >> > > > > reporting should be done where the failure is originated. For >> > example, >> > > > if a >> > > > > broker fails, the broker reports failure by de-registering from >> ZK. >> > The >> > > > > failed log directory path is similar in that regard. (3) I am not >> > > worried >> > > > > about the additional load from extra LeaderAndIsrRequest. What I >> > worry >> > > > > about is any unnecessary additional complexity in the controller. >> To >> > > me, >> > > > > the additional complexity in the current design is the additional >> > state >> > > > > management in each of the existing state handling (e.g., topic >> > > creation, >> > > > > controller failover, etc), and the additional synchronization >> since >> > the >> > > > > additional state management is not initiated from the ZK event >> > handling >> > > > > thread. >> > > > > >> > > > > 21. One of the reasons that we need to send a StopReplicaRequest >> to >> > > > offline >> > > > > replica is to handle controlled shutdown. In that case, a broker >> is >> > > still >> > > > > alive, but indicates to the controller that it plans to shut down. >> > > Being >> > > > > able to stop the replica in the shutting down broker reduces >> churns >> > in >> > > > ISR. >> > > > > So, for simplicity, it's probably easier to always send a >> > > > > StopReplicaRequest >> > > > > to any offline replica. >> > > > > >> > > > > Thanks, >> > > > > >> > > > > Jun >> > > > > >> > > > > >> > > > > On Tue, Feb 21, 2017 at 2:37 PM, Dong Lin <lindon...@gmail.com> >> > wrote: >> > > > > >> > > > > > Hey Jun, >> > > > > > >> > > > > > Thanks much for your comments. 
>> > > > > > >> > > > > > I actually proposed the design to store both offline replicas >> and >> > > > created >> > > > > > replicas in per-broker znode before switching to the design in >> the >> > > > > current >> > > > > > KIP. The current design stores created replicas in per-partition >> > > znode >> > > > > and >> > > > > > transmits offline replicas via LeaderAndIsrResponse. The >> original >> > > > > solution >> > > > > > is roughly the same as what you suggested. The advantage of the >> > > current >> > > > > > solution is kind of philosophical: 1) we want to transmit data >> > (e.g. >> > > > > > offline replicas) using RPC and reduce dependency on zookeeper; >> 2) >> > we >> > > > > want >> > > > > > controller to be the only one that determines any state (e.g. >> > offline >> > > > > > replicas) that will be exposed to user. The advantage of the >> > solution >> > > > to >> > > > > > store offline replica in zookeeper is that we can save one >> > roundtrip >> > > > time >> > > > > > for controller to handle log directory failure. However, this >> extra >> > > > > > roundtrip time should not be a big deal since the log directory >> > > failure >> > > > > is >> > > > > > rare and inefficiency of extra latency is less of a problem when >> > > there >> > > > is >> > > > > > log directory failure. >> > > > > > >> > > > > > Do you think the two philosophical advantages of the current KIP >> > make >> > > > > > sense? If not, then I can switch to the original design that >> stores >> > > > > offline >> > > > > > replicas in zookeeper. It is actually written already. One >> > > disadvantage >> > > > > is >> > > > > > that we have to make non-trivial change the KIP (e.g. no create >> > flag >> > > in >> > > > > > LeaderAndIsrRequest and no created flag zookeeper) and restart >> this >> > > KIP >> > > > > > discussion. >> > > > > > >> > > > > > Regarding 21, it seems to me that LeaderAndIsrRequest/ >> > > > StopReplicaRequest >> > > > > > only makes sense when broker can make the choice (e.g. fetch >> data >> > for >> > > > > this >> > > > > > replica or not). In the case that the log directory of the >> replica >> > is >> > > > > > already offline, broker have to stop fetching data for this >> replica >> > > > > > regardless of what controller tells it to do. Thus it seems >> cleaner >> > > for >> > > > > > broker to stop fetch data for this replica immediately. The >> > advantage >> > > > of >> > > > > > this solution is that the controller logic is simpler since it >> > > doesn't >> > > > > need >> > > > > > to send StopReplicaRequest in case of log directory failure, and >> > the >> > > > > log4j >> > > > > > log is also cleaner. Is there specific advantage of having >> > controller >> > > > > send >> > > > > > tells broker to stop fetching data for offline replicas? >> > > > > > >> > > > > > Regarding 22, I agree with your observation that it will >> happen. I >> > > will >> > > > > > update the KIP and specify that broker will exist with proper >> error >> > > > > message >> > > > > > in the log and user needs to manually remove partitions and >> restart >> > > the >> > > > > > broker. >> > > > > > >> > > > > > Thanks! >> > > > > > Dong >> > > > > > >> > > > > > >> > > > > > >> > > > > > On Mon, Feb 20, 2017 at 10:17 PM, Jun Rao <j...@confluent.io> >> > wrote: >> > > > > > >> > > > > > > Hi, Dong, >> > > > > > > >> > > > > > > Sorry for the delay. A few more comments. >> > > > > > > >> > > > > > > 20. 
One complexity that I found in the current KIP is that the >> > way >> > > > the >> > > > > > > broker communicates failed replicas to the controller is >> > > inefficient. >> > > > > > When >> > > > > > > a log directory fails, the broker only sends an indication >> > through >> > > ZK >> > > > > to >> > > > > > > the controller and the controller has to issue a >> > > LeaderAndIsrRequest >> > > > to >> > > > > > > discover which replicas are offline due to log directory >> failure. >> > > An >> > > > > > > alternative approach is that when a log directory fails, the >> > broker >> > > > > just >> > > > > > > writes the failed the directory and the corresponding topic >> > > > partitions >> > > > > > in a >> > > > > > > new failed log directory ZK path like the following. >> > > > > > > >> > > > > > > Failed log directory path: >> > > > > > > /brokers/ids/[brokerId]/failed-log-directory/directory1 => { >> > json >> > > of >> > > > > the >> > > > > > > topic partitions in the log directory }. >> > > > > > > >> > > > > > > The controller just watches for child changes in >> > > > > > > /brokers/ids/[brokerId]/failed-log-directory. >> > > > > > > After reading this path, the broker knows the exact set of >> > replicas >> > > > > that >> > > > > > > are offline and can trigger that replica state change >> > accordingly. >> > > > This >> > > > > > > saves an extra round of LeaderAndIsrRequest handling. >> > > > > > > >> > > > > > > With this new ZK path, we get probably get rid >> > > > > of/broker/topics/[topic]/ >> > > > > > > partitions/[partitionId]/controller_managed_state. The >> creation >> > > of a >> > > > > new >> > > > > > > replica is expected to always succeed unless all log >> directories >> > > > fail, >> > > > > in >> > > > > > > which case, the broker goes down anyway. Then, during >> controller >> > > > > > failover, >> > > > > > > the controller just needs to additionally read from ZK the >> extra >> > > > failed >> > > > > > log >> > > > > > > directory paths, which is many fewer than topics or >> partitions. >> > > > > > > >> > > > > > > On broker startup, if a log directory becomes available, the >> > > > > > corresponding >> > > > > > > log directory path in ZK will be removed. >> > > > > > > >> > > > > > > The downside of this approach is that the value of this new ZK >> > path >> > > > can >> > > > > > be >> > > > > > > large. However, even with 5K partition per log directory and >> 100 >> > > > bytes >> > > > > > per >> > > > > > > partition, the size of the value is 500KB, still less than the >> > > > default >> > > > > > 1MB >> > > > > > > znode limit in ZK. >> > > > > > > >> > > > > > > 21. "Broker will remove offline replica from its replica >> fetcher >> > > > > > threads." >> > > > > > > The proposal lets the broker remove the replica from the >> replica >> > > > > fetcher >> > > > > > > thread when it detects a directory failure. An alternative is >> to >> > > only >> > > > > do >> > > > > > > that until the broker receives the LeaderAndIsrRequest/ >> > > > > > StopReplicaRequest. >> > > > > > > The benefit of this is that the controller is the only one who >> > > > decides >> > > > > > > which replica to be removed from the replica fetcher threads. >> The >> > > > > broker >> > > > > > > also doesn't need additional logic to remove the replica from >> > > replica >> > > > > > > fetcher threads. 
The downside is that in a small window, the >> > > replica >> > > > > > fetch >> > > > > > > thread will keep writing to the failed log directory and may >> > > pollute >> > > > > the >> > > > > > > log4j log. >> > > > > > > >> > > > > > > 22. In the current design, there is a potential corner case >> issue >> > > > that >> > > > > > the >> > > > > > > same partition may exist in more than one log directory at >> some >> > > > point. >> > > > > > > Consider the following steps: (1) a new topic t1 is created >> and >> > the >> > > > > > > controller sends LeaderAndIsrRequest to a broker; (2) the >> broker >> > > > > creates >> > > > > > > partition t1-p1 in log dir1; (3) before the broker sends a >> > > response, >> > > > it >> > > > > > > goes down; (4) the broker is restarted with log dir1 >> unreadable; >> > > (5) >> > > > > the >> > > > > > > broker receives a new LeaderAndIsrRequest and creates >> partition >> > > t1-p1 >> > > > > on >> > > > > > > log dir2; (6) at some point, the broker is restarted with log >> > dir1 >> > > > > fixed. >> > > > > > > Now partition t1-p1 is in two log dirs. The alternative >> approach >> > > > that I >> > > > > > > suggested above may suffer from a similar corner case issue. >> > Since >> > > > this >> > > > > > is >> > > > > > > rare, if the broker detects this during broker startup, it can >> > > > probably >> > > > > > > just log an error and exit. The admin can remove the redundant >> > > > > partitions >> > > > > > > manually and then restart the broker. >> > > > > > > >> > > > > > > Thanks, >> > > > > > > >> > > > > > > Jun >> > > > > > > >> > > > > > > On Sat, Feb 18, 2017 at 9:31 PM, Dong Lin < >> lindon...@gmail.com> >> > > > wrote: >> > > > > > > >> > > > > > > > Hey Jun, >> > > > > > > > >> > > > > > > > Could you please let me know if the solutions above could >> > address >> > > > > your >> > > > > > > > concern? I really want to move the discussion forward. >> > > > > > > > >> > > > > > > > Thanks, >> > > > > > > > Dong >> > > > > > > > >> > > > > > > > >> > > > > > > > On Tue, Feb 14, 2017 at 8:17 PM, Dong Lin < >> lindon...@gmail.com >> > > >> > > > > wrote: >> > > > > > > > >> > > > > > > > > Hey Jun, >> > > > > > > > > >> > > > > > > > > Thanks for all your help and time to discuss this KIP. >> When >> > you >> > > > get >> > > > > > the >> > > > > > > > > time, could you let me know if the previous answers >> address >> > the >> > > > > > > concern? >> > > > > > > > > >> > > > > > > > > I think the more interesting question in your last email >> is >> > > where >> > > > > we >> > > > > > > > > should store the "created" flag in ZK. I proposed the >> > solution >> > > > > that I >> > > > > > > > like >> > > > > > > > > most, i.e. store it together with the replica assignment >> data >> > > in >> > > > > the >> > > > > > > > /brokers/topics/[topic]. >> > > > > > > > > In order to expedite discussion, let me provide another >> two >> > > ideas >> > > > > to >> > > > > > > > > address the concern just in case the first idea doesn't >> work: >> > > > > > > > > >> > > > > > > > > - We can avoid extra controller ZK read when there is no >> disk >> > > > > failure >> > > > > > > > > (95% of time?). When controller starts, it doesn't >> > > > > > > > > read controller_managed_state in ZK and sends >> > > LeaderAndIsrRequest >> > > > > > with >> > > > > > > > > "create = false". 
Only if LeaderAndIsrResponse shows >> failure >> > > for >> > > > > any >> > > > > > > > > replica, then controller will read >> controller_managed_state >> > for >> > > > > this >> > > > > > > > > partition and re-send LeaderAndIsrRequset with >> "create=true" >> > if >> > > > > this >> > > > > > > > > replica has not been created. >> > > > > > > > > >> > > > > > > > > - We can significantly reduce this ZK read time by making >> > > > > > > > > controller_managed_state a topic level information in ZK, >> > e.g. >> > > > > > > > > /brokers/topics/[topic]/state. Given that most topic has >> 10+ >> > > > > > partition, >> > > > > > > > > the extra ZK read time should be less than 10% of the >> > existing >> > > > > total >> > > > > > zk >> > > > > > > > > read time during controller failover. >> > > > > > > > > >> > > > > > > > > Thanks! >> > > > > > > > > Dong >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > On Tue, Feb 14, 2017 at 7:30 AM, Dong Lin < >> > lindon...@gmail.com >> > > > >> > > > > > wrote: >> > > > > > > > > >> > > > > > > > >> Hey Jun, >> > > > > > > > >> >> > > > > > > > >> I just realized that you may be suggesting that a tool >> for >> > > > listing >> > > > > > > > >> offline directories is necessary for KIP-112 by asking >> > whether >> > > > > > KIP-112 >> > > > > > > > and >> > > > > > > > >> KIP-113 will be in the same release. I think such a tool >> is >> > > > useful >> > > > > > but >> > > > > > > > >> doesn't have to be included in KIP-112. This is because >> as >> > of >> > > > now >> > > > > > > admin >> > > > > > > > >> needs to log into broker machine and check broker log to >> > > figure >> > > > > out >> > > > > > > the >> > > > > > > > >> cause of broker failure and the bad log directory in >> case of >> > > > disk >> > > > > > > > failure. >> > > > > > > > >> The KIP-112 won't make it harder since admin can still >> > figure >> > > > out >> > > > > > the >> > > > > > > > bad >> > > > > > > > >> log directory by doing the same thing. Thus it is >> probably >> > OK >> > > to >> > > > > > just >> > > > > > > > >> include this script in KIP-113. Regardless, my hope is to >> > > finish >> > > > > > both >> > > > > > > > KIPs >> > > > > > > > >> ASAP and make them in the same release since both KIPs >> are >> > > > needed >> > > > > > for >> > > > > > > > the >> > > > > > > > >> JBOD setup. >> > > > > > > > >> >> > > > > > > > >> Thanks, >> > > > > > > > >> Dong >> > > > > > > > >> >> > > > > > > > >> On Mon, Feb 13, 2017 at 5:52 PM, Dong Lin < >> > > lindon...@gmail.com> >> > > > > > > wrote: >> > > > > > > > >> >> > > > > > > > >>> And the test plan has also been updated to simulate disk >> > > > failure >> > > > > by >> > > > > > > > >>> changing log directory permission to 000. >> > > > > > > > >>> >> > > > > > > > >>> On Mon, Feb 13, 2017 at 5:50 PM, Dong Lin < >> > > lindon...@gmail.com >> > > > > >> > > > > > > wrote: >> > > > > > > > >>> >> > > > > > > > >>>> Hi Jun, >> > > > > > > > >>>> >> > > > > > > > >>>> Thanks for the reply. These comments are very helpful. >> Let >> > > me >> > > > > > answer >> > > > > > > > >>>> them inline. >> > > > > > > > >>>> >> > > > > > > > >>>> >> > > > > > > > >>>> On Mon, Feb 13, 2017 at 3:25 PM, Jun Rao < >> > j...@confluent.io> >> > > > > > wrote: >> > > > > > > > >>>> >> > > > > > > > >>>>> Hi, Dong, >> > > > > > > > >>>>> >> > > > > > > > >>>>> Thanks for the reply. A few more replies and new >> comments >> > > > > below. 
>> > > > > > > > >>>>> >> > > > > > > > >>>>> On Fri, Feb 10, 2017 at 4:27 PM, Dong Lin < >> > > > lindon...@gmail.com >> > > > > > >> > > > > > > > wrote: >> > > > > > > > >>>>> >> > > > > > > > >>>>> > Hi Jun, >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > Thanks for the detailed comments. Please see answers >> > > > inline: >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > On Fri, Feb 10, 2017 at 3:08 PM, Jun Rao < >> > > j...@confluent.io >> > > > > >> > > > > > > wrote: >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > > Hi, Dong, >> > > > > > > > >>>>> > > >> > > > > > > > >>>>> > > Thanks for the updated wiki. A few comments below. >> > > > > > > > >>>>> > > >> > > > > > > > >>>>> > > 1. Topics get created >> > > > > > > > >>>>> > > 1.1 Instead of storing successfully created >> replicas >> > in >> > > > ZK, >> > > > > > > could >> > > > > > > > >>>>> we >> > > > > > > > >>>>> > store >> > > > > > > > >>>>> > > unsuccessfully created replicas in ZK? Since the >> > latter >> > > > is >> > > > > > less >> > > > > > > > >>>>> common, >> > > > > > > > >>>>> > it >> > > > > > > > >>>>> > > probably reduces the load on ZK. >> > > > > > > > >>>>> > > >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > We can store unsuccessfully created replicas in ZK. >> > But I >> > > > am >> > > > > > not >> > > > > > > > >>>>> sure if >> > > > > > > > >>>>> > that can reduce write load on ZK. >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > If we want to reduce write load on ZK using by store >> > > > > > > unsuccessfully >> > > > > > > > >>>>> created >> > > > > > > > >>>>> > replicas in ZK, then broker should not write to ZK >> if >> > all >> > > > > > > replicas >> > > > > > > > >>>>> are >> > > > > > > > >>>>> > successfully created. It means that if >> > > > > > > > /broker/topics/[topic]/partiti >> > > > > > > > >>>>> > ons/[partitionId]/controller_managed_state doesn't >> > exist >> > > > in >> > > > > ZK >> > > > > > > for >> > > > > > > > >>>>> a given >> > > > > > > > >>>>> > partition, we have to assume all replicas of this >> > > partition >> > > > > > have >> > > > > > > > been >> > > > > > > > >>>>> > successfully created and send LeaderAndIsrRequest >> with >> > > > > create = >> > > > > > > > >>>>> false. This >> > > > > > > > >>>>> > becomes a problem if controller crashes before >> > receiving >> > > > > > > > >>>>> > LeaderAndIsrResponse to validate whether a replica >> has >> > > been >> > > > > > > > created. >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > I think this approach and reduce the number of bytes >> > > stored >> > > > > in >> > > > > > > ZK. >> > > > > > > > >>>>> But I am >> > > > > > > > >>>>> > not sure if this is a concern. >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > >> > > > > > > > >>>>> I was mostly concerned about the controller failover >> > time. >> > > > > > > Currently, >> > > > > > > > >>>>> the >> > > > > > > > >>>>> controller failover is likely dominated by the cost of >> > > > reading >> > > > > > > > >>>>> topic/partition level information from ZK. If we add >> > > another >> > > > > > > > partition >> > > > > > > > >>>>> level path in ZK, it probably will double the >> controller >> > > > > failover >> > > > > > > > >>>>> time. If >> > > > > > > > >>>>> the approach of representing the non-created replicas >> > > doesn't >> > > > > > work, >> > > > > > > > >>>>> have >> > > > > > > > >>>>> you considered just adding the created flag in the >> > > > leaderAndIsr >> > > > > > > path >> > > > > > > > >>>>> in ZK? 
>> > > > > > > > >>>>> >> > > > > > > > >>>>> >> > > > > > > > >>>> Yes, I have considered adding the created flag in the >> > > > > leaderAndIsr >> > > > > > > > path >> > > > > > > > >>>> in ZK. If we were to add created flag per replica in >> the >> > > > > > > > >>>> LeaderAndIsrRequest, then it requires a lot of change >> in >> > the >> > > > > code >> > > > > > > > base. >> > > > > > > > >>>> >> > > > > > > > >>>> If we don't add created flag per replica in the >> > > > > > LeaderAndIsrRequest, >> > > > > > > > >>>> then the information in leaderAndIsr path in ZK and >> > > > > > > > LeaderAndIsrRequest >> > > > > > > > >>>> would be different. Further, the procedure for broker >> to >> > > > update >> > > > > > ISR >> > > > > > > > in ZK >> > > > > > > > >>>> will be a bit complicated. When leader updates >> > leaderAndIsr >> > > > path >> > > > > > in >> > > > > > > > ZK, it >> > > > > > > > >>>> will have to first read created flags from ZK, change >> isr, >> > > and >> > > > > > write >> > > > > > > > >>>> leaderAndIsr back to ZK. And it needs to check znode >> > version >> > > > and >> > > > > > > > re-try >> > > > > > > > >>>> write operation in ZK if controller has updated ZK >> during >> > > this >> > > > > > > > period. This >> > > > > > > > >>>> is in contrast to the current implementation where the >> > > leader >> > > > > > either >> > > > > > > > gets >> > > > > > > > >>>> all the information from LeaderAndIsrRequest sent by >> > > > controller, >> > > > > > or >> > > > > > > > >>>> determine the infromation by itself (e.g. ISR), before >> > > writing >> > > > > to >> > > > > > > > >>>> leaderAndIsr path in ZK. >> > > > > > > > >>>> >> > > > > > > > >>>> It seems to me that the above solution is a bit >> > complicated >> > > > and >> > > > > > not >> > > > > > > > >>>> clean. Thus I come up with the design in this KIP to >> store >> > > > this >> > > > > > > > created >> > > > > > > > >>>> flag in a separate zk path. The path is named >> > > > > > > > controller_managed_state to >> > > > > > > > >>>> indicate that we can store in this znode all >> information >> > > that >> > > > is >> > > > > > > > managed by >> > > > > > > > >>>> controller only, as opposed to ISR. >> > > > > > > > >>>> >> > > > > > > > >>>> I agree with your concern of increased ZK read time >> during >> > > > > > > controller >> > > > > > > > >>>> failover. How about we store the "created" information >> in >> > > the >> > > > > > > > >>>> znode /brokers/topics/[topic]? We can change that >> znode to >> > > > have >> > > > > > the >> > > > > > > > >>>> following data format: >> > > > > > > > >>>> >> > > > > > > > >>>> { >> > > > > > > > >>>> "version" : 2, >> > > > > > > > >>>> "created" : { >> > > > > > > > >>>> "1" : [1, 2, 3], >> > > > > > > > >>>> ... >> > > > > > > > >>>> } >> > > > > > > > >>>> "partition" : { >> > > > > > > > >>>> "1" : [1, 2, 3], >> > > > > > > > >>>> ... >> > > > > > > > >>>> } >> > > > > > > > >>>> } >> > > > > > > > >>>> >> > > > > > > > >>>> We won't have extra zk read using this solution. It >> also >> > > seems >> > > > > > > > >>>> reasonable to put the partition assignment information >> > > > together >> > > > > > with >> > > > > > > > >>>> replica creation information. The latter is only >> changed >> > > once >> > > > > > after >> > > > > > > > the >> > > > > > > > >>>> partition is created or re-assigned. 
>> > > > > > > > >>>> >> > > > > > > > >>>> >> > > > > > > > >>>>> >> > > > > > > > >>>>> >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > > 1.2 If an error is received for a follower, does >> the >> > > > > > controller >> > > > > > > > >>>>> eagerly >> > > > > > > > >>>>> > > remove it from ISR or do we just let the leader >> > removes >> > > > it >> > > > > > > after >> > > > > > > > >>>>> timeout? >> > > > > > > > >>>>> > > >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > No, Controller will not actively remove it from ISR. >> > But >> > > > > > > controller >> > > > > > > > >>>>> will >> > > > > > > > >>>>> > recognize it as offline replica and propagate this >> > > > > information >> > > > > > to >> > > > > > > > all >> > > > > > > > >>>>> > brokers via UpdateMetadataRequest. Each leader can >> use >> > > this >> > > > > > > > >>>>> information to >> > > > > > > > >>>>> > actively remove offline replica from ISR set. I have >> > > > updated >> > > > > to >> > > > > > > > wiki >> > > > > > > > >>>>> to >> > > > > > > > >>>>> > clarify it. >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > >> > > > > > > > >>>>> >> > > > > > > > >>>>> That seems inconsistent with how the controller deals >> > with >> > > > > > offline >> > > > > > > > >>>>> replicas >> > > > > > > > >>>>> due to broker failures. When that happens, the broker >> > will >> > > > (1) >> > > > > > > select >> > > > > > > > >>>>> a new >> > > > > > > > >>>>> leader if the offline replica is the leader; (2) >> remove >> > the >> > > > > > replica >> > > > > > > > >>>>> from >> > > > > > > > >>>>> ISR if the offline replica is the follower. So, >> > > intuitively, >> > > > it >> > > > > > > seems >> > > > > > > > >>>>> that >> > > > > > > > >>>>> we should be doing the same thing when dealing with >> > offline >> > > > > > > replicas >> > > > > > > > >>>>> due to >> > > > > > > > >>>>> disk failure. >> > > > > > > > >>>>> >> > > > > > > > >>>> >> > > > > > > > >>>> My bad. I misunderstand how the controller currently >> > handles >> > > > > > broker >> > > > > > > > >>>> failure and ISR change. Yes we should do the same thing >> > when >> > > > > > dealing >> > > > > > > > with >> > > > > > > > >>>> offline replicas here. I have updated the KIP to >> specify >> > > that, >> > > > > > when >> > > > > > > an >> > > > > > > > >>>> offline replica is discovered by controller, the >> > controller >> > > > > > removes >> > > > > > > > offline >> > > > > > > > >>>> replicas from ISR in the ZK and sends >> LeaderAndIsrRequest >> > > with >> > > > > > > > updated ISR >> > > > > > > > >>>> to be used by partition leaders. >> > > > > > > > >>>> >> > > > > > > > >>>> >> > > > > > > > >>>>> >> > > > > > > > >>>>> >> > > > > > > > >>>>> >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > > 1.3 Similar, if an error is received for a leader, >> > > should >> > > > > the >> > > > > > > > >>>>> controller >> > > > > > > > >>>>> > > trigger leader election again? >> > > > > > > > >>>>> > > >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > Yes, controller will trigger leader election if >> leader >> > > > > replica >> > > > > > is >> > > > > > > > >>>>> offline. >> > > > > > > > >>>>> > I have updated the wiki to clarify it. >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > > >> > > > > > > > >>>>> > > 2. 
A log directory stops working on a broker >> during >> > > > > runtime: >> > > > > > > > >>>>> > > 2.1 It seems the broker remembers the failed >> > directory >> > > > > after >> > > > > > > > >>>>> hitting an >> > > > > > > > >>>>> > > IOException and the failed directory won't be used >> > for >> > > > > > creating >> > > > > > > > new >> > > > > > > > >>>>> > > partitions until the broker is restarted? If so, >> > could >> > > > you >> > > > > > add >> > > > > > > > >>>>> that to >> > > > > > > > >>>>> > the >> > > > > > > > >>>>> > > wiki. >> > > > > > > > >>>>> > > >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > Right, broker assumes a log directory to be good >> after >> > it >> > > > > > starts, >> > > > > > > > >>>>> and mark >> > > > > > > > >>>>> > log directory as bad once there is IOException when >> > > broker >> > > > > > > attempts >> > > > > > > > >>>>> to >> > > > > > > > >>>>> > access the log directory. New replicas will only be >> > > created >> > > > > on >> > > > > > > good >> > > > > > > > >>>>> log >> > > > > > > > >>>>> > directory. I just added this to the KIP. >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > > 2.2 Could you be a bit more specific on how and >> > during >> > > > > which >> > > > > > > > >>>>> operation >> > > > > > > > >>>>> > the >> > > > > > > > >>>>> > > broker detects directory failure? Is it when the >> > broker >> > > > > hits >> > > > > > an >> > > > > > > > >>>>> > IOException >> > > > > > > > >>>>> > > during writes, or both reads and writes? For >> > example, >> > > > > during >> > > > > > > > >>>>> broker >> > > > > > > > >>>>> > > startup, it only reads from each of the log >> > > directories, >> > > > if >> > > > > > it >> > > > > > > > >>>>> hits an >> > > > > > > > >>>>> > > IOException there, does the broker immediately >> mark >> > the >> > > > > > > directory >> > > > > > > > >>>>> as >> > > > > > > > >>>>> > > offline? >> > > > > > > > >>>>> > > >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > Broker marks log directory as bad once there is >> > > IOException >> > > > > > when >> > > > > > > > >>>>> broker >> > > > > > > > >>>>> > attempts to access the log directory. This includes >> > read >> > > > and >> > > > > > > write. >> > > > > > > > >>>>> These >> > > > > > > > >>>>> > operations include log append, log read, log >> cleaning, >> > > > > > watermark >> > > > > > > > >>>>> checkpoint >> > > > > > > > >>>>> > etc. If broker hits IOException when it reads from >> each >> > > of >> > > > > the >> > > > > > > log >> > > > > > > > >>>>> > directory during startup, it immediately mark the >> > > directory >> > > > > as >> > > > > > > > >>>>> offline. >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > I just updated the KIP to clarify it. >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > > 3. Partition reassignment: If we know a replica is >> > > > offline, >> > > > > > do >> > > > > > > we >> > > > > > > > >>>>> still >> > > > > > > > >>>>> > > want to send StopReplicaRequest to it? >> > > > > > > > >>>>> > > >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > No, controller doesn't send StopReplicaRequest for >> an >> > > > offline >> > > > > > > > >>>>> replica. >> > > > > > > > >>>>> > Controller treats this scenario in the same way that >> > > > exiting >> > > > > > > Kafka >> > > > > > > > >>>>> > implementation does when the broker of this replica >> is >> > > > > offline. 
>> > > > > > > > >>>>> > >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > > >> > > > > > > > >>>>> > > 4. UpdateMetadataRequestPartitionState: For >> > > > > > offline_replicas, >> > > > > > > do >> > > > > > > > >>>>> they >> > > > > > > > >>>>> > only >> > > > > > > > >>>>> > > include offline replicas due to log directory >> > failures >> > > or >> > > > > do >> > > > > > > they >> > > > > > > > >>>>> also >> > > > > > > > >>>>> > > include offline replicas due to broker failure? >> > > > > > > > >>>>> > > >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > UpdateMetadataRequestPartitionState's >> offline_replicas >> > > > > include >> > > > > > > > >>>>> offline >> > > > > > > > >>>>> > replicas due to both log directory failure and >> broker >> > > > > failure. >> > > > > > > This >> > > > > > > > >>>>> is to >> > > > > > > > >>>>> > make the semantics of this field easier to >> understand. >> > > > Broker >> > > > > > can >> > > > > > > > >>>>> > distinguish whether a replica is offline due to >> broker >> > > > > failure >> > > > > > or >> > > > > > > > >>>>> disk >> > > > > > > > >>>>> > failure by checking whether a broker is live in the >> > > > > > > > >>>>> UpdateMetadataRequest. >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > > >> > > > > > > > >>>>> > > 5. Tools: Could we add some kind of support in the >> > tool >> > > > to >> > > > > > list >> > > > > > > > >>>>> offline >> > > > > > > > >>>>> > > directories? >> > > > > > > > >>>>> > > >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > In KIP-112 we don't have tools to list offline >> > > directories >> > > > > > > because >> > > > > > > > >>>>> we have >> > > > > > > > >>>>> > intentionally avoided exposing log directory >> > information >> > > > > (e.g. >> > > > > > > log >> > > > > > > > >>>>> > directory path) to user or other brokers. I think we >> > can >> > > > add >> > > > > > this >> > > > > > > > >>>>> feature >> > > > > > > > >>>>> > in KIP-113, in which we will have >> DescribeDirsRequest >> > to >> > > > list >> > > > > > log >> > > > > > > > >>>>> directory >> > > > > > > > >>>>> > information (e.g. partition assignment, path, size) >> > > needed >> > > > > for >> > > > > > > > >>>>> rebalance. >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > >> > > > > > > > >>>>> Since we are introducing a new failure mode, if a >> replica >> > > > > becomes >> > > > > > > > >>>>> offline >> > > > > > > > >>>>> due to failure in log directories, the first thing an >> > admin >> > > > > wants >> > > > > > > to >> > > > > > > > >>>>> know >> > > > > > > > >>>>> is which log directories are offline from the broker's >> > > > > > perspective. >> > > > > > > > >>>>> So, >> > > > > > > > >>>>> including such a tool will be useful. Do you plan to >> do >> > > > KIP-112 >> > > > > > and >> > > > > > > > >>>>> KIP-113 >> > > > > > > > >>>>> in the same release? >> > > > > > > > >>>>> >> > > > > > > > >>>>> >> > > > > > > > >>>> Yes, I agree that including such a tool is using. This >> is >> > > > > probably >> > > > > > > > >>>> better to be added in KIP-113 because we need >> > > > > DescribeDirsRequest >> > > > > > to >> > > > > > > > get >> > > > > > > > >>>> this information. I will update KIP-113 to include this >> > > tool. >> > > > > > > > >>>> >> > > > > > > > >>>> I plan to do KIP-112 and KIP-113 separately to make >> each >> > KIP >> > > > and >> > > > > > > their >> > > > > > > > >>>> patch easier to review. 
I don't have any plan about >> which >> > > > > release >> > > > > > to >> > > > > > > > have >> > > > > > > > >>>> these KIPs. My plan is to both of them ASAP. Is there >> > > > particular >> > > > > > > > timeline >> > > > > > > > >>>> you prefer for code of these two KIPs to checked-in? >> > > > > > > > >>>> >> > > > > > > > >>>> >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > > >> > > > > > > > >>>>> > > 6. Metrics: Could we add some metrics to show >> offline >> > > > > > > > directories? >> > > > > > > > >>>>> > > >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > Sure. I think it makes sense to have each broker >> report >> > > its >> > > > > > > number >> > > > > > > > of >> > > > > > > > >>>>> > offline replicas and offline log directories. The >> > > previous >> > > > > > metric >> > > > > > > > >>>>> was put >> > > > > > > > >>>>> > in KIP-113. I just added both metrics in KIP-112. >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > > >> > > > > > > > >>>>> > > 7. There are still references to >> kafka-log-dirs.sh. >> > Are >> > > > > they >> > > > > > > > valid? >> > > > > > > > >>>>> > > >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > My bad. I just removed this from "Changes in >> > Operational >> > > > > > > > Procedures" >> > > > > > > > >>>>> and >> > > > > > > > >>>>> > "Test Plan" in the KIP. >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > > >> > > > > > > > >>>>> > > 8. Do you think KIP-113 is ready for review? One >> > thing >> > > > that >> > > > > > > > KIP-113 >> > > > > > > > >>>>> > > mentions during partition reassignment is to first >> > send >> > > > > > > > >>>>> > > LeaderAndIsrRequest, followed by >> > > ChangeReplicaDirRequest. >> > > > > It >> > > > > > > > seems >> > > > > > > > >>>>> it's >> > > > > > > > >>>>> > > better if the replicas are created in the right >> log >> > > > > directory >> > > > > > > in >> > > > > > > > >>>>> the >> > > > > > > > >>>>> > first >> > > > > > > > >>>>> > > place? The reason that I brought it up here is >> > because >> > > it >> > > > > may >> > > > > > > > >>>>> affect the >> > > > > > > > >>>>> > > protocol of LeaderAndIsrRequest. >> > > > > > > > >>>>> > > >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > Yes, KIP-113 is ready for review. The advantage of >> the >> > > > > current >> > > > > > > > >>>>> design is >> > > > > > > > >>>>> > that we can keep LeaderAndIsrRequest >> > > > log-direcotry-agnostic. >> > > > > > The >> > > > > > > > >>>>> > implementation would be much easier to read if all >> log >> > > > > related >> > > > > > > > logic >> > > > > > > > >>>>> (e.g. >> > > > > > > > >>>>> > various errors) are put in ChangeReplicadIRrequest >> and >> > > the >> > > > > code >> > > > > > > > path >> > > > > > > > >>>>> of >> > > > > > > > >>>>> > handling replica movement is separated from >> leadership >> > > > > > handling. >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > In other words, I think Kafka may be easier to >> develop >> > in >> > > > the >> > > > > > > long >> > > > > > > > >>>>> term if >> > > > > > > > >>>>> > we separate these two requests. >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > I agree that ideally we want to create replicas in >> the >> > > > right >> > > > > > log >> > > > > > > > >>>>> directory >> > > > > > > > >>>>> > in the first place. 
But I am not sure if there is >> any >> > > > > > performance >> > > > > > > > or >> > > > > > > > >>>>> > correctness concern with the existing way of moving >> it >> > > > after >> > > > > it >> > > > > > > is >> > > > > > > > >>>>> created. >> > > > > > > > >>>>> > Besides, does this decision affect the change >> proposed >> > in >> > > > > > > KIP-112? >> > > > > > > > >>>>> > >> > > > > > > > >>>>> > >> > > > > > > > >>>>> I am just wondering if you have considered including >> the >> > > log >> > > > > > > > directory >> > > > > > > > >>>>> for >> > > > > > > > >>>>> the replicas in the LeaderAndIsrRequest. >> > > > > > > > >>>>> >> > > > > > > > >>>>> >> > > > > > > > >>>> Yeah I have thought about this idea, but only briefly. >> I >> > > > > rejected >> > > > > > > this >> > > > > > > > >>>> idea because log directory is broker's local >> information >> > > and I >> > > > > > > prefer >> > > > > > > > not >> > > > > > > > >>>> to expose local config information to the cluster >> through >> > > > > > > > >>>> LeaderAndIsrRequest. >> > > > > > > > >>>> >> > > > > > > > >>>> >> > > > > > > > >>>>> 9. Could you describe when the offline replicas due to >> > log >> > > > > > > directory >> > > > > > > > >>>>> failure are removed from the replica fetch threads? >> > > > > > > > >>>>> >> > > > > > > > >>>> >> > > > > > > > >>>> Yes. If the offline replica was a leader, either a new >> > > leader >> > > > is >> > > > > > > > >>>> elected or all follower brokers will stop fetching for >> > this >> > > > > > > > partition. If >> > > > > > > > >>>> the offline replica is a follower, the broker will stop >> > > > fetching >> > > > > > for >> > > > > > > > this >> > > > > > > > >>>> replica immediately. A broker stops fetching data for a >> > > > replica >> > > > > by >> > > > > > > > removing >> > > > > > > > >>>> the replica from the replica fetch threads. I have >> updated >> > > the >> > > > > KIP >> > > > > > > to >> > > > > > > > >>>> clarify it. >> > > > > > > > >>>> >> > > > > > > > >>>> >> > > > > > > > >>>>> >> > > > > > > > >>>>> 10. The wiki mentioned changing the log directory to a >> > file >> > > > for >> > > > > > > > >>>>> simulating >> > > > > > > > >>>>> disk failure in system tests. Could we just change the >> > > > > permission >> > > > > > > of >> > > > > > > > >>>>> the >> > > > > > > > >>>>> log directory to 000 to simulate that? >> > > > > > > > >>>>> >> > > > > > > > >>>> >> > > > > > > > >>>> >> > > > > > > > >>>> Sure, >> > > > > > > > >>>> >> > > > > > > > >>>> >> > > > > > > > >>>>> >> > > > > > > > >>>>> Thanks, >> > > > > > > > >>>>> >> > > > > > > > >>>>> Jun >> > > > > > > > >>>>> >> > > > > > > > >>>>> >> > > > > > > > >>>>> > > Jun >> > > > > > > > >>>>> > > >> > > > > > > > >>>>> > > On Fri, Feb 10, 2017 at 9:53 AM, Dong Lin < >> > > > > > lindon...@gmail.com >> > > > > > > > >> > > > > > > > >>>>> wrote: >> > > > > > > > >>>>> > > >> > > > > > > > >>>>> > > > Hi Jun, >> > > > > > > > >>>>> > > > >> > > > > > > > >>>>> > > > Can I replace zookeeper access with direct RPC >> for >> > > both >> > > > > ISR >> > > > > > > > >>>>> > notification >> > > > > > > > >>>>> > > > and disk failure notification in a future KIP, >> or >> > do >> > > > you >> > > > > > feel >> > > > > > > > we >> > > > > > > > >>>>> should >> > > > > > > > >>>>> > > do >> > > > > > > > >>>>> > > > it in this KIP? 
On Fri, Feb 10, 2017 at 9:53 AM, Dong Lin <lindon...@gmail.com> wrote:

Hi Jun,

Can I replace zookeeper access with direct RPC for both ISR notification and disk failure notification in a future KIP, or do you feel we should do it in this KIP?

Hi Eno, Grant and everyone,

Is there further improvement you would like to see with this KIP?

Thank you all for the comments,
Dong

On Thu, Feb 9, 2017 at 4:45 PM, Dong Lin <lindon...@gmail.com> wrote:

On Thu, Feb 9, 2017 at 3:37 PM, Colin McCabe <cmcc...@apache.org> wrote:

On Thu, Feb 9, 2017, at 11:40, Dong Lin wrote:
> Thanks for all the comments Colin!
>
> To answer your questions:
> - Yes, a broker will shutdown if all its log directories are bad.

Colin: That makes sense. Can you add this to the writeup?

Dong: Sure. This has already been added. You can find it here
<https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=67638402&selectedPageVersions=9&selectedPageVersions=10>.

> - I updated the KIP to explicitly state that a log directory will be assumed to be good until the broker sees an IOException when it tries to access the log directory.

Colin: Thanks.

> - Controller doesn't explicitly know whether there is a new log directory or not. All the controller knows is whether replicas are online or offline based on LeaderAndIsrResponse. According to the existing Kafka implementation, the controller will always send LeaderAndIsrRequest to a broker after it bounces.

Colin: I thought so. It's good to clarify, though. Do you think it's worth adding a quick discussion of this on the wiki?
Dong: Personally I don't think it is needed. If a broker starts with no bad log directory, everything should work as is and we should not need to clarify it. The KIP has already covered the scenario when a broker starts with a bad log directory. Also, the KIP doesn't claim or hint that we support dynamic addition of new log directories. I think we are good.

best,
Colin

> Please see this
> <https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=67638402&selectedPageVersions=9&selectedPageVersions=10>
> for the change of the KIP.

On Thu, Feb 9, 2017 at 11:04 AM, Colin McCabe <cmcc...@apache.org> wrote:

On Thu, Feb 9, 2017, at 11:03, Colin McCabe wrote:
> Thanks, Dong L.
>
> Do we plan on bringing down the broker process when all log directories are offline?
>
> Can you explicitly state on the KIP that the log dirs are all considered good after the broker process is bounced? It seems like an important thing to be clear about. Also, perhaps discuss how the controller becomes aware of the newly good log directories after a broker bounce (and whether this triggers re-election).
I meant to write, all the log dirs where the broker can still read the index and some other files. Clearly, log dirs that are completely inaccessible will still be considered bad after a broker process bounce.

best,
Colin

> +1 (non-binding) aside from that

On Wed, Feb 8, 2017, at 00:47, Dong Lin wrote:

Hi all,

Thank you all for the helpful suggestions. I have updated the KIP to address the comments received so far. See here
<https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=67638402&selectedPageVersions=8&selectedPageVersions=9>
to read the changes of the KIP. Here is a summary of the changes:

- Updated the Proposed Change section to change the recovery steps. After this change, the broker will also create a replica as long as all log directories are working.
- Removed kafka-log-dirs.sh from this KIP since the user no longer needs to use it for recovery from bad disks.
- Explained how the znode controller_managed_state is managed in the Public Interfaces section.
- Explained what happens during controller failover, partition reassignment and topic deletion in the Proposed Change section.
- Updated the Future Work section to include the following potential improvements:
  - Let the broker notify the controller of ISR change and disk state change via RPC instead of using zookeeper.
  - Handle various failure scenarios (e.g. slow disk) on a case-by-case basis. For example, we may want to detect a slow disk and consider it as offline.
  - Allow the admin to mark a directory as bad so that it will not be used.

Thanks,
Dong

On Tue, Feb 7, 2017 at 5:23 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Eno,

Thanks much for the comment!

I still think the complexity added to Kafka is justified by its benefit. Let me provide my reasons below.

1) The additional logic is easy to understand and thus its complexity should be reasonable.

On the broker side, it needs to catch the exception when accessing a log directory, mark the log directory and all its replicas as offline, notify the controller by writing to the zookeeper notification path, and specify the error in LeaderAndIsrResponse. On the controller side, it will listen to zookeeper for the disk failure notification, learn about offline replicas from the LeaderAndIsrResponse, and take offline replicas into consideration when electing leaders. It also marks replicas as created in zookeeper and uses that to determine whether a replica has been created.

That is all the logic we need to add in Kafka. I personally feel this is easy to reason about.
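[Editor's illustration of the broker-side sequence described above. This is a self-contained sketch only: the class and method names are invented, not the KIP's actual classes, and the controller notification is abstracted behind a callback that, per the KIP, would write to a zookeeper notification path.]

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import java.util.function.Consumer;

    public class LogDirFailureHandler {
        private final Map<String, Set<String>> replicasByLogDir = new HashMap<>();
        private final Set<String> offlineLogDirs = new HashSet<>();
        private final Set<String> offlineReplicas = new HashSet<>();
        private final Consumer<String> controllerNotifier; // e.g. writes the zookeeper notification znode

        public LogDirFailureHandler(Consumer<String> controllerNotifier) {
            this.controllerNotifier = controllerNotifier;
        }

        public void addReplica(String logDir, String topicPartition) {
            replicasByLogDir.computeIfAbsent(logDir, d -> new HashSet<>()).add(topicPartition);
        }

        // Called when an access to logDir throws an IOException.
        public void handleLogDirFailure(String logDir) {
            if (!offlineLogDirs.add(logDir)) {
                return; // this directory has already been marked offline
            }
            // 1. Mark the directory and every replica hosted on it as offline.
            offlineReplicas.addAll(replicasByLogDir.getOrDefault(logDir, Collections.emptySet()));
            // 2. Notify the controller (in the KIP, by writing to the zookeeper notification path).
            controllerNotifier.accept(logDir);
            // 3. Later LeaderAndIsrRequests covering these replicas are answered with an error
            //    for the offline replicas, so the controller also learns about them from the
            //    LeaderAndIsrResponse.
        }

        public boolean isReplicaOffline(String topicPartition) {
            return offlineReplicas.contains(topicPartition);
        }
    }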
2) The additional code is not much.

I expect the code for KIP-112 to be around 1100 lines of new code. Previously I have implemented a prototype of a slightly different design (see here
<https://docs.google.com/document/d/1Izza0SBmZMVUBUt9s_-Dqi3D8e0KGJQYW8xgEdRsgAI/edit>)
and uploaded it to github (see here <https://github.com/lindong28/kafka/tree/JBOD>). The patch changed 33 files, added 1185 lines and deleted 183 lines. The size of the prototype patch is actually smaller than the patch of KIP-107 (see here <https://github.com/apache/kafka/pull/2476>), which is already accepted. The KIP-107 patch changed 49 files, added 1349 lines and deleted 141 lines.
3) Comparison with one-broker-per-multiple-volumes

This KIP can improve the availability of Kafka in this case such that one failed volume doesn't bring down the entire broker.

4) Comparison with one-broker-per-volume

If each volume maps to multiple disks, then we still have a similar problem such that the broker will fail if any disk of the volume fails.

If each volume maps to one disk, it means that we need to deploy 10 brokers on a machine if the machine has 10 disks. I will explain the concerns with this approach in order of their importance.

- It is weird if we were to tell Kafka users to deploy 50 brokers on a machine of 50 disks.

- Either when the user deploys Kafka on a commercial cloud platform or when the user deploys their own cluster, the size of the largest disk is usually limited. There will be scenarios where the user wants to increase broker capacity by having multiple disks per broker. This JBOD KIP makes it feasible without hurting availability due to single disk failure.
- Automatic load rebalancing across disks will be easier and more flexible if one broker has multiple disks. This can be future work.

- There is a performance concern when you deploy 10 brokers vs. 1 broker on one machine. The metadata traffic in the cluster, including FetchRequest, ProduceResponse, MetadataRequest and so on, will all be 10X more. The packets-per-second will be 10X higher, which may limit performance if pps is the performance bottleneck. The number of sockets on the machine is 10X higher. And the number of replication threads will be 100X more. The impact will be more significant with an increasing number of disks per machine. Thus it will limit Kafka's scalability in the long term.

Thanks,
Dong

On Tue, Feb 7, 2017 at 1:51 AM, Eno Thereska <eno.there...@gmail.com> wrote:

Hi Dong,

To simplify the discussion today, on my part I'll zoom into one thing only:

- I'll discuss the options called below: "one-broker-per-disk" or "one-broker-per-few-disks".
- I completely buy the JBOD vs RAID arguments so there is no need to discuss that part for me. I buy it that JBODs are good.

I find the terminology can be improved a bit. Ideally we'd be talking about volumes, not disks, just to make it clear that Kafka understands volumes/directories, not individual raw disks. So by "one-broker-per-few-disks" what I mean is that the admin can pool a few disks together to create a volume/directory and give that to Kafka.

The kernel of my question will be that the admin already has tools to 1) create volumes/directories from a JBOD, 2) start a broker on a desired machine, and 3) assign a broker resources like a directory. I claim that those tools are sufficient to optimise resource allocation. I understand that a broker could manage point 3) itself, i.e. juggle the directories. My question is whether the complexity added to Kafka is justified. Operationally it seems to me an admin will still have to do all three items above.

Looking forward to the discussion

Thanks
Eno
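[Editor's note: the deployment shapes being compared here differ mostly in how the brokers' log.dirs setting is laid out. A sketch with made-up mount points (log.dirs is the real broker property; the paths are illustrative only):]

    # JBOD as proposed in this KIP: one broker process owning one directory per disk
    log.dirs=/mnt/disk1/kafka-logs,/mnt/disk2/kafka-logs,/mnt/disk3/kafka-logs

    # one-broker-per-disk (or per-volume): several broker processes on the same machine,
    # each with a unique broker.id and listener port, and a single directory, e.g.
    log.dirs=/mnt/disk1/kafka-logs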
On 1 Feb 2017, at 17:21, Dong Lin <lindon...@gmail.com> wrote:

Hey Eno,

Thanks much for the review.

I think your suggestion is to split the disks of a machine into multiple disk sets and run one broker per disk set. Yeah, this is similar to Colin's suggestion of one-broker-per-disk, which we have evaluated at LinkedIn and considered to be a good short term approach.

As of now I don't think any of these approaches is a better alternative in the long term. I will summarize them here. I have put these reasons in the KIP's motivation section and rejected alternatives section. I am happy to discuss more and I would certainly like to use an alternative solution that is easier to do with better performance.
- JBOD vs. RAID-10: if we switch from RAID-10 with replication-factor=2 to JBOD with replication-factor=3, we get a 25% reduction in disk usage and double the tolerance of broker failures before data unavailability, from 1 to 2. This is a pretty huge gain for any company that uses Kafka at large scale.

- JBOD vs. one-broker-per-disk: The benefit of one-broker-per-disk is that no major code change is needed in Kafka. Among the disadvantages of one-broker-per-disk summarized in the KIP and the previous email with Colin, the biggest one is the 15% throughput loss compared to JBOD and less flexibility to balance across disks. Further, it probably requires changes to internal deployment tools at various companies to deal with the one-broker-per-disk setup.

- JBOD vs. RAID-0: This is the setup used at Microsoft. The problem is that a broker becomes unavailable if any disk fails. Suppose replication-factor=2 and there are 10 disks per machine. Then the probability of any message becoming unavailable due to disk failure with RAID-0 is 100X higher than that with JBOD (a back-of-the-envelope calculation follows after this list).

- JBOD vs. one-broker-per-few-disks: one-broker-per-few-disks is somewhere between one-broker-per-disk and RAID-0, so it carries an average of the disadvantages of these two approaches.
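[Editor's note: one way to arrive at the 100X figure in the RAID-0 comparison, assuming disks fail independently with probability p over some window. With RAID-0, a broker is down whenever any of its 10 disks fails, i.e. with probability roughly 10p for small p, so a partition with replication-factor=2 is unavailable with probability about (10p)^2 = 100p^2. With JBOD, only the two specific disks holding the two replicas matter, so the partition is unavailable with probability about p^2. The ratio is therefore about 100.]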
To answer your question: I think it is reasonable to manage disks in Kafka. By "managing disks" we mean the management of the assignment of replicas across disks. Here are my reasons in more detail:

- I don't think this KIP is a big step change. By allowing the user to configure Kafka to run with multiple log directories or disks as of now, it is implicit that Kafka manages disks. It is just not a complete feature. Microsoft and probably other companies are using this feature with the undesirable effect that a broker will fail if any disk fails. It is good to complete this feature.

- I think it is reasonable to manage disks in Kafka. One of the most important things that Kafka is doing is to determine the replica assignment across brokers and make sure enough copies of a given replica are available.
I would argue that it is not much different, conceptually, from determining the replica assignment across disks.

- I would agree that this KIP improves the performance of Kafka at the cost of more complexity inside Kafka, by switching from RAID-10 to JBOD. I would argue that this is the right direction. If we can gain 20%+ performance by managing NICs in Kafka as compared to the existing approach and other alternatives, I would say we should just do it. Such a gain in performance, or equivalently reduction in cost, can save millions of dollars per year for any company running Kafka at large scale.

Thanks,
Dong

On Wed, Feb 1, 2017 at 5:41 AM, Eno Thereska <eno.there...@gmail.com> wrote:

I'm coming somewhat late to the discussion, apologies for that.

I'm worried about this proposal. It's moving Kafka to a world where it manages disks. So in a sense, the scope of the KIP is limited, but the direction it sets for Kafka is quite a big step change.
Fundamentally this is about balancing resources for a Kafka broker. This can be done by a tool, rather than by changing Kafka. E.g., the tool would take a bunch of disks together, create a volume over them and export that to a Kafka broker (in addition to setting the memory limits for that broker or limiting other resources). A different bunch of disks can then make up a second volume, and be used by another Kafka broker. This is aligned with what Colin is saying (as I understand it).

Disks are not the only resource on a machine; there are several instances where multiple NICs are used, for example. Do we want fine-grained management of all these resources? I'd argue that opens up the system to a lot of complexity.
Thanks
Eno

On 1 Feb 2017, at 01:53, Dong Lin <lindon...@gmail.com> wrote:

Hi all,

I am going to initiate the vote if there is no further concern with the KIP.

Thanks,
Dong

On Fri, Jan 27, 2017 at 8:08 PM, radai <radai.rosenbl...@gmail.com> wrote:

a few extra points:

1. broker per disk might also incur more client <--> broker sockets: suppose every producer / consumer "talks" to >1 partition, there's a very good chance that partitions that were co-located on a single 10-disk broker would now be split between several single-disk broker processes on the same machine. hard to put a multiplier on this, but likely >x1. sockets are a limited resource at the OS level and incur some memory cost (kernel buffers).
2. there's a memory overhead to spinning up a JVM (compiled code and byte code objects etc). if we assume this overhead is ~300 MB (order of magnitude, specifics vary), then spinning up 10 JVMs would lose you 3 GB of RAM. not a ton, but non negligible.

3. there would also be some overhead downstream of kafka in any management / monitoring / log aggregation system. likely less than x10 though.

4. (related to above) - added complexity of administration with more running instances.

is anyone running kafka with anywhere near 100GB heaps? i thought the point was to rely on kernel page cache to do the disk buffering ....

On Thu, Jan 26, 2017 at 11:00 AM, Dong Lin <lindon...@gmail.com> wrote:

Hey Colin,

Thanks much for the comment. Please see my comments inline.
On Thu, Jan 26, 2017 at 10:15 AM, Colin McCabe <cmcc...@apache.org> wrote:

On Wed, Jan 25, 2017, at 13:50, Dong Lin wrote:
> Hey Colin,
>
> Good point! Yeah we have actually considered and tested this solution, which we call one-broker-per-disk. It would work and should require no major change in Kafka as compared to this JBOD KIP. So it would be a good short term solution.
>
> But it has a few drawbacks which make it less desirable in the long term. Assume we have 10 disks on a machine. Here are the problems:

Colin: Hi Dong, thanks for the thoughtful reply.

> 1) Our stress test result shows that one-broker-per-disk has 15% lower throughput.
>
> 2) Controller would need to send 10X as many LeaderAndIsrRequest, MetadataUpdateRequest and StopReplicaRequest.
> This increases the burden on the controller, which can be the performance bottleneck.

Colin: Maybe I'm misunderstanding something, but there would not be 10x as many StopReplicaRequest RPCs, would there? The other requests would increase 10x, but from a pretty low base, right? We are not reassigning partitions all the time, I hope (or else we have bigger problems...)

Dong: I think the controller will group StopReplicaRequests per broker and send only one StopReplicaRequest to a broker during controlled shutdown. Anyway, we don't have to worry about this if we agree that the other requests will increase by 10X. One MetadataRequest is sent to each broker in the cluster every time there is a leadership change. I am not sure this is a real problem. But in theory this makes the overhead complexity O(number of brokers) and may be a concern in the future.
Ideally we should avoid it.

> 3) Less efficient use of physical resources on the machine. The number of sockets on each machine will increase by 10X. The number of connections between any two machines will increase by 100X.
>
> 4) Less efficient way to manage memory and quota.
>
> 5) Rebalance between disks/brokers on the same machine will be less efficient and less flexible. The broker has to read data from another broker on the same machine via a socket. It is also harder to do automatic load balancing between disks on the same machine in the future.
>
> I will put this and the explanation in the rejected alternatives section. I have a few questions:
>
> - Can you explain why this solution can help avoid the scalability bottleneck?
>> To really answer this question we'd have to take a deep dive into the locking of the broker and figure out how effectively it can parallelize truly independent requests. Almost every multithreaded process is going to have shared state, like shared queues or shared sockets, that is going to make scaling less than linear when you add disks or processors. (And clearly, another option is to improve that scalability, rather than going multi-process!)
>
> Yeah, I also think it is better to improve scalability inside the Kafka code if possible. I am not sure we currently have any scalability issue inside Kafka that cannot be removed without using multi-process.
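As a toy illustration of the shared-state point (not Kafka code, and the class name is invented): when every handler thread funnels through one synchronized structure, adding threads or disks cannot give linear scaling, because the monitor serializes part of each request.

    import java.util.ArrayDeque;
    import java.util.Queue;

    public class SharedQueueBottleneck {
        private final Queue<Runnable> pending = new ArrayDeque<>();

        // All handler threads contend on the same monitor, so request handling
        // is partially serialized no matter how many threads are added.
        public synchronized void submit(Runnable request) {
            pending.add(request);
        }

        public synchronized Runnable next() {
            return pending.poll();
        }
    }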
>>> - It is true that a garbage collection in one broker would not affect the others. But that is after every broker only uses 1/10 of the memory. Can we be sure that this will actually help performance?
>>
>> The big question is, how much memory do Kafka brokers use now, and how much will they use in the future? Our experience in HDFS was that once you start getting more than 100-200GB Java heap sizes, full GCs start taking minutes to finish when using the standard JVMs. That alone is a good reason to go multi-process or consider storing more things off the Java heap.
>
> I see. Now I agree one-broker-per-disk should be more efficient in terms of GC, since each broker probably needs less than 1/10 of the memory available on a typical machine nowadays. I will remove this from the reasons for rejection.
>> Disk failure is the "easy" case. The "hard" case, which is unfortunately also the much more common case, is disk misbehavior. Towards the end of their lives, disks tend to start slowing down unpredictably. Requests that would have completed immediately before start taking 20, 100, 500 milliseconds. Some files may be readable and other files may not be. System calls hang, sometimes forever, and the Java process can't abort them, because the hang is in the kernel. It is not fun when threads are stuck in "D state" http://stackoverflow.com/questions/20423521/process-perminantly-stuck-on-d-state . Even kill -9 cannot abort the thread then. Fortunately, this is rare.
>
> I agree it is a harder problem, and it is rare. We probably don't have to worry about it in this KIP, since this issue is orthogonal to whether or not we use JBOD.
We >> > > > > > > > >>>>> > > > >> probably >> > > > > > > > >>>>> > > > >> > > don't >> > > > > > > > >>>>> > > > >> > > > > >> have >> > > > > > > > >>>>> > > > >> > > > > >> >> to >> > > > > > > > >>>>> > > > >> > > > > >> >>>>> worry about it in this KIP >> > since >> > > > this >> > > > > > > issue >> > > > > > > > >>>>> is >> > > > > > > > >>>>> > > > >> orthogonal to >> > > > > > > > >>>>> > > > >> > > > > >> whether or >> > > > > > > > >>>>> > > > >> > > > > >> >>>> not >> > > > > > > > >>>>> > > > >> > > > > >> >>>>> we use JBOD. >> > > > > > > > >>>>> > > > >> > > > > >> >>>>> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> Another approach we should >> > > > consider >> > > > > is >> > > > > > > for >> > > > > > > > >>>>> Kafka >> > > > > > > > >>>>> > to >> > > > > > > > >>>>> > > > >> > > implement its >> > > > > > > > >>>>> > > > >> > > > > >> own >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> storage layer that would >> > stripe >> > > > > across >> > > > > > > > >>>>> multiple >> > > > > > > > >>>>> > > disks. >> > > > > > > > >>>>> > > > >> This >> > > > > > > > >>>>> > > > >> > > > > >> wouldn't >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> have to be done at the >> block >> > > > level, >> > > > > > but >> > > > > > > > >>>>> could be >> > > > > > > > >>>>> > > done >> > > > > > > > >>>>> > > > >> at the >> > > > > > > > >>>>> > > > >> > > file >> > > > > > > > >>>>> > > > >> > > > > >> >>>> level. >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> We could use consistent >> > hashing >> > > to >> > > > > > > > >>>>> determine which >> > > > > > > > >>>>> > > > >> disks a >> > > > > > > > >>>>> > > > >> > > file >> > > > > > > > >>>>> > > > >> > > > > >> should >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> end up on, for example. >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>> Are you suggesting that we >> > should >> > > > > > > > distribute >> > > > > > > > >>>>> log, >> > > > > > > > >>>>> > or >> > > > > > > > >>>>> > > > log >> > > > > > > > >>>>> > > > >> > > segment, >> > > > > > > > >>>>> > > > >> > > > > >> >> across >> > > > > > > > >>>>> > > > >> > > > > >> >>>>> disks of brokers? I am not >> sure >> > > if >> > > > I >> > > > > > > fully >> > > > > > > > >>>>> > understand >> > > > > > > > >>>>> > > > >> this >> > > > > > > > >>>>> > > > >> > > > > >> approach. My >> > > > > > > > >>>>> > > > >> > > > > >> >>>> gut >> > > > > > > > >>>>> > > > >> > > > > >> >>>>> feel is that this would be a >> > > > drastic >> > > > > > > > >>>>> solution that >> > > > > > > > >>>>> > > > would >> > > > > > > > >>>>> > > > >> > > require >> > > > > > > > >>>>> > > > >> > > > > >> >>>>> non-trivial design. 
>> best,
>> Colin
>>
>>> Thanks,
>>> Dong
>>>
>>> On Wed, Jan 25, 2017 at 11:34 AM, Colin McCabe <cmcc...@apache.org> wrote:
>>>
>>>> Hi Dong,
>>>>
>>>> Thanks for the writeup! It's very interesting.
>>>>
>>>> I apologize in advance if this has been discussed somewhere else. But I am curious if you have considered the solution of running multiple brokers per node. Clearly there is a memory overhead with this solution because of the fixed cost of starting multiple JVMs. However, running multiple JVMs would help avoid scalability bottlenecks. You could probably push more RPCs per second, for example.
>>>> A garbage collection in one broker would not affect the others. It would be interesting to see this considered in the "alternate designs" section, even if you end up deciding it's not the way to go.
>>>>
>>>> best,
>>>> Colin
>>>>
>>>> On Thu, Jan 12, 2017, at 10:46, Dong Lin wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> We created KIP-112: Handle disk failure for JBOD. Please find the KIP wiki at https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD.
>>>>>
>>>>> This KIP is related to KIP-113 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories>: Support replicas movement between log directories. They are needed in order to support JBOD in Kafka. Please help review the KIP. Your feedback is appreciated!
>>>>> Thanks,
>>>>> Dong