Hey Jun,

I think there is a simpler design that doesn't need the "create" flag
in LeaderAndIsrRequest and also removes the need for the controller to
track/update which replicas are created. The idea is for each broker to
persist the created replicas in a per-broker-per-topic znode. When a replica
is created or deleted, the broker updates the znode accordingly. When the
broker receives a LeaderAndIsrRequest, it learns the "create" flag from its
cache of these znodes' data. When a broker starts, it does need to read a
number of znodes proportional to the number of topics on its disks. But the
controller still needs to learn about offline replicas from LeaderAndIsrResponse.
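
For concreteness, here is a rough sketch of the idea in Java (the znode layout,
class and method names below are only illustrative, not part of the proposal):

// Sketch only: a broker-side cache of per-broker-per-topic "created replicas"
// znodes, e.g. /brokers/ids/[brokerId]/created-replicas/[topic] => partition ids.
import java.util.*;

public class CreatedReplicaCache {
    private final Map<String, Set<Integer>> createdByTopic = new HashMap<>();

    // Populated from the per-topic znodes on startup; updated on create/delete,
    // and the corresponding znode is written back at the same time.
    public void markCreated(String topic, int partition) {
        createdByTopic.computeIfAbsent(topic, t -> new HashSet<>()).add(partition);
    }

    // When a LeaderAndIsrRequest arrives, the broker no longer needs a "create"
    // flag from the controller: it creates the replica only if it has never been
    // created on this broker before.
    public boolean shouldCreate(String topic, int partition) {
        return !createdByTopic.getOrDefault(topic, Collections.emptySet())
                              .contains(partition);
    }
}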

I think this is better than the current design. Do you have any concern
with this design?

Thanks,
Dong


On Thu, Feb 23, 2017 at 7:12 PM, Dong Lin <lindon...@gmail.com> wrote:

> Hey Jun,
>
> Sure, here is my explanation.
>
> Design B would not work if it doesn't store created replicas in ZK.
> For example, say broker B is healthy when it is shut down. At this moment no
> offline replica is written in ZK for this broker. Suppose a log directory is
> damaged while the broker is offline. Then when this broker starts, it won't
> know which replicas are on the bad log directory. And it won't be able to
> specify those offline replicas in /failed-log-directory either.
>
> Let's say design B stores created replicas in ZK. Then the next problem is
> that, in the scenario where multiple log directories are damaged while the
> broker is offline, the broker won't be able to know the exact list of offline
> replicas on each bad log directory when it starts. All it knows is the set of
> offline replicas across all those bad log directories. Thus it is impossible
> for the broker to specify offline replicas per log directory in this scenario.
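
To make this concrete, a small sketch (with made-up partition names) of what the
broker can and cannot derive in that scenario:

// Sketch only: with just the created-replica set and the replicas found on the
// surviving good directories, the broker can compute WHICH replicas are offline,
// but not which of the two damaged directories each offline replica lived on.
import java.util.*;

public class OfflineAttribution {
    public static void main(String[] args) {
        Set<String> created = new HashSet<>(Arrays.asList("t1-0", "t1-1", "t2-0", "t2-1"));
        Set<String> foundOnGoodDirs = new HashSet<>(Collections.singletonList("t2-1"));

        Set<String> offline = new HashSet<>(created);
        offline.removeAll(foundOnGoodDirs);   // {t1-0, t1-1, t2-0}

        // If both dir1 and dir2 were damaged while the broker was down, nothing is
        // left to split this set between dir1 and dir2, so per-directory
        // /failed-log-directory/[directory] znodes cannot be filled in correctly.
        System.out.println("offline replicas (aggregate only): " + offline);
    }
}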
>
> I agree with your observation that, if the admin replaces dir1 with a
> good empty disk but leaves dir2 untouched, design A won't create the replica
> whereas design B can. But I am not sure that is a problem we want to
> optimize for. It seems reasonable for the admin to fix both log
> directories in practice. If the admin fixes only one of the two log
> directories, we can say it is a partial fix and Kafka won't re-create any
> offline replicas on dir1 and dir2. Similar to the extra round of
> LeaderAndIsrRequest in case of log directory failure, I think this is also a
> pretty minor issue with design A.
>
> Thanks,
> Dong
>
>
> On Thu, Feb 23, 2017 at 6:46 PM, Jun Rao <j...@confluent.io> wrote:
>
>> Hi, Dong,
>>
>> My replies are inlined below.
>>
>> On Thu, Feb 23, 2017 at 4:47 PM, Dong Lin <lindon...@gmail.com> wrote:
>>
>> > Hey Jun,
>> >
>> > Thanks for your reply! Let me first comment on the things that you
>> > listed as advantages of B over A.
>> >
>> > 1) No change in LeaderAndIsrRequest protocol.
>> >
>> > I agree with this.
>> >
>> > 2) Step 1. One less round of LeaderAndIsrRequest and no additional ZK
>> > writes to record the created flag.
>> >
>> > I don't think this is true. There will be one round of LeaderAndIsrRequest
>> > in both A and B. In design A the controller needs to write to ZK once to
>> > record this replica as created. In design B the broker needs to write to
>> > zookeeper once to record this replica as created. So there is the same
>> > number of LeaderAndIsrRequests and ZK writes.
>> >
>> > The broker needs to record created replicas in design B so that when it
>> > bootstraps with a failed log directory, it can derive the offline replicas
>> > as the difference between the created replicas and the replicas found on
>> > good log directories.
>> >
>> >
>> Design B actually doesn't write created replicas in ZK. When a broker
>> starts up, all offline replicas are stored in the /failed-log-directory
>> path in ZK. So if a replica is not there and is not in the live log
>> directories either, it's never created. Does this work?
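
In other words, something like the following rule (the helper and parameter names
are made up, just to illustrate):

// Sketch only: in design B a replica counts as "never created" if it is neither
// listed under /brokers/ids/[brokerId]/failed-log-directory/* nor present in a
// live log directory.
import java.util.Set;

public class DesignBCreateDecision {
    public static boolean neverCreated(String replica,
                                       Set<String> replicasInFailedDirZnodes,
                                       Set<String> replicasOnLiveDirs) {
        return !replicasInFailedDirZnodes.contains(replica)
                && !replicasOnLiveDirs.contains(replica);
    }
}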
>>
>>
>>
>> > 3) Step 2. One less round of LeaderAndIsrRequest and no additional
>> logic to
>> > handle LeaderAndIsrResponse.
>> >
>> > While I agree there is one less round of LeaderAndIsrRequest in design
>> B, I
>> > don't think one additional LeaderAndIsrRequest to handle log directory
>> > failure is a big deal given that it doesn't happen frequently.
>> >
>> > Also, while there is no additional logic to handle LeaderAndIsrResponse in
>> > design B, I actually think this is something that the controller should do
>> > anyway. Say the broker stops responding to any requests without removing
>> > itself from zookeeper; the only way for the controller to realize this and
>> > re-elect the leader is to send a request to this broker and handle the
>> > response. It is a problem that we don't do this as of now.
>> >
>> > 4) Step 6. Additional ZK reads proportional to # of failed log
>> directories,
>> > instead of # of partitions.
>> >
>> > If one znode is able to describe all topic partitions in a log
>> directory,
>> > then the existing znode /brokers/topics/[topic] should be able to
>> describe
>> > created replicas in addition to the assigned replicas for every
>> partition
>> > of the topic. In this case, design A requires no additional ZK reads
>> > whereas design B requires ZK reads proportional to the # of failed log
>> > directories.
>> >
>> > 5) Step 3. In design A, if a broker is restarted and the failed log
>> > directory is unreadable, the broker doesn't know which replicas are on the
>> > failed log directory. So, when the broker receives the LeaderAndIsrRequest
>> > with created = false, it's a bit hard for the broker to decide whether it
>> > should create the missing replica on other log directories. This is easier
>> > in design B since the list of failed replicas is persisted in ZK.
>> >
>> > I don't understand why it is hard for the broker to make a decision in
>> > design A. With design A, if a broker is started with a failed log directory
>> > and it receives a LeaderAndIsrRequest with created=false for a replica that
>> > cannot be found on any good log directory, the broker will not create this
>> > replica. Is there any drawback with this approach?
>> >
>> >
>> >
>> I was thinking about this case. Suppose two log directories dir1 and dir2
>> fail. The admin replaces dir1 with an empty new disk. The broker is
>> restarted with dir1 alive, and dir2 still failing. Now, when receiving a
>> LeaderAndIsrRequest including a replica that was previously in dir1, the
>> broker won't be able to create that replica even though it could.
>>
>>
>>
>> > Here is my summary of pros and cons of design B as compared to design A.
>> >
>> > pros:
>> >
>> > 1) No change to LeaderAndIsrRequest.
>> > 2) One less round of LeaderAndIsrRequest in case of log directory
>> failure.
>> >
>> > cons:
>> >
>> > 1) It is impossible for the broker to figure out the log directory of
>> > offline replicas for failed-log-directory/[directory] if multiple log
>> > directories are unreadable when the broker starts.
>> >
>> >
>> Hmm, I am not sure that I get this point. If multiple log directories
>> fail,
>> design B stores each directory under /failed-log-directory, right?
>>
>> Thanks,
>>
>> Jun
>>
>>
>>
>> > 2) The znode size limit of failed-log-directory/[directory] essentially
>> > limits the number of topic partitions that can exist on a log
>> directory. It
>> > becomes more of a problem when a broker is configured to use multiple
>> log
>> > directories, each of which is a RAID-10 of large capacity. While this may
>> > not be a problem in practice with an additional requirement (e.g. don't use
>> > more than one log directory if using RAID-10), ideally we want to avoid
>> > such a limit.
>> >
>> > 3) An extra ZK read of failed-log-directory/[directory] when the broker
>> > starts.
>> >
>> >
>> > My main concern with design B is the use of the znode
>> > /brokers/ids/[brokerId]/failed-log-directory/[directory]. I don't really
>> > think the other pros/cons of design B matter to us. Does my summary make
>> > sense?
>> >
>> > Thanks,
>> > Dong
>> >
>> >
>> > On Thu, Feb 23, 2017 at 2:20 PM, Jun Rao <j...@confluent.io> wrote:
>> >
>> > > Hi, Dong,
>> > >
>> > > Just so that we are on the same page. Let me spec out the alternative
>> > > design a bit more and then compare. Let's call the current design A
>> and
>> > the
>> > > alternative design B.
>> > >
>> > > Design B:
>> > >
>> > > New ZK path
>> > > failed log directory path (persistent): This is created by a broker
>> when
>> > a
>> > > log directory fails and is potentially removed when the broker is
>> > > restarted.
>> > > /brokers/ids/[brokerId]/failed-log-directory/directory1 => { json of
>> the
>> > > replicas in the log directory }.
>> > >
>> > > *1. Topic gets created*
>> > > - Works the same as before.
>> > >
>> > > *2. A log directory stops working on a broker during runtime*
>> > >
>> > > - The controller watches the path /failed-log-directory for the new
>> > znode.
>> > >
>> > > - The broker detects an offline log directory during runtime and marks
>> > > affected replicas as offline in memory.
>> > >
>> > > - The broker writes the failed directory and all replicas in the
>> failed
>> > > directory under /failed-log-directory/directory1.
>> > >
>> > > - The controller reads /failed-log-directory/directory1 and stores in
>> > > memory a list of failed replicas due to disk failures.
>> > >
>> > > - The controller moves those replicas due to disk failure to offline
>> > state
>> > > and triggers the state change in replica state machine.
>> > >
>> > >
>> > > *3. Broker is restarted*
>> > >
>> > > - The broker reads /brokers/ids/[brokerId]/failed-log-directory, if
>> any.
>> > >
>> > > - For each failed log directory it reads from ZK, if the log directory
>> > > exists in log.dirs and is accessible now, or if the log directory no
>> > longer
>> > > exists in log.dirs, remove that log directory from
>> failed-log-directory.
>> > > Otherwise, the broker loads replicas in the failed log directory in
>> > memory
>> > > as offline.
>> > >
>> > > - The controller handles the failed log directory change event, if
>> needed
>> > > (same as #2).
>> > >
>> > > - The controller handles the broker registration event.
>> > >
>> > >
>> > > *6. Controller failover*
>> > > - Controller reads all child paths under /failed-log-directory to
>> rebuild
>> > > the list of failed replicas due to disk failures. Those replicas will
>> be
>> > > transitioned to the offline state during controller initialization.
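
For illustration, the per-directory znode value and the restart-time cleanup in
step 3 could look roughly like this (the JSON fields and helper names are
assumptions, not a spec):

// Sketch only.
import java.util.Set;

public class FailedLogDirZnode {
    // Example payload for /brokers/ids/1/failed-log-directory/directory1:
    // the replicas that lived in that directory.
    static final String EXAMPLE_VALUE =
        "{\"version\":1,\"partitions\":[{\"topic\":\"t1\",\"partition\":0}," +
        "{\"topic\":\"t1\",\"partition\":3}]}";

    // Step 3: on broker restart, delete the znode if the directory is no longer
    // configured in log.dirs, or if it is configured and accessible again;
    // otherwise keep it and load its replicas in memory as offline.
    static boolean shouldDeleteZnode(String dir, Set<String> logDirs, Set<String> accessibleDirs) {
        boolean stillConfigured = logDirs.contains(dir);
        boolean accessibleNow = accessibleDirs.contains(dir);
        return !stillConfigured || accessibleNow;
    }
}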
>> > >
>> > > Comparing this with design A, I think the following are the things
>> that
>> > > design B simplifies.
>> > > * No change in LeaderAndIsrRequest protocol.
>> > > * Step 1. One less round of LeaderAndIsrRequest and no additional ZK
>> > writes
>> > > to record the created flag.
>> > > * Step 2. One less round of LeaderAndIsrRequest and no additional
>> logic
>> > to
>> > > handle LeaderAndIsrResponse.
>> > > * Step 6. Additional ZK reads proportional to # of failed log
>> > directories,
>> > > instead of # of partitions.
>> > > * Step 3. In design A, if a broker is restarted and the failed log
>> > > directory is unreadable, the broker doesn't know which replicas are on the
>> > > failed log directory. So, when the broker receives the LeaderAndIsrRequest
>> > > with created = false, it's a bit hard for the broker to decide whether it
>> > > should create the missing replica on other log directories. This is easier
>> > > in design B since the list of failed replicas is persisted in ZK.
>> > >
>> > > Now, for some of the other things that you mentioned.
>> > >
>> > > * What happens if a log directory is renamed?
>> > > I think this can be handled in the same way as non-existing log
>> directory
>> > > during broker restart.
>> > >
>> > > * What happens if replicas are moved manually across disks?
>> > > Good point. Well, if all log directories are available, the failed log
>> > > directory path will be cleared. In the rarer case that a log
>> directory is
>> > > still offline and one of the replicas registered in the failed log
>> > > directory shows up in another available log directory, I am not quite
>> > sure.
>> > > Perhaps the simplest approach is to just error out and let the admin
>> fix
>> > > things manually?
>> > >
>> > > Thanks,
>> > >
>> > > Jun
>> > >
>> > >
>> > >
>> > > On Wed, Feb 22, 2017 at 3:39 PM, Dong Lin <lindon...@gmail.com>
>> wrote:
>> > >
>> > > > Hey Jun,
>> > > >
>> > > > Thanks much for the explanation. I have some questions about 21 but
>> > that
>> > > is
>> > > > less important than 20. 20 would require considerable change to the
>> KIP
>> > > and
>> > > > probably requires weeks to discuss again. Thus I would like to be
>> very
>> > > sure
>> > > > that we agree on the problems with the current design as you
>> mentioned
>> > > and
>> > > > there is no foreseeable problem with the alternate design.
>> > > >
>> > > > Please see my detailed response below. To summarize my points, I
>> couldn't
>> > > > figure out any non-trivial drawback of the current design as
>> compared to
>> > > the
>> > > > alternative design; and I couldn't figure out a good way to store
>> > offline
>> > > > replicas in the alternative design. Can you see if these points make
>> > > sense?
>> > > > Thanks in advance for your time!!
>> > > >
>> > > >
>> > > > 1) The alternative design requires slightly more dependency on ZK.
>> > While
>> > > > both solutions store created replicas in the ZK, the alternative
>> design
>> > > > would also store offline replicas in ZK, which the current design
>> > > > doesn't.
>> > > >
>> > > > 2) I am not sure that we should store offline replicas in znode
>> > > > /brokers/ids/[brokerId]/failed-log-directory/[directory]. We
>> probably
>> > > > don't
>> > > > want to expose log directory path in zookeeper based on the concept
>> > that
>> > > we
>> > > > should only store logical information (e.g. topic, brokerId) in
>> > > zookeeper's
>> > > > path name. More specifically, we probably don't want to rename a path in
>> > > > zookeeper simply because a user renamed a log directory. And we probably
>> > > > don't want to read/write these znodes just because a user manually moved
>> > > > replicas between log directories.
>> > > >
>> > > > 3) I couldn't find a good way to store offline replicas in ZK in the
>> > > > alternative design. We can store this information one znode
>> per-topic,
>> > > > per-brokerId, or per-brokerId-topic. All these choices have their
>> own
>> > > > problems. If we store it in per-topic znode then multiple brokers
>> may
>> > > need
>> > > > to read/write offline replicas in the same znode which is generally
>> > bad.
>> > > If
>> > > > we store it per-brokerId then we effectively limit the maximum
>> number
>> > of
>> > > > topic-partition that can be stored on a broker by the znode size
>> limit.
>> > > > This contradicts the idea to expand the single broker capacity by
>> > > throwing
>> > > > in more disks. If we store it per-brokerId-topic, then when
>> controller
>> > > > starts, it needs to read number of brokerId*topic znodes which may
>> > double
>> > > > the overall znode reads during controller startup.
>> > > >
>> > > > 4) The alternative design is less efficient than the current design
>> in
>> > > case
>> > > > of log directory failure. The alternative design requires extra
>> znode
>> > > reads
>> > > > in order to read offline replicas from zk while the current design
>> > > requires
>> > > > only one pair of LeaderAndIsrRequest and LeaderAndIsrResponse. The
>> > extra
>> > > > znode reads will be proportional to the number of topics on the
>> broker
>> > if
>> > > > we store offline replicas per-brokerId-topic.
>> > > >
>> > > > 5) While I agree that the failure reporting should be done where the
>> > > > failure originated, I think the current design is consistent with what
>> > > > we are already doing. With the current design, the broker sends a
>> > > > notification via zookeeper and the controller sends a LeaderAndIsrRequest
>> > > > to the broker. This is similar to how the broker sends an isr change
>> > > > notification and the controller reads the latest isr. If we do want the
>> > > > broker to report failures directly to the controller, we should probably
>> > > > have the broker send an RPC directly to the controller as it does for
>> > > > ControlledShutdownRequest. I can do this as well.
>> > > >
>> > > > 6) I don't think the current design requires additional state management
>> > > > in each of the existing state handling paths such as topic creation or
>> > > > controller failover. All of this existing logic should stay exactly the
>> > > > same except that the controller should recognize offline replicas on live
>> > > > brokers instead of assuming all replicas on live brokers are live. But
>> > > > this additional change is required in both the current design and the
>> > > > alternate design. Thus there should be no difference between the current
>> > > > design and the alternate design with respect to this existing state
>> > > > handling logic in the controller.
>> > > >
>> > > > 7) While I agree that the current design requires additional complexity
>> > > > in the controller in order to handle LeaderAndIsrResponse and potentially
>> > > > change partition and replica state to offline in the state machines, I
>> > > > think such logic is necessary in a well-designed controller either with
>> > > > the alternate design or even without JBOD. The controller should be able
>> > > > to handle errors (e.g. ClusterAuthorizationException) in
>> > > > LeaderAndIsrResponse and every response in general. For example, if the
>> > > > controller hasn't received a LeaderAndIsrResponse after a given period of
>> > > > time, it probably means the broker has hung and the controller should
>> > > > consider this broker as offline and re-elect leaders from other brokers.
>> > > > This would actually fix some problems we have seen before at LinkedIn,
>> > > > where a broker hangs due to RAID-controller failure. In other words, I
>> > > > think it is a good idea for the controller to handle the response.
>> > > >
>> > > > 8) I am not sure that the additional state management to handle
>> > > > LeaderAndIsrResponse causes new types of synchronization. It is true that
>> > > > the logic is not handled by the ZK event handling thread. But the
>> > > > existing ControlledShutdownRequest is also not handled by the ZK event
>> > > > handling thread. The LeaderAndIsrResponse can be handled by the same
>> > > > thread that currently handles ControlledShutdownRequest so that we don't
>> > > > require a new type of synchronization. Further, it should be rare to
>> > > > require additional synchronization to handle LeaderAndIsrResponse because
>> > > > we only need synchronization when there are offline replicas.
>> > > >
>> > > >
>> > > > Thanks,
>> > > > Dong
>> > > >
>> > > > On Wed, Feb 22, 2017 at 10:36 AM, Jun Rao <j...@confluent.io> wrote:
>> > > >
>> > > > > Hi, Dong, Jiangjie,
>> > > > >
>> > > > > 20. (1) I agree that ideally we'd like to use direct RPC for
>> > > > > broker-to-broker communication instead of ZK. However, in the
>> > > alternative
>> > > > > design, the failed log directory path also serves as the
>> persistent
>> > > state
>> > > > > for remembering the offline partitions. This is similar to the
>> > > > > controller_managed_state path in your design. The difference is
>> that
>> > > the
>> > > > > alternative design stores the state in fewer ZK paths, which helps
>> > > reduce
>> > > > > the controller failover time. (2) I agree that we want the
>> controller
>> > > to
>> > > > be
>> > > > > the single place to make decisions. However, intuitively, the
>> failure
>> > > > > reporting should be done where the failure is originated. For
>> > example,
>> > > > if a
>> > > > > broker fails, the broker reports failure by de-registering from
>> ZK.
>> > The
>> > > > > failed log directory path is similar in that regard. (3) I am not
>> > > worried
>> > > > > about the additional load from extra LeaderAndIsrRequest. What I
>> > worry
>> > > > > about is any unnecessary additional complexity in the controller.
>> To
>> > > me,
>> > > > > the additional complexity in the current design is the additional
>> > state
>> > > > > management in each of the existing state handling (e.g., topic
>> > > creation,
>> > > > > controller failover, etc), and the additional synchronization
>> since
>> > the
>> > > > > additional state management is not initiated from the ZK event
>> > handling
>> > > > > thread.
>> > > > >
>> > > > > 21. One of the reasons that we need to send a StopReplicaRequest
>> to
>> > > > offline
>> > > > > replica is to handle controlled shutdown. In that case, a broker
>> is
>> > > still
>> > > > > alive, but indicates to the controller that it plans to shut down.
>> > > Being
>> > > > > able to stop the replica in the shutting down broker reduces
>> churns
>> > in
>> > > > ISR.
>> > > > > So, for simplicity, it's probably easier to always send a
>> > > > > StopReplicaRequest
>> > > > > to any offline replica.
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > Jun
>> > > > >
>> > > > >
>> > > > > On Tue, Feb 21, 2017 at 2:37 PM, Dong Lin <lindon...@gmail.com>
>> > wrote:
>> > > > >
>> > > > > > Hey Jun,
>> > > > > >
>> > > > > > Thanks much for your comments.
>> > > > > >
>> > > > > > I actually proposed the design to store both offline replicas
>> and
>> > > > created
>> > > > > > replicas in per-broker znode before switching to the design in
>> the
>> > > > > current
>> > > > > > KIP. The current design stores created replicas in per-partition
>> > > znode
>> > > > > and
>> > > > > > transmits offline replicas via LeaderAndIsrResponse. The
>> original
>> > > > > solution
>> > > > > > is roughly the same as what you suggested. The advantage of the
>> > > current
>> > > > > > solution is kind of philosophical: 1) we want to transmit data
>> > (e.g.
>> > > > > > offline replicas) using RPC and reduce dependency on zookeeper;
>> 2)
>> > we
>> > > > > want
>> > > > > > controller to be the only one that determines any state (e.g.
>> > offline
>> > > > > > replicas) that will be exposed to user. The advantage of the
>> > solution
>> > > > to
>> > > > > > store offline replica in zookeeper is that we can save one
>> > roundtrip
>> > > > time
>> > > > > > for controller to handle log directory failure. However, this
>> extra
>> > > > > > roundtrip time should not be a big deal since the log directory
>> > > failure
>> > > > > is
>> > > > > > rare and inefficiency of extra latency is less of a problem when
>> > > there
>> > > > is
>> > > > > > log directory failure.
>> > > > > >
>> > > > > > Do you think the two philosophical advantages of the current KIP
>> > make
>> > > > > > sense? If not, then I can switch to the original design that
>> stores
>> > > > > offline
>> > > > > > replicas in zookeeper. It is actually written already. One
>> > > disadvantage
>> > > > > is
>> > > > > > that we have to make a non-trivial change to the KIP (e.g. no create
>> > flag
>> > > in
>> > > > > > LeaderAndIsrRequest and no created flag in zookeeper) and restart
>> this
>> > > KIP
>> > > > > > discussion.
>> > > > > >
>> > > > > > Regarding 21, it seems to me that LeaderAndIsrRequest/
>> > > > StopReplicaRequest
>> > > > > > only makes sense when broker can make the choice (e.g. fetch
>> data
>> > for
>> > > > > this
>> > > > > > replica or not). In the case that the log directory of the
>> replica
>> > is
>> > > > > > already offline, the broker has to stop fetching data for this
>> replica
>> > > > > > regardless of what controller tells it to do. Thus it seems
>> cleaner
>> > > for
>> > > > > > the broker to stop fetching data for this replica immediately. The
>> > advantage
>> > > > of
>> > > > > > this solution is that the controller logic is simpler since it
>> > > doesn't
>> > > > > need
>> > > > > > to send StopReplicaRequest in case of log directory failure, and
>> > the
>> > > > > log4j
>> > > > > > log is also cleaner. Is there a specific advantage in having the
>> > > > > > controller tell the broker to stop fetching data for offline replicas?
>> > > > > >
>> > > > > > Regarding 22, I agree with your observation that it will
>> happen. I
>> > > will
>> > > > > > update the KIP and specify that the broker will exit with a proper
>> error
>> > > > > message
>> > > > > > in the log and user needs to manually remove partitions and
>> restart
>> > > the
>> > > > > > broker.
>> > > > > >
>> > > > > > Thanks!
>> > > > > > Dong
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > On Mon, Feb 20, 2017 at 10:17 PM, Jun Rao <j...@confluent.io>
>> > wrote:
>> > > > > >
>> > > > > > > Hi, Dong,
>> > > > > > >
>> > > > > > > Sorry for the delay. A few more comments.
>> > > > > > >
>> > > > > > > 20. One complexity that I found in the current KIP is that the
>> > way
>> > > > the
>> > > > > > > broker communicates failed replicas to the controller is
>> > > inefficient.
>> > > > > > When
>> > > > > > > a log directory fails, the broker only sends an indication
>> > through
>> > > ZK
>> > > > > to
>> > > > > > > the controller and the controller has to issue a
>> > > LeaderAndIsrRequest
>> > > > to
>> > > > > > > discover which replicas are offline due to log directory
>> failure.
>> > > An
>> > > > > > > alternative approach is that when a log directory fails, the
>> > broker
>> > > > > just
>> > > > > > > writes the failed directory and the corresponding topic
>> > > > partitions
>> > > > > > in a
>> > > > > > > new failed log directory ZK path like the following.
>> > > > > > >
>> > > > > > > Failed log directory path:
>> > > > > > > /brokers/ids/[brokerId]/failed-log-directory/directory1 => {
>> > json
>> > > of
>> > > > > the
>> > > > > > > topic partitions in the log directory }.
>> > > > > > >
>> > > > > > > The controller just watches for child changes in
>> > > > > > > /brokers/ids/[brokerId]/failed-log-directory.
>> > > > > > > After reading this path, the controller knows the exact set of
>> > replicas
>> > > > > that
>> > > > > > > are offline and can trigger that replica state change
>> > accordingly.
>> > > > This
>> > > > > > > saves an extra round of LeaderAndIsrRequest handling.
>> > > > > > >
>> > > > > > > With this new ZK path, we can probably get rid
>> > > > > of/broker/topics/[topic]/
>> > > > > > > partitions/[partitionId]/controller_managed_state. The
>> creation
>> > > of a
>> > > > > new
>> > > > > > > replica is expected to always succeed unless all log
>> directories
>> > > > fail,
>> > > > > in
>> > > > > > > which case, the broker goes down anyway. Then, during
>> controller
>> > > > > > failover,
>> > > > > > > the controller just needs to additionally read from ZK the
>> extra
>> > > > failed
>> > > > > > log
>> > > > > > > directory paths, which is many fewer than topics or
>> partitions.
>> > > > > > >
>> > > > > > > On broker startup, if a log directory becomes available, the
>> > > > > > corresponding
>> > > > > > > log directory path in ZK will be removed.
>> > > > > > >
>> > > > > > > The downside of this approach is that the value of this new ZK
>> > path
>> > > > can
>> > > > > > be
>> > > > > > > large. However, even with 5K partition per log directory and
>> 100
>> > > > bytes
>> > > > > > per
>> > > > > > > partition, the size of the value is 500KB, still less than the
>> > > > default
>> > > > > > 1MB
>> > > > > > > znode limit in ZK.
>> > > > > > >
>> > > > > > > 21. "Broker will remove offline replica from its replica
>> fetcher
>> > > > > > threads."
>> > > > > > > The proposal lets the broker remove the replica from the
>> replica
>> > > > > fetcher
>> > > > > > > thread when it detects a directory failure. An alternative is
>> to
>> > > only
>> > > > > do
>> > > > > > > that until the broker receives the LeaderAndIsrRequest/
>> > > > > > StopReplicaRequest.
>> > > > > > > The benefit of this is that the controller is the only one who
>> > > > decides
>> > > > > > > which replica to be removed from the replica fetcher threads.
>> The
>> > > > > broker
>> > > > > > > also doesn't need additional logic to remove the replica from
>> > > replica
>> > > > > > > fetcher threads. The downside is that in a small window, the
>> > > replica
>> > > > > > fetch
>> > > > > > > thread will keep writing to the failed log directory and may
>> > > pollute
>> > > > > the
>> > > > > > > log4j log.
>> > > > > > >
>> > > > > > > 22. In the current design, there is a potential corner case
>> issue
>> > > > that
>> > > > > > the
>> > > > > > > same partition may exist in more than one log directory at
>> some
>> > > > point.
>> > > > > > > Consider the following steps: (1) a new topic t1 is created
>> and
>> > the
>> > > > > > > controller sends LeaderAndIsrRequest to a broker; (2) the
>> broker
>> > > > > creates
>> > > > > > > partition t1-p1 in log dir1; (3) before the broker sends a
>> > > response,
>> > > > it
>> > > > > > > goes down; (4) the broker is restarted with log dir1
>> unreadable;
>> > > (5)
>> > > > > the
>> > > > > > > broker receives a new LeaderAndIsrRequest and creates
>> partition
>> > > t1-p1
>> > > > > on
>> > > > > > > log dir2; (6) at some point, the broker is restarted with log
>> > dir1
>> > > > > fixed.
>> > > > > > > Now partition t1-p1 is in two log dirs. The alternative
>> approach
>> > > > that I
>> > > > > > > suggested above may suffer from a similar corner case issue.
>> > Since
>> > > > this
>> > > > > > is
>> > > > > > > rare, if the broker detects this during broker startup, it can
>> > > > probably
>> > > > > > > just log an error and exit. The admin can remove the redundant
>> > > > > partitions
>> > > > > > > manually and then restart the broker.
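
For illustration, such a startup check could look roughly like this (the helper
name and error handling are assumptions):

// Sketch only: if the same partition directory shows up in more than one
// configured log directory, log an error and exit so the admin can clean up
// manually before restarting the broker.
import java.io.File;
import java.util.*;

public class DuplicatePartitionCheck {
    static void checkNoDuplicatePartitionDirs(List<String> logDirs) {
        Map<String, String> seen = new HashMap<>();   // partition dir name -> log dir
        for (String logDir : logDirs) {
            File[] partitionDirs = new File(logDir).listFiles(File::isDirectory);
            if (partitionDirs == null) continue;      // skip inaccessible log dirs
            for (File p : partitionDirs) {
                String previous = seen.put(p.getName(), logDir);
                if (previous != null) {
                    System.err.printf("Partition %s found in both %s and %s; " +
                        "remove one copy and restart the broker.%n",
                        p.getName(), previous, logDir);
                    System.exit(1);
                }
            }
        }
    }
}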
>> > > > > > >
>> > > > > > > Thanks,
>> > > > > > >
>> > > > > > > Jun
>> > > > > > >
>> > > > > > > On Sat, Feb 18, 2017 at 9:31 PM, Dong Lin <
>> lindon...@gmail.com>
>> > > > wrote:
>> > > > > > >
>> > > > > > > > Hey Jun,
>> > > > > > > >
>> > > > > > > > Could you please let me know if the solutions above could
>> > address
>> > > > > your
>> > > > > > > > concern? I really want to move the discussion forward.
>> > > > > > > >
>> > > > > > > > Thanks,
>> > > > > > > > Dong
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Tue, Feb 14, 2017 at 8:17 PM, Dong Lin <
>> lindon...@gmail.com
>> > >
>> > > > > wrote:
>> > > > > > > >
>> > > > > > > > > Hey Jun,
>> > > > > > > > >
>> > > > > > > > > Thanks for all your help and time to discuss this KIP.
>> When
>> > you
>> > > > get
>> > > > > > the
>> > > > > > > > > time, could you let me know if the previous answers
>> address
>> > the
>> > > > > > > concern?
>> > > > > > > > >
>> > > > > > > > > I think the more interesting question in your last email
>> is
>> > > where
>> > > > > we
>> > > > > > > > > should store the "created" flag in ZK. I proposed the
>> > solution
>> > > > > that I
>> > > > > > > > like
>> > > > > > > > > most, i.e. store it together with the replica assignment
>> data
>> > > in
>> > > > > the
>> > > > > > > > /brokers/topics/[topic].
>> > > > > > > > > In order to expedite discussion, let me provide another
>> two
>> > > ideas
>> > > > > to
>> > > > > > > > > address the concern just in case the first idea doesn't
>> work:
>> > > > > > > > >
>> > > > > > > > > - We can avoid extra controller ZK read when there is no
>> disk
>> > > > > failure
>> > > > > > > > > (95% of time?). When controller starts, it doesn't
>> > > > > > > > > read controller_managed_state in ZK and sends
>> > > LeaderAndIsrRequest
>> > > > > > with
>> > > > > > > > > "create = false". Only if LeaderAndIsrResponse shows
>> failure
>> > > for
>> > > > > any
>> > > > > > > > > replica, then controller will read
>> controller_managed_state
>> > for
>> > > > > this
>> > > > > > > > > partition and re-send LeaderAndIsrRequset with
>> "create=true"
>> > if
>> > > > > this
>> > > > > > > > > replica has not been created.
>> > > > > > > > >
>> > > > > > > > > - We can significantly reduce this ZK read time by making
>> > > > > > > > > controller_managed_state a topic level information in ZK,
>> > e.g.
>> > > > > > > > > /brokers/topics/[topic]/state. Given that most topic has
>> 10+
>> > > > > > partition,
>> > > > > > > > > the extra ZK read time should be less than 10% of the
>> > existing
>> > > > > total
>> > > > > > zk
>> > > > > > > > > read time during controller failover.
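
A rough sketch of the first idea (all names below are made up for illustration):

// Sketch only: the controller optimistically sends create = false, and only falls
// back to reading controller_managed_state and re-sending create = true for the
// replicas that report an error.
import java.util.*;

public class LazyCreateFlagController {
    interface Broker {
        // returns the replicas the broker could not find on its good log directories
        Set<String> sendLeaderAndIsr(Set<String> replicas, boolean create);
    }
    interface StateStore {
        boolean isCreated(String replica);   // reads controller_managed_state in ZK
    }

    static void propagate(Broker broker, StateStore zk, Set<String> replicas) {
        Set<String> missing = broker.sendLeaderAndIsr(replicas, false);
        Set<String> toCreate = new HashSet<>();
        for (String r : missing) {
            if (!zk.isCreated(r)) {          // only pay the ZK read on failure
                toCreate.add(r);
            }
        }
        if (!toCreate.isEmpty()) {
            broker.sendLeaderAndIsr(toCreate, true);
        }
    }
}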
>> > > > > > > > >
>> > > > > > > > > Thanks!
>> > > > > > > > > Dong
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > On Tue, Feb 14, 2017 at 7:30 AM, Dong Lin <
>> > lindon...@gmail.com
>> > > >
>> > > > > > wrote:
>> > > > > > > > >
>> > > > > > > > >> Hey Jun,
>> > > > > > > > >>
>> > > > > > > > >> I just realized that you may be suggesting that a tool
>> for
>> > > > listing
>> > > > > > > > >> offline directories is necessary for KIP-112 by asking
>> > whether
>> > > > > > KIP-112
>> > > > > > > > and
>> > > > > > > > >> KIP-113 will be in the same release. I think such a tool
>> is
>> > > > useful
>> > > > > > but
>> > > > > > > > >> doesn't have to be included in KIP-112. This is because
>> as
>> > of
>> > > > now
>> > > > > > > admin
>> > > > > > > > >> needs to log into broker machine and check broker log to
>> > > figure
>> > > > > out
>> > > > > > > the
>> > > > > > > > >> cause of broker failure and the bad log directory in
>> case of
>> > > > disk
>> > > > > > > > failure.
>> > > > > > > > >> The KIP-112 won't make it harder since admin can still
>> > figure
>> > > > out
>> > > > > > the
>> > > > > > > > bad
>> > > > > > > > >> log directory by doing the same thing. Thus it is
>> probably
>> > OK
>> > > to
>> > > > > > just
>> > > > > > > > >> include this script in KIP-113. Regardless, my hope is to
>> > > finish
>> > > > > > both
>> > > > > > > > KIPs
>> > > > > > > > >> ASAP and make them in the same release since both KIPs
>> are
>> > > > needed
>> > > > > > for
>> > > > > > > > the
>> > > > > > > > >> JBOD setup.
>> > > > > > > > >>
>> > > > > > > > >> Thanks,
>> > > > > > > > >> Dong
>> > > > > > > > >>
>> > > > > > > > >> On Mon, Feb 13, 2017 at 5:52 PM, Dong Lin <
>> > > lindon...@gmail.com>
>> > > > > > > wrote:
>> > > > > > > > >>
>> > > > > > > > >>> And the test plan has also been updated to simulate disk
>> > > > failure
>> > > > > by
>> > > > > > > > >>> changing log directory permission to 000.
>> > > > > > > > >>>
>> > > > > > > > >>> On Mon, Feb 13, 2017 at 5:50 PM, Dong Lin <
>> > > lindon...@gmail.com
>> > > > >
>> > > > > > > wrote:
>> > > > > > > > >>>
>> > > > > > > > >>>> Hi Jun,
>> > > > > > > > >>>>
>> > > > > > > > >>>> Thanks for the reply. These comments are very helpful.
>> Let
>> > > me
>> > > > > > answer
>> > > > > > > > >>>> them inline.
>> > > > > > > > >>>>
>> > > > > > > > >>>>
>> > > > > > > > >>>> On Mon, Feb 13, 2017 at 3:25 PM, Jun Rao <
>> > j...@confluent.io>
>> > > > > > wrote:
>> > > > > > > > >>>>
>> > > > > > > > >>>>> Hi, Dong,
>> > > > > > > > >>>>>
>> > > > > > > > >>>>> Thanks for the reply. A few more replies and new
>> comments
>> > > > > below.
>> > > > > > > > >>>>>
>> > > > > > > > >>>>> On Fri, Feb 10, 2017 at 4:27 PM, Dong Lin <
>> > > > lindon...@gmail.com
>> > > > > >
>> > > > > > > > wrote:
>> > > > > > > > >>>>>
>> > > > > > > > >>>>> > Hi Jun,
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > Thanks for the detailed comments. Please see answers
>> > > > inline:
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > On Fri, Feb 10, 2017 at 3:08 PM, Jun Rao <
>> > > j...@confluent.io
>> > > > >
>> > > > > > > wrote:
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > > Hi, Dong,
>> > > > > > > > >>>>> > >
>> > > > > > > > >>>>> > > Thanks for the updated wiki. A few comments below.
>> > > > > > > > >>>>> > >
>> > > > > > > > >>>>> > > 1. Topics get created
>> > > > > > > > >>>>> > > 1.1 Instead of storing successfully created
>> replicas
>> > in
>> > > > ZK,
>> > > > > > > could
>> > > > > > > > >>>>> we
>> > > > > > > > >>>>> > store
>> > > > > > > > >>>>> > > unsuccessfully created replicas in ZK? Since the
>> > latter
>> > > > is
>> > > > > > less
>> > > > > > > > >>>>> common,
>> > > > > > > > >>>>> > it
>> > > > > > > > >>>>> > > probably reduces the load on ZK.
>> > > > > > > > >>>>> > >
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > We can store unsuccessfully created replicas in ZK.
>> > But I
>> > > > am
>> > > > > > not
>> > > > > > > > >>>>> sure if
>> > > > > > > > >>>>> > that can reduce write load on ZK.
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > If we want to reduce write load on ZK using by store
>> > > > > > > unsuccessfully
>> > > > > > > > >>>>> created
>> > > > > > > > >>>>> > replicas in ZK, then broker should not write to ZK
>> if
>> > all
>> > > > > > > replicas
>> > > > > > > > >>>>> are
>> > > > > > > > >>>>> > successfully created. It means that if
>> > > > > > > > /broker/topics/[topic]/partiti
>> > > > > > > > >>>>> > ons/[partitionId]/controller_managed_state doesn't
>> > exist
>> > > > in
>> > > > > ZK
>> > > > > > > for
>> > > > > > > > >>>>> a given
>> > > > > > > > >>>>> > partition, we have to assume all replicas of this
>> > > partition
>> > > > > > have
>> > > > > > > > been
>> > > > > > > > >>>>> > successfully created and send LeaderAndIsrRequest
>> with
>> > > > > create =
>> > > > > > > > >>>>> false. This
>> > > > > > > > >>>>> > becomes a problem if controller crashes before
>> > receiving
>> > > > > > > > >>>>> > LeaderAndIsrResponse to validate whether a replica
>> has
>> > > been
>> > > > > > > > created.
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > I think this approach and reduce the number of bytes
>> > > stored
>> > > > > in
>> > > > > > > ZK.
>> > > > > > > > >>>>> But I am
>> > > > > > > > >>>>> > not sure if this is a concern.
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> I was mostly concerned about the controller failover
>> > time.
>> > > > > > > Currently,
>> > > > > > > > >>>>> the
>> > > > > > > > >>>>> controller failover is likely dominated by the cost of
>> > > > reading
>> > > > > > > > >>>>> topic/partition level information from ZK. If we add
>> > > another
>> > > > > > > > partition
>> > > > > > > > >>>>> level path in ZK, it probably will double the
>> controller
>> > > > > failover
>> > > > > > > > >>>>> time. If
>> > > > > > > > >>>>> the approach of representing the non-created replicas
>> > > doesn't
>> > > > > > work,
>> > > > > > > > >>>>> have
>> > > > > > > > >>>>> you considered just adding the created flag in the
>> > > > leaderAndIsr
>> > > > > > > path
>> > > > > > > > >>>>> in ZK?
>> > > > > > > > >>>>>
>> > > > > > > > >>>>>
>> > > > > > > > >>>> Yes, I have considered adding the created flag in the
>> > > > > leaderAndIsr
>> > > > > > > > path
>> > > > > > > > >>>> in ZK. If we were to add created flag per replica in
>> the
>> > > > > > > > >>>> LeaderAndIsrRequest, then it requires a lot of change
>> in
>> > the
>> > > > > code
>> > > > > > > > base.
>> > > > > > > > >>>>
>> > > > > > > > >>>> If we don't add created flag per replica in the
>> > > > > > LeaderAndIsrRequest,
>> > > > > > > > >>>> then the information in leaderAndIsr path in ZK and
>> > > > > > > > LeaderAndIsrRequest
>> > > > > > > > >>>> would be different. Further, the procedure for broker
>> to
>> > > > update
>> > > > > > ISR
>> > > > > > > > in ZK
>> > > > > > > > >>>> will be a bit complicated. When leader updates
>> > leaderAndIsr
>> > > > path
>> > > > > > in
>> > > > > > > > ZK, it
>> > > > > > > > >>>> will have to first read created flags from ZK, change
>> isr,
>> > > and
>> > > > > > write
>> > > > > > > > >>>> leaderAndIsr back to ZK. And it needs to check znode
>> > version
>> > > > and
>> > > > > > > > re-try
>> > > > > > > > >>>> write operation in ZK if controller has updated ZK
>> during
>> > > this
>> > > > > > > > period. This
>> > > > > > > > >>>> is in contrast to the current implementation where the
>> > > leader
>> > > > > > either
>> > > > > > > > gets
>> > > > > > > > >>>> all the information from LeaderAndIsrRequest sent by
>> > > > controller,
>> > > > > > or
>> > > > > > > > >>>> determine the infromation by itself (e.g. ISR), before
>> > > writing
>> > > > > to
>> > > > > > > > >>>> leaderAndIsr path in ZK.
>> > > > > > > > >>>>
>> > > > > > > > >>>> It seems to me that the above solution is a bit
>> > complicated
>> > > > and
>> > > > > > not
>> > > > > > > > >>>> clean. Thus I come up with the design in this KIP to
>> store
>> > > > this
>> > > > > > > > created
>> > > > > > > > >>>> flag in a separate zk path. The path is named
>> > > > > > > > controller_managed_state to
>> > > > > > > > >>>> indicate that we can store in this znode all
>> information
>> > > that
>> > > > is
>> > > > > > > > managed by
>> > > > > > > > >>>> controller only, as opposed to ISR.
>> > > > > > > > >>>>
>> > > > > > > > >>>> I agree with your concern of increased ZK read time
>> during
>> > > > > > > controller
>> > > > > > > > >>>> failover. How about we store the "created" information
>> in
>> > > the
>> > > > > > > > >>>> znode /brokers/topics/[topic]? We can change that
>> znode to
>> > > > have
>> > > > > > the
>> > > > > > > > >>>> following data format:
>> > > > > > > > >>>>
>> > > > > > > > >>>> {
>> > > > > > > > >>>>   "version" : 2,
>> > > > > > > > >>>>   "created" : {
>> > > > > > > > >>>>     "1" : [1, 2, 3],
>> > > > > > > > >>>>     ...
>> > > > > > > > >>>>   }
>> > > > > > > > >>>>   "partition" : {
>> > > > > > > > >>>>     "1" : [1, 2, 3],
>> > > > > > > > >>>>     ...
>> > > > > > > > >>>>   }
>> > > > > > > > >>>> }
>> > > > > > > > >>>>
>> > > > > > > > >>>> We won't have extra zk read using this solution. It
>> also
>> > > seems
>> > > > > > > > >>>> reasonable to put the partition assignment information
>> > > > together
>> > > > > > with
>> > > > > > > > >>>> replica creation information. The latter is only
>> changed
>> > > once
>> > > > > > after
>> > > > > > > > the
>> > > > > > > > >>>> partition is created or re-assigned.
>> > > > > > > > >>>>
>> > > > > > > > >>>>
>> > > > > > > > >>>>>
>> > > > > > > > >>>>>
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > > 1.2 If an error is received for a follower, does
>> the
>> > > > > > controller
>> > > > > > > > >>>>> eagerly
>> > > > > > > > >>>>> > > remove it from ISR or do we just let the leader
>> > removes
>> > > > it
>> > > > > > > after
>> > > > > > > > >>>>> timeout?
>> > > > > > > > >>>>> > >
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > No, Controller will not actively remove it from ISR.
>> > But
>> > > > > > > controller
>> > > > > > > > >>>>> will
>> > > > > > > > >>>>> > recognize it as offline replica and propagate this
>> > > > > information
>> > > > > > to
>> > > > > > > > all
>> > > > > > > > >>>>> > brokers via UpdateMetadataRequest. Each leader can
>> use
>> > > this
>> > > > > > > > >>>>> information to
>> > > > > > > > >>>>> > actively remove offline replica from ISR set. I have
>> > > > updated
>> > > > > to
>> > > > > > > > wiki
>> > > > > > > > >>>>> to
>> > > > > > > > >>>>> > clarify it.
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>>
>> > > > > > > > >>>>> That seems inconsistent with how the controller deals
>> > with
>> > > > > > offline
>> > > > > > > > >>>>> replicas
>> > > > > > > > >>>>> due to broker failures. When that happens, the broker
>> > will
>> > > > (1)
>> > > > > > > select
>> > > > > > > > >>>>> a new
>> > > > > > > > >>>>> leader if the offline replica is the leader; (2)
>> remove
>> > the
>> > > > > > replica
>> > > > > > > > >>>>> from
>> > > > > > > > >>>>> ISR if the offline replica is the follower. So,
>> > > intuitively,
>> > > > it
>> > > > > > > seems
>> > > > > > > > >>>>> that
>> > > > > > > > >>>>> we should be doing the same thing when dealing with
>> > offline
>> > > > > > > replicas
>> > > > > > > > >>>>> due to
>> > > > > > > > >>>>> disk failure.
>> > > > > > > > >>>>>
>> > > > > > > > >>>>
>> > > > > > > > >>>> My bad. I misunderstand how the controller currently
>> > handles
>> > > > > > broker
>> > > > > > > > >>>> failure and ISR change. Yes we should do the same thing
>> > when
>> > > > > > dealing
>> > > > > > > > with
>> > > > > > > > >>>> offline replicas here. I have updated the KIP to
>> specify
>> > > that,
>> > > > > > when
>> > > > > > > an
>> > > > > > > > >>>> offline replica is discovered by controller, the
>> > controller
>> > > > > > removes
>> > > > > > > > offline
>> > > > > > > > >>>> replicas from ISR in the ZK and sends
>> LeaderAndIsrRequest
>> > > with
>> > > > > > > > updated ISR
>> > > > > > > > >>>> to be used by partition leaders.
>> > > > > > > > >>>>
>> > > > > > > > >>>>
>> > > > > > > > >>>>>
>> > > > > > > > >>>>>
>> > > > > > > > >>>>>
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > > 1.3 Similar, if an error is received for a leader,
>> > > should
>> > > > > the
>> > > > > > > > >>>>> controller
>> > > > > > > > >>>>> > > trigger leader election again?
>> > > > > > > > >>>>> > >
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > Yes, controller will trigger leader election if
>> leader
>> > > > > replica
>> > > > > > is
>> > > > > > > > >>>>> offline.
>> > > > > > > > >>>>> > I have updated the wiki to clarify it.
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > >
>> > > > > > > > >>>>> > > 2. A log directory stops working on a broker
>> during
>> > > > > runtime:
>> > > > > > > > >>>>> > > 2.1 It seems the broker remembers the failed
>> > directory
>> > > > > after
>> > > > > > > > >>>>> hitting an
>> > > > > > > > >>>>> > > IOException and the failed directory won't be used
>> > for
>> > > > > > creating
>> > > > > > > > new
>> > > > > > > > >>>>> > > partitions until the broker is restarted? If so,
>> > could
>> > > > you
>> > > > > > add
>> > > > > > > > >>>>> that to
>> > > > > > > > >>>>> > the
>> > > > > > > > >>>>> > > wiki.
>> > > > > > > > >>>>> > >
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > Right, broker assumes a log directory to be good
>> after
>> > it
>> > > > > > starts,
>> > > > > > > > >>>>> and mark
>> > > > > > > > >>>>> > log directory as bad once there is IOException when
>> > > broker
>> > > > > > > attempts
>> > > > > > > > >>>>> to
>> > > > > > > > >>>>> > access the log directory. New replicas will only be
>> > > created
>> > > > > on
>> > > > > > > good
>> > > > > > > > >>>>> log
>> > > > > > > > >>>>> > directory. I just added this to the KIP.
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > > 2.2 Could you be a bit more specific on how and
>> > during
>> > > > > which
>> > > > > > > > >>>>> operation
>> > > > > > > > >>>>> > the
>> > > > > > > > >>>>> > > broker detects directory failure? Is it when the
>> > broker
>> > > > > hits
>> > > > > > an
>> > > > > > > > >>>>> > IOException
>> > > > > > > > >>>>> > > during writes, or both reads and writes?  For
>> > example,
>> > > > > during
>> > > > > > > > >>>>> broker
>> > > > > > > > >>>>> > > startup, it only reads from each of the log
>> > > directories,
>> > > > if
>> > > > > > it
>> > > > > > > > >>>>> hits an
>> > > > > > > > >>>>> > > IOException there, does the broker immediately
>> mark
>> > the
>> > > > > > > directory
>> > > > > > > > >>>>> as
>> > > > > > > > >>>>> > > offline?
>> > > > > > > > >>>>> > >
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > Broker marks log directory as bad once there is
>> > > IOException
>> > > > > > when
>> > > > > > > > >>>>> broker
>> > > > > > > > >>>>> > attempts to access the log directory. This includes
>> > read
>> > > > and
>> > > > > > > write.
>> > > > > > > > >>>>> These
>> > > > > > > > >>>>> > operations include log append, log read, log
>> cleaning,
>> > > > > > watermark
>> > > > > > > > >>>>> checkpoint
>> > > > > > > > >>>>> > etc. If broker hits IOException when it reads from
>> each
>> > > of
>> > > > > the
>> > > > > > > log
>> > > > > > > > >>>>> > directory during startup, it immediately mark the
>> > > directory
>> > > > > as
>> > > > > > > > >>>>> offline.
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > I just updated the KIP to clarify it.
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > > 3. Partition reassignment: If we know a replica is
>> > > > offline,
>> > > > > > do
>> > > > > > > we
>> > > > > > > > >>>>> still
>> > > > > > > > >>>>> > > want to send StopReplicaRequest to it?
>> > > > > > > > >>>>> > >
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > No, controller doesn't send StopReplicaRequest for
>> an
>> > > > offline
>> > > > > > > > >>>>> replica.
>> > > > > > > > >>>>> > Controller treats this scenario in the same way that
>> > > > exiting
>> > > > > > > Kafka
>> > > > > > > > >>>>> > implementation does when the broker of this replica
>> is
>> > > > > offline.
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > >
>> > > > > > > > >>>>> > > 4. UpdateMetadataRequestPartitionState: For
>> > > > > > offline_replicas,
>> > > > > > > do
>> > > > > > > > >>>>> they
>> > > > > > > > >>>>> > only
>> > > > > > > > >>>>> > > include offline replicas due to log directory
>> > failures
>> > > or
>> > > > > do
>> > > > > > > they
>> > > > > > > > >>>>> also
>> > > > > > > > >>>>> > > include offline replicas due to broker failure?
>> > > > > > > > >>>>> > >
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > UpdateMetadataRequestPartitionState's
>> offline_replicas
>> > > > > include
>> > > > > > > > >>>>> offline
>> > > > > > > > >>>>> > replicas due to both log directory failure and
>> broker
>> > > > > failure.
>> > > > > > > This
>> > > > > > > > >>>>> is to
>> > > > > > > > >>>>> > make the semantics of this field easier to
>> understand.
>> > > > Broker
>> > > > > > can
>> > > > > > > > >>>>> > distinguish whether a replica is offline due to
>> broker
>> > > > > failure
>> > > > > > or
>> > > > > > > > >>>>> disk
>> > > > > > > > >>>>> > failure by checking whether a broker is live in the
>> > > > > > > > >>>>> UpdateMetadataRequest.
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > >
>> > > > > > > > >>>>> > > 5. Tools: Could we add some kind of support in the
>> > tool
>> > > > to
>> > > > > > list
>> > > > > > > > >>>>> offline
>> > > > > > > > >>>>> > > directories?
>> > > > > > > > >>>>> > >
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > In KIP-112 we don't have tools to list offline
>> > > directories
>> > > > > > > because
>> > > > > > > > >>>>> we have
>> > > > > > > > >>>>> > intentionally avoided exposing log directory
>> > information
>> > > > > (e.g.
>> > > > > > > log
>> > > > > > > > >>>>> > directory path) to user or other brokers. I think we
>> > can
>> > > > add
>> > > > > > this
>> > > > > > > > >>>>> feature
>> > > > > > > > >>>>> > in KIP-113, in which we will have
>> DescribeDirsRequest
>> > to
>> > > > list
>> > > > > > log
>> > > > > > > > >>>>> directory
>> > > > > > > > >>>>> > information (e.g. partition assignment, path, size)
>> > > needed
>> > > > > for
>> > > > > > > > >>>>> rebalance.
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> Since we are introducing a new failure mode, if a
>> replica
>> > > > > becomes
>> > > > > > > > >>>>> offline
>> > > > > > > > >>>>> due to failure in log directories, the first thing an
>> > admin
>> > > > > wants
>> > > > > > > to
>> > > > > > > > >>>>> know
>> > > > > > > > >>>>> is which log directories are offline from the broker's
>> > > > > > perspective.
>> > > > > > > > >>>>> So,
>> > > > > > > > >>>>> including such a tool will be useful. Do you plan to
>> do
>> > > > KIP-112
>> > > > > > and
>> > > > > > > > >>>>> KIP-113
>> > > > > > > > >>>>>  in the same release?
>> > > > > > > > >>>>>
>> > > > > > > > >>>>>
>> > > > > > > > >>>> Yes, I agree that including such a tool is useful. This is
>> is
>> > > > > probably
>> > > > > > > > >>>> better to be added in KIP-113 because we need
>> > > > > DescribeDirsRequest
>> > > > > > to
>> > > > > > > > get
>> > > > > > > > >>>> this information. I will update KIP-113 to include this
>> > > tool.
>> > > > > > > > >>>>
>> > > > > > > > >>>> I plan to do KIP-112 and KIP-113 separately to make
>> each
>> > KIP
>> > > > and
>> > > > > > > their
>> > > > > > > > >>>> patch easier to review. I don't have any plan about
>> which
>> > > > > release
>> > > > > > to
>> > > > > > > > have
>> > > > > > > > >>>> these KIPs. My plan is to both of them ASAP. Is there
>> > > > particular
>> > > > > > > > timeline
>> > > > > > > > >>>> you prefer for code of these two KIPs to checked-in?
>> > > > > > > > >>>>
>> > > > > > > > >>>>
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > >
>> > > > > > > > >>>>> > > 6. Metrics: Could we add some metrics to show
>> offline
>> > > > > > > > directories?
>> > > > > > > > >>>>> > >
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > Sure. I think it makes sense to have each broker
>> report
>> > > its
>> > > > > > > number
>> > > > > > > > of
>> > > > > > > > >>>>> > offline replicas and offline log directories. The
>> > > previous
>> > > > > > metric
>> > > > > > > > >>>>> was put
>> > > > > > > > >>>>> > in KIP-113. I just added both metrics in KIP-112.
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > >
>> > > > > > > > >>>>> > > 7. There are still references to
>> kafka-log-dirs.sh.
>> > Are
>> > > > > they
>> > > > > > > > valid?
>> > > > > > > > >>>>> > >
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > My bad. I just removed this from "Changes in
>> > Operational
>> > > > > > > > Procedures"
>> > > > > > > > >>>>> and
>> > > > > > > > >>>>> > "Test Plan" in the KIP.
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > >
>> > > > > > > > >>>>> > > 8. Do you think KIP-113 is ready for review? One
>> > thing
>> > > > that
>> > > > > > > > KIP-113
>> > > > > > > > >>>>> > > mentions during partition reassignment is to first
>> > send
>> > > > > > > > >>>>> > > LeaderAndIsrRequest, followed by
>> > > ChangeReplicaDirRequest.
>> > > > > It
>> > > > > > > > seems
>> > > > > > > > >>>>> it's
>> > > > > > > > >>>>> > > better if the replicas are created in the right
>> log
>> > > > > directory
>> > > > > > > in
>> > > > > > > > >>>>> the
>> > > > > > > > >>>>> > first
>> > > > > > > > >>>>> > > place? The reason that I brought it up here is
>> > because
>> > > it
>> > > > > may
>> > > > > > > > >>>>> affect the
>> > > > > > > > >>>>> > > protocol of LeaderAndIsrRequest.
>> > > > > > > > >>>>> > >
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > Yes, KIP-113 is ready for review. The advantage of
>> the
>> > > > > current
>> > > > > > > > >>>>> design is
>> > > > > > > > >>>>> > that we can keep LeaderAndIsrRequest
>> > > > log-direcotry-agnostic.
>> > > > > > The
>> > > > > > > > >>>>> > implementation would be much easier to read if all
>> log
>> > > > > related
>> > > > > > > > logic
>> > > > > > > > >>>>> (e.g.
>> > > > > > > > >>>>> > various errors) are put in ChangeReplicadIRrequest
>> and
>> > > the
>> > > > > code
>> > > > > > > > path
>> > > > > > > > >>>>> of
>> > > > > > > > >>>>> > handling replica movement is separated from
>> leadership
>> > > > > > handling.
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > In other words, I think Kafka may be easier to
>> develop
>> > in
>> > > > the
>> > > > > > > long
>> > > > > > > > >>>>> term if
>> > > > > > > > >>>>> > we separate these two requests.
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> > I agree that ideally we want to create replicas in
>> the
>> > > > right
>> > > > > > log
>> > > > > > > > >>>>> directory
>> > > > > > > > >>>>> > in the first place. But I am not sure if there is
>> any
>> > > > > > performance
>> > > > > > > > or
>> > > > > > > > >>>>> > correctness concern with the existing way of moving
>> it
>> > > > after
>> > > > > it
>> > > > > > > is
>> > > > > > > > >>>>> created.
>> > > > > > > > >>>>> > Besides, does this decision affect the change
>> proposed
>> > in
>> > > > > > > KIP-112?
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> >
>> > > > > > > > >>>>> I am just wondering if you have considered including
>> the
>> > > log
>> > > > > > > > directory
>> > > > > > > > >>>>> for
>> > > > > > > > >>>>> the replicas in the LeaderAndIsrRequest.
>> > > > > > > > >>>>>
>> > > > > > > > >>>>>
>> > > > > > > > >>>> Yeah I have thought about this idea, but only briefly.
>> I
>> > > > > rejected
>> > > > > > > this
>> > > > > > > > >>>> idea because log directory is broker's local
>> information
>> > > and I
>> > > > > > > prefer
>> > > > > > > > not
>> > > > > > > > >>>> to expose local config information to the cluster
>> through
>> > > > > > > > >>>> LeaderAndIsrRequest.
>> > > > > > > > >>>>
>> > > > > > > > >>>>
>> > > > > > > > >>>>> 9. Could you describe when the offline replicas due to
>> > log
>> > > > > > > directory
>> > > > > > > > >>>>> failure are removed from the replica fetch threads?
>> > > > > > > > >>>>>
>> > > > > > > > >>>>
>> > > > > > > > >>>> Yes. If the offline replica was a leader, either a new
>> > > leader
>> > > > is
>> > > > > > > > >>>> elected or all follower brokers will stop fetching for
>> > this
>> > > > > > > > partition. If
>> > > > > > > > >>>> the offline replica is a follower, the broker will stop
>> > > > fetching
>> > > > > > for
>> > > > > > > > this
>> > > > > > > > >>>> replica immediately. A broker stops fetching data for a
>> > > > replica
>> > > > > by
>> > > > > > > > removing
>> > > > > > > > >>>> the replica from the replica fetch threads. I have
>> updated
>> > > the
>> > > > > KIP
>> > > > > > > to
>> > > > > > > > >>>> clarify it.
>> > > > > > > > >>>>
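>> > > > > > > > >>>> To make "removing the replica from the replica fetch threads" a bit more
>> > > > > > > > >>>> concrete, here is a rough sketch of the mechanism. The names below are
>> > > > > > > > >>>> hypothetical and only illustrate the idea; they are not the actual Kafka code:
>> > > > > > > > >>>>
>> > > > > > > > >>>> // Hypothetical sketch: when a log directory goes offline, drop every
>> > > > > > > > >>>> // follower replica on it from the fetcher threads so that no further
>> > > > > > > > >>>> // FetchRequests are issued for those partitions.
>> > > > > > > > >>>> case class TopicPartition(topic: String, partition: Int)
>> > > > > > > > >>>>
>> > > > > > > > >>>> trait ReplicaFetcherManagerLike {
>> > > > > > > > >>>>   def removeFetcherForPartitions(partitions: Set[TopicPartition]): Unit
>> > > > > > > > >>>> }
>> > > > > > > > >>>>
>> > > > > > > > >>>> class LogDirFailureCleanup(fetcherManager: ReplicaFetcherManagerLike) {
>> > > > > > > > >>>>   def handleOfflineLogDir(offlineReplicas: Set[TopicPartition]): Unit = {
>> > > > > > > > >>>>     // The leader side is handled by leader election; here we only stop the
>> > > > > > > > >>>>     // follower fetchers for the replicas that just went offline.
>> > > > > > > > >>>>     fetcherManager.removeFetcherForPartitions(offlineReplicas)
>> > > > > > > > >>>>   }
>> > > > > > > > >>>> }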
>> > > > > > > > >>>>
>> > > > > > > > >>>>>
>> > > > > > > > >>>>> 10. The wiki mentioned changing the log directory to a
>> > file
>> > > > for
>> > > > > > > > >>>>> simulating
>> > > > > > > > >>>>> disk failure in system tests. Could we just change the
>> > > > > permission
>> > > > > > > of
>> > > > > > > > >>>>> the
>> > > > > > > > >>>>> log directory to 000 to simulate that?
>> > > > > > > > >>>>>
>> > > > > > > > >>>>
>> > > > > > > > >>>>
>> > > > > > > > >>>> Sure.
>> > > > > > > > >>>>
>> > > > > > > > >>>>
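>> > > > > > > > >>>> For reference, a minimal sketch of what flipping the permissions could look
>> > > > > > > > >>>> like (the path and names are made up for illustration; the real system tests
>> > > > > > > > >>>> are written differently):
>> > > > > > > > >>>>
>> > > > > > > > >>>> import java.nio.file.{Files, Paths}
>> > > > > > > > >>>> import java.nio.file.attribute.PosixFilePermissions
>> > > > > > > > >>>>
>> > > > > > > > >>>> object SimulateBadLogDir extends App {
>> > > > > > > > >>>>   val logDir = Paths.get("/tmp/kafka-logs")  // hypothetical log directory
>> > > > > > > > >>>>   // Remove all permissions so the broker sees an IOException on access.
>> > > > > > > > >>>>   Files.setPosixFilePermissions(logDir, PosixFilePermissions.fromString("---------"))
>> > > > > > > > >>>>   // ... run the failure assertions against the broker here ...
>> > > > > > > > >>>>   // Restore the permissions during cleanup.
>> > > > > > > > >>>>   Files.setPosixFilePermissions(logDir, PosixFilePermissions.fromString("rwxr-xr-x"))
>> > > > > > > > >>>> }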
>> > > > > > > > >>>>>
>> > > > > > > > >>>>> Thanks,
>> > > > > > > > >>>>>
>> > > > > > > > >>>>> Jun
>> > > > > > > > >>>>>
>> > > > > > > > >>>>>
>> > > > > > > > >>>>> > > Jun
>> > > > > > > > >>>>> > >
>> > > > > > > > >>>>> > > On Fri, Feb 10, 2017 at 9:53 AM, Dong Lin <
>> > > > > > lindon...@gmail.com
>> > > > > > > >
>> > > > > > > > >>>>> wrote:
>> > > > > > > > >>>>> > >
>> > > > > > > > >>>>> > > > Hi Jun,
>> > > > > > > > >>>>> > > >
>> > > > > > > > >>>>> > > > Can I replace zookeeper access with direct RPC
>> for
>> > > both
>> > > > > ISR
>> > > > > > > > >>>>> > notification
>> > > > > > > > >>>>> > > > and disk failure notification in a future KIP,
>> or
>> > do
>> > > > you
>> > > > > > feel
>> > > > > > > > we
>> > > > > > > > >>>>> should
>> > > > > > > > >>>>> > > do
>> > > > > > > > >>>>> > > > it in this KIP?
>> > > > > > > > >>>>> > > >
>> > > > > > > > >>>>> > > > Hi Eno, Grant and everyone,
>> > > > > > > > >>>>> > > >
>> > > > > > > > >>>>> > > > Is there further improvement you would like to
>> see
>> > > with
>> > > > > > this
>> > > > > > > > KIP?
>> > > > > > > > >>>>> > > >
>> > > > > > > > >>>>> > > > Thank you all for the comments,
>> > > > > > > > >>>>> > > >
>> > > > > > > > >>>>> > > > Dong
>> > > > > > > > >>>>> > > >
>> > > > > > > > >>>>> > > >
>> > > > > > > > >>>>> > > >
>> > > > > > > > >>>>> > > > On Thu, Feb 9, 2017 at 4:45 PM, Dong Lin <
>> > > > > > > lindon...@gmail.com>
>> > > > > > > > >>>>> wrote:
>> > > > > > > > >>>>> > > >
>> > > > > > > > >>>>> > > > >
>> > > > > > > > >>>>> > > > >
>> > > > > > > > >>>>> > > > > On Thu, Feb 9, 2017 at 3:37 PM, Colin McCabe <
>> > > > > > > > >>>>> cmcc...@apache.org>
>> > > > > > > > >>>>> > > wrote:
>> > > > > > > > >>>>> > > > >
>> > > > > > > > >>>>> > > > >> On Thu, Feb 9, 2017, at 11:40, Dong Lin
>> wrote:
>> > > > > > > > >>>>> > > > >> > Thanks for all the comments Colin!
>> > > > > > > > >>>>> > > > >> >
>> > > > > > > > >>>>> > > > >> > To answer your questions:
>> > > > > > > > >>>>> > > > >> > - Yes, a broker will shutdown if all its
>> log
>> > > > > > directories
>> > > > > > > > >>>>> are bad.
>> > > > > > > > >>>>> > > > >>
>> > > > > > > > >>>>> > > > >> That makes sense.  Can you add this to the
>> > > writeup?
>> > > > > > > > >>>>> > > > >>
>> > > > > > > > >>>>> > > > >
>> > > > > > > > >>>>> > > > > Sure. This has already been added. You can
>> find
>> > it
>> > > > here
>> > > > > > > > >>>>> > > > > <https://cwiki.apache.org/confluence/pages/
>> > > > > > > diffpagesbyversio
>> > > > > > > > >>>>> n.action
>> > > > > > > > >>>>> > ?
>> > > > > > > > >>>>> > > > pageId=67638402&selectedPageVersions=9&
>> > > > > > selectedPageVersions=
>> > > > > > > > 10>
>> > > > > > > > >>>>> > > > > .
>> > > > > > > > >>>>> > > > >
>> > > > > > > > >>>>> > > > >
>> > > > > > > > >>>>> > > > >>
>> > > > > > > > >>>>> > > > >> > - I updated the KIP to explicitly state
>> that a
>> > > log
>> > > > > > > > >>>>> directory will
>> > > > > > > > >>>>> > be
>> > > > > > > > >>>>> > > > >> > assumed to be good until broker sees
>> > IOException
>> > > > > when
>> > > > > > it
>> > > > > > > > >>>>> tries to
>> > > > > > > > >>>>> > > > access
>> > > > > > > > >>>>> > > > >> > the log directory.
>> > > > > > > > >>>>> > > > >>
>> > > > > > > > >>>>> > > > >> Thanks.
>> > > > > > > > >>>>> > > > >>
>> > > > > > > > >>>>> > > > >> > - Controller doesn't explicitly know
>> whether
>> > > there
>> > > > > is
>> > > > > > > new
>> > > > > > > > >>>>> log
>> > > > > > > > >>>>> > > > directory
>> > > > > > > > >>>>> > > > >> > or
>> > > > > > > > >>>>> > > > >> > not. All controller knows is whether
>> replicas
>> > > are
>> > > > > > online
>> > > > > > > > or
>> > > > > > > > >>>>> > offline
>> > > > > > > > >>>>> > > > >> based
>> > > > > > > > >>>>> > > > >> > on LeaderAndIsrResponse. According to the
>> > > existing
>> > > > > > Kafka
>> > > > > > > > >>>>> > > > implementation,
>> > > > > > > > >>>>> > > > >> > controller will always send
>> > LeaderAndIsrRequest
>> > > > to a
>> > > > > > > > broker
>> > > > > > > > >>>>> after
>> > > > > > > > >>>>> > it
>> > > > > > > > >>>>> > > > >> > bounces.
>> > > > > > > > >>>>> > > > >>
>> > > > > > > > >>>>> > > > >> I thought so.  It's good to clarify,
>> though.  Do
>> > > you
>> > > > > > think
>> > > > > > > > >>>>> it's
>> > > > > > > > >>>>> > worth
>> > > > > > > > >>>>> > > > >> adding a quick discussion of this on the
>> wiki?
>> > > > > > > > >>>>> > > > >>
>> > > > > > > > >>>>> > > > >
>> > > > > > > > >>>>> > > > > Personally I don't think it is needed. If
>> broker
>> > > > starts
>> > > > > > > with
>> > > > > > > > >>>>> no bad
>> > > > > > > > >>>>> > log
>> > > > > > > > >>>>> > > > > directory, everything should work as it is and we should not need to clarify
>> > > > > > > > >>>>> > > > > it. The KIP has already covered the scenario
>> > when a
>> > > > > > broker
>> > > > > > > > >>>>> starts
>> > > > > > > > >>>>> > with
>> > > > > > > > >>>>> > > > bad
>> > > > > > > > >>>>> > > > > log directory. Also, the KIP doesn't claim or
>> > hint
>> > > > that
>> > > > > > we
>> > > > > > > > >>>>> support
>> > > > > > > > >>>>> > > > dynamic
>> > > > > > > > >>>>> > > > > addition of new log directories. I think we
>> are
>> > > good.
>> > > > > > > > >>>>> > > > >
>> > > > > > > > >>>>> > > > >
>> > > > > > > > >>>>> > > > >> best,
>> > > > > > > > >>>>> > > > >> Colin
>> > > > > > > > >>>>> > > > >>
>> > > > > > > > >>>>> > > > >> >
>> > > > > > > > >>>>> > > > >> > Please see this
>> > > > > > > > >>>>> > > > >> > <https://cwiki.apache.org/conf
>> > > > > > > > >>>>> luence/pages/diffpagesbyversio
>> > > > > > > > >>>>> > > > >> n.action?pageId=67638402&
>> > selectedPageVersions=9&
>> > > > > > > > >>>>> > > > selectedPageVersions=10>
>> > > > > > > > >>>>> > > > >> > for the change of the KIP.
>> > > > > > > > >>>>> > > > >> >
>> > > > > > > > >>>>> > > > >> > On Thu, Feb 9, 2017 at 11:04 AM, Colin
>> McCabe
>> > <
>> > > > > > > > >>>>> cmcc...@apache.org
>> > > > > > > > >>>>> > >
>> > > > > > > > >>>>> > > > >> wrote:
>> > > > > > > > >>>>> > > > >> >
>> > > > > > > > >>>>> > > > >> > > On Thu, Feb 9, 2017, at 11:03, Colin
>> McCabe
>> > > > wrote:
>> > > > > > > > >>>>> > > > >> > > > Thanks, Dong L.
>> > > > > > > > >>>>> > > > >> > > >
>> > > > > > > > >>>>> > > > >> > > > Do we plan on bringing down the broker
>> > > process
>> > > > > > when
>> > > > > > > > all
>> > > > > > > > >>>>> log
>> > > > > > > > >>>>> > > > >> directories
>> > > > > > > > >>>>> > > > >> > > > are offline?
>> > > > > > > > >>>>> > > > >> > > >
>> > > > > > > > >>>>> > > > >> > > > Can you explicitly state on the KIP
>> that
>> > the
>> > > > log
>> > > > > > > dirs
>> > > > > > > > >>>>> are all
>> > > > > > > > >>>>> > > > >> considered
>> > > > > > > > >>>>> > > > >> > > > good after the broker process is
>> bounced?
>> > > It
>> > > > > > seems
>> > > > > > > > >>>>> like an
>> > > > > > > > >>>>> > > > >> important
>> > > > > > > > >>>>> > > > >> > > > thing to be clear about.  Also, perhaps
>> > > > discuss
>> > > > > > how
>> > > > > > > > the
>> > > > > > > > >>>>> > > controller
>> > > > > > > > >>>>> > > > >> > > > becomes aware of the newly good log
>> > > > directories
>> > > > > > > after
>> > > > > > > > a
>> > > > > > > > >>>>> broker
>> > > > > > > > >>>>> > > > >> bounce
>> > > > > > > > >>>>> > > > >> > > > (and whether this triggers
>> re-election).
>> > > > > > > > >>>>> > > > >> > >
>> > > > > > > > >>>>> > > > >> > > I meant to write, all the log dirs where
>> the
>> > > > > broker
>> > > > > > > can
>> > > > > > > > >>>>> still
>> > > > > > > > >>>>> > read
>> > > > > > > > >>>>> > > > the
>> > > > > > > > >>>>> > > > >> > > index and some other files.  Clearly, log
>> > dirs
>> > > > > that
>> > > > > > > are
>> > > > > > > > >>>>> > completely
>> > > > > > > > >>>>> > > > >> > > inaccessible will still be considered bad
>> > > after
>> > > > a
>> > > > > > > broker
>> > > > > > > > >>>>> process
>> > > > > > > > >>>>> > > > >> bounce.
>> > > > > > > > >>>>> > > > >> > >
>> > > > > > > > >>>>> > > > >> > > best,
>> > > > > > > > >>>>> > > > >> > > Colin
>> > > > > > > > >>>>> > > > >> > >
>> > > > > > > > >>>>> > > > >> > > >
>> > > > > > > > >>>>> > > > >> > > > +1 (non-binding) aside from that
>> > > > > > > > >>>>> > > > >> > > >
>> > > > > > > > >>>>> > > > >> > > >
>> > > > > > > > >>>>> > > > >> > > >
>> > > > > > > > >>>>> > > > >> > > > On Wed, Feb 8, 2017, at 00:47, Dong Lin
>> > > wrote:
>> > > > > > > > >>>>> > > > >> > > > > Hi all,
>> > > > > > > > >>>>> > > > >> > > > >
>> > > > > > > > >>>>> > > > >> > > > > Thank you all for the helpful
>> > suggestion.
>> > > I
>> > > > > have
>> > > > > > > > >>>>> updated the
>> > > > > > > > >>>>> > > KIP
>> > > > > > > > >>>>> > > > >> to
>> > > > > > > > >>>>> > > > >> > > > > address
>> > > > > > > > >>>>> > > > >> > > > > the comments received so far. See
>> here
>> > > > > > > > >>>>> > > > >> > > > > <https://cwiki.apache.org/conf
>> > > > > > > > >>>>> > luence/pages/diffpagesbyversio
>> > > > > > > > >>>>> > > > >> n.action?
>> > > > > > > > >>>>> > > > >> > > pageId=67638402&selectedPageVe
>> > > > > > > > >>>>> rsions=8&selectedPageVersions=
>> > > > > > > > >>>>> > 9>to
>> > > > > > > > >>>>> > > > >> > > > > read the changes of the KIP. Here is
>> a
>> > > > summary
>> > > > > > of
>> > > > > > > > >>>>> change:
>> > > > > > > > >>>>> > > > >> > > > >
>> > > > > > > > >>>>> > > > >> > > > > - Updated the Proposed Change
>> section to
>> > > > > change
>> > > > > > > the
>> > > > > > > > >>>>> recovery
>> > > > > > > > >>>>> > > > >> steps.
>> > > > > > > > >>>>> > > > >> > > After
>> > > > > > > > >>>>> > > > >> > > > > this change, broker will also create
>> > > replica
>> > > > > as
>> > > > > > > long
>> > > > > > > > >>>>> as all
>> > > > > > > > >>>>> > > log
>> > > > > > > > >>>>> > > > >> > > > > directories
>> > > > > > > > >>>>> > > > >> > > > > are working.
>> > > > > > > > >>>>> > > > >> > > > > - Removed kafka-log-dirs.sh from this
>> > KIP
>> > > > > since
>> > > > > > > user
>> > > > > > > > >>>>> no
>> > > > > > > > >>>>> > longer
>> > > > > > > > >>>>> > > > >> needs to
>> > > > > > > > >>>>> > > > >> > > > > use
>> > > > > > > > >>>>> > > > >> > > > > it for recovery from bad disks.
>> > > > > > > > >>>>> > > > >> > > > > - Explained how the znode
>> > > > > > controller_managed_state
>> > > > > > > > is
>> > > > > > > > >>>>> > managed
>> > > > > > > > >>>>> > > in
>> > > > > > > > >>>>> > > > >> the
>> > > > > > > > >>>>> > > > >> > > > > Public
>> > > > > > > > >>>>> > > > >> > > > > interface section.
>> > > > > > > > >>>>> > > > >> > > > > - Explained what happens during
>> > controller
>> > > > > > > failover,
>> > > > > > > > >>>>> > partition
>> > > > > > > > >>>>> > > > >> > > > > reassignment
>> > > > > > > > >>>>> > > > >> > > > > and topic deletion in the Proposed
>> > Change
>> > > > > > section.
>> > > > > > > > >>>>> > > > >> > > > > - Updated Future Work section to
>> include
>> > > the
>> > > > > > > > following
>> > > > > > > > >>>>> > > potential
>> > > > > > > > >>>>> > > > >> > > > > improvements
>> > > > > > > > >>>>> > > > >> > > > >   - Let broker notify controller of
>> ISR
>> > > > change
>> > > > > > and
>> > > > > > > > >>>>> disk
>> > > > > > > > >>>>> > state
>> > > > > > > > >>>>> > > > >> change
>> > > > > > > > >>>>> > > > >> > > via
>> > > > > > > > >>>>> > > > >> > > > > RPC instead of using zookeeper
>> > > > > > > > >>>>> > > > >> > > > >   - Handle various failure scenarios
>> > (e.g.
>> > > > > slow
>> > > > > > > > disk)
>> > > > > > > > >>>>> on a
>> > > > > > > > >>>>> > > > >> case-by-case
>> > > > > > > > >>>>> > > > >> > > > > basis. For example, we may want to
>> > detect
>> > > > slow
>> > > > > > > disk
>> > > > > > > > >>>>> and
>> > > > > > > > >>>>> > > consider
>> > > > > > > > >>>>> > > > >> it as
>> > > > > > > > >>>>> > > > >> > > > > offline.
>> > > > > > > > >>>>> > > > >> > > > >   - Allow admin to mark a directory
>> as
>> > bad
>> > > > so
>> > > > > > that
>> > > > > > > > it
>> > > > > > > > >>>>> will
>> > > > > > > > >>>>> > not
>> > > > > > > > >>>>> > > > be
>> > > > > > > > >>>>> > > > >> used.
>> > > > > > > > >>>>> > > > >> > > > >
>> > > > > > > > >>>>> > > > >> > > > > Thanks,
>> > > > > > > > >>>>> > > > >> > > > > Dong
>> > > > > > > > >>>>> > > > >> > > > >
>> > > > > > > > >>>>> > > > >> > > > >
>> > > > > > > > >>>>> > > > >> > > > >
>> > > > > > > > >>>>> > > > >> > > > > On Tue, Feb 7, 2017 at 5:23 PM, Dong
>> > Lin <
>> > > > > > > > >>>>> > lindon...@gmail.com
>> > > > > > > > >>>>> > > >
>> > > > > > > > >>>>> > > > >> wrote:
>> > > > > > > > >>>>> > > > >> > > > >
>> > > > > > > > >>>>> > > > >> > > > > > Hey Eno,
>> > > > > > > > >>>>> > > > >> > > > > >
>> > > > > > > > >>>>> > > > >> > > > > > Thanks much for the comment!
>> > > > > > > > >>>>> > > > >> > > > > >
>> > > > > > > > >>>>> > > > >> > > > > > I still think the complexity added
>> to
>> > > > Kafka
>> > > > > is
>> > > > > > > > >>>>> justified
>> > > > > > > > >>>>> > by
>> > > > > > > > >>>>> > > > its
>> > > > > > > > >>>>> > > > >> > > benefit.
>> > > > > > > > >>>>> > > > >> > > > > > Let me provide my reasons below.
>> > > > > > > > >>>>> > > > >> > > > > >
>> > > > > > > > >>>>> > > > >> > > > > > 1) The additional logic is easy to
>> > > > > understand
>> > > > > > > and
>> > > > > > > > >>>>> thus its
>> > > > > > > > >>>>> > > > >> complexity
>> > > > > > > > >>>>> > > > >> > > > > > should be reasonable.
>> > > > > > > > >>>>> > > > >> > > > > >
>> > > > > > > > >>>>> > > > >> > > > > > On the broker side, it needs to catch the exception when accessing a log
>> > > > > > > > >>>>> > > > >> > > > > > directory, mark the log directory and all its replicas as offline, notify the
>> > > > > > > > >>>>> > > > >> > > > > > controller by writing the zookeeper notification path, and specify the error
>> > > > > > > > >>>>> > > > >> > > > > > in LeaderAndIsrResponse. On the controller side, it will listen to zookeeper
>> > > > > > > > >>>>> > > > >> > > > > > for disk failure notification, learn about offline replicas in the
>> > > > > > > > >>>>> > > > >> > > > > > LeaderAndIsrResponse, and take offline replicas into consideration when
>> > > > > > > > >>>>> > > > >> > > > > > electing leaders. It also marks replicas as created in zookeeper and uses
>> > > > > > > > >>>>> > > > >> > > > > > that to determine whether a replica has been created.
>> > > > > > > > >>>>> > > > >> > > > > >
>> > > > > > > > >>>>> > > > >> > > > > > That is all the logic we need to
>> add
>> > in
>> > > > > > Kafka. I
>> > > > > > > > >>>>> > personally
>> > > > > > > > >>>>> > > > feel
>> > > > > > > > >>>>> > > > >> > > this is
>> > > > > > > > >>>>> > > > >> > > > > > easy to reason about.
>> > > > > > > > >>>>> > > > >> > > > > >
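>> > > > > > > > >>>>> > > > >> > > > > > To make the broker-side sequence above a bit more concrete, here is a very
>> > > > > > > > >>>>> > > > >> > > > > > rough sketch. All names, including the znode path, are made up for
>> > > > > > > > >>>>> > > > >> > > > > > illustration and do not reflect the actual implementation:
>> > > > > > > > >>>>> > > > >> > > > > >
>> > > > > > > > >>>>> > > > >> > > > > > // Hypothetical sketch: catch the IOException, mark the log directory and
>> > > > > > > > >>>>> > > > >> > > > > > // its replicas offline, and notify the controller via a zookeeper znode.
>> > > > > > > > >>>>> > > > >> > > > > > trait ZkClientLike {
>> > > > > > > > >>>>> > > > >> > > > > >   def createSequentialNode(pathPrefix: String, data: String): Unit
>> > > > > > > > >>>>> > > > >> > > > > > }
>> > > > > > > > >>>>> > > > >> > > > > >
>> > > > > > > > >>>>> > > > >> > > > > > class LogDirFailureHandler(zkClient: ZkClientLike, brokerId: Int) {
>> > > > > > > > >>>>> > > > >> > > > > >   def accessLogDir(dir: String)(op: => Unit): Unit = {
>> > > > > > > > >>>>> > > > >> > > > > >     try op
>> > > > > > > > >>>>> > > > >> > > > > >     catch {
>> > > > > > > > >>>>> > > > >> > > > > >       case _: java.io.IOException =>
>> > > > > > > > >>>>> > > > >> > > > > >         markLogDirOffline(dir)  // replicas on this dir are now offline
>> > > > > > > > >>>>> > > > >> > > > > >         zkClient.createSequentialNode(
>> > > > > > > > >>>>> > > > >> > > > > >           "/log_dir_event_notification/event_",  // hypothetical znode path
>> > > > > > > > >>>>> > > > >> > > > > >           brokerId.toString)
>> > > > > > > > >>>>> > > > >> > > > > >     }
>> > > > > > > > >>>>> > > > >> > > > > >   }
>> > > > > > > > >>>>> > > > >> > > > > >
>> > > > > > > > >>>>> > > > >> > > > > >   private def markLogDirOffline(dir: String): Unit = {
>> > > > > > > > >>>>> > > > >> > > > > >     // bookkeeping omitted; subsequent LeaderAndIsrResponses carry an error
>> > > > > > > > >>>>> > > > >> > > > > >     // code for the replicas on this directory
>> > > > > > > > >>>>> > > > >> > > > > >   }
>> > > > > > > > >>>>> > > > >> > > > > > }
>> > > > > > > > >>>>> > > > >> > > > > >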
>> > > > > > > > >>>>> > > > >> > > > > > 2) The additional code is not much.
>> > > > > > > > >>>>> > > > >> > > > > >
>> > > > > > > > >>>>> > > > >> > > > > > I expect the code for KIP-112 to be
>> > > around
>> > > > > > 1100
>> > > > > > > > >>>>> lines new
>> > > > > > > > >>>>> > > > code.
>> > > > > > > > >>>>> > > > >> > > Previously
>> > > > > > > > >>>>> > > > >> > > > > > I have implemented a prototype of a
>> > > > slightly
>> > > > > > > > >>>>> different
>> > > > > > > > >>>>> > > design
>> > > > > > > > >>>>> > > > >> (see
>> > > > > > > > >>>>> > > > >> > > here
>> > > > > > > > >>>>> > > > >> > > > > > <https://docs.google.com/docum
>> > > > > > > > >>>>> ent/d/1Izza0SBmZMVUBUt9s_
>> > > > > > > > >>>>> > > > >> > > -Dqi3D8e0KGJQYW8xgEdRsgAI/edit>)
>> > > > > > > > >>>>> > > > >> > > > > > and uploaded it to github (see here
>> > > > > > > > >>>>> > > > >> > > > > > <https://github.com/lindong28/
>> > > > > kafka/tree/JBOD
>> > > > > > > >).
>> > > > > > > > >>>>> The
>> > > > > > > > >>>>> > patch
>> > > > > > > > >>>>> > > > >> changed
>> > > > > > > > >>>>> > > > >> > > 33
>> > > > > > > > >>>>> > > > >> > > > > > files, added 1185 lines and deleted
>> > 183
>> > > > > lines.
>> > > > > > > The
>> > > > > > > > >>>>> size of
>> > > > > > > > >>>>> > > > >> prototype
>> > > > > > > > >>>>> > > > >> > > patch
>> > > > > > > > >>>>> > > > >> > > > > > is actually smaller than patch of
>> > > KIP-107
>> > > > > (see
>> > > > > > > > here
>> > > > > > > > >>>>> > > > >> > > > > > <https://github.com/apache/
>> > > > kafka/pull/2476
>> > > > > >)
>> > > > > > > > which
>> > > > > > > > >>>>> is
>> > > > > > > > >>>>> > > already
>> > > > > > > > >>>>> > > > >> > > accepted.
>> > > > > > > > >>>>> > > > >> > > > > > The KIP-107 patch changed 49 files,
>> > > added
>> > > > > 1349
>> > > > > > > > >>>>> lines and
>> > > > > > > > >>>>> > > > >> deleted 141
>> > > > > > > > >>>>> > > > >> > > lines.
>> > > > > > > > >>>>> > > > >> > > > > >
>> > > > > > > > >>>>> > > > >> > > > > > 3) Comparison with one-broker-per-multiple-volumes
>> > > > > > > > >>>>> > > > >> > > > > >
>> > > > > > > > >>>>> > > > >> > > > > > This KIP can improve the
>> availability
>> > of
>> > > > > Kafka
>> > > > > > > in
>> > > > > > > > >>>>> this
>> > > > > > > > >>>>> > case
>> > > > > > > > >>>>> > > > such
>> > > > > > > > >>>>> > > > >> > > that one
>> > > > > > > > >>>>> > > > >> > > > > > failed volume doesn't bring down
>> the
>> > > > entire
>> > > > > > > > broker.
>> > > > > > > > >>>>> > > > >> > > > > >
>> > > > > > > > >>>>> > > > >> > > > > > 4) Comparison with
>> > one-broker-per-volume
>> > > > > > > > >>>>> > > > >> > > > > >
>> > > > > > > > >>>>> > > > >> > > > > > If each volume maps to multiple
>> disks,
>> > > > then
>> > > > > we
>> > > > > > > > >>>>> still have
>> > > > > > > > >>>>> > > > >> similar
>> > > > > > > > >>>>> > > > >> > > problem
>> > > > > > > > >>>>> > > > >> > > > > > such that the broker will fail if any disk of the volume fails.
>> > > > > > > > >>>>> > > > >> > > > > >
>> > > > > > > > >>>>> > > > >> > > > > > If each volume maps to one disk, it
>> > > means
>> > > > > that
>> > > > > > > we
>> > > > > > > > >>>>> need to
>> > > > > > > > >>>>> > > > >> deploy 10
>> > > > > > > > >>>>> > > > >> > > > > > brokers on a machine if the machine
>> > has
>> > > 10
>> > > > > > > disks.
>> > > > > > > > I
>> > > > > > > > >>>>> will
>> > > > > > > > >>>>> > > > >> explain the
>> > > > > > > > >>>>> > > > >> > > > > > concerns with this approach in
>> order of
>> > > > their
>> > > > > > > > >>>>> importance.
>> > > > > > > > >>>>> > > > >> > > > > >
>> > > > > > > > >>>>> > > > >> > > > > > - It is weird if we were to tell Kafka users to deploy 50 brokers on a
>> > > > > > > > >>>>> > > > >> > > > > > machine with 50 disks.
>> > > > > > > > >>>>> > > > >> > > > > >
>> > > > > > > > >>>>> > > > >> > > > > > - Either when users deploy Kafka on a commercial cloud platform or when they
>> > > > > > > > >>>>> > > > >> > > > > > deploy their own cluster, the size of the largest disk is usually limited.
>> > > > > > > > >>>>> > > > >> > > > > > There will be scenarios where users want to increase broker capacity by
>> > > > > > > > >>>>> > > > >> > > > > > having multiple disks per broker. This JBOD KIP makes it feasible without
>> > > > > > > > >>>>> > > > >> > > > > > hurting availability due to a single disk failure.
>> > > > > > > > >>>>> > > > >> > > > > >
>> > > > > > > > >>>>> > > > >> > > > > > - Automatic load rebalance across
>> > disks
>> > > > will
>> > > > > > be
>> > > > > > > > >>>>> easier and
>> > > > > > > > >>>>> > > > more
>> > > > > > > > >>>>> > > > >> > > flexible
>> > > > > > > > >>>>> > > > >> > > > > > if one broker has multiple disks.
>> This
>> > > can
>> > > > > be
>> > > > > > > > >>>>> future work.
>> > > > > > > > >>>>> > > > >> > > > > >
>> > > > > > > > >>>>> > > > >> > > > > > - There is a performance concern when you deploy 10 brokers vs. 1 broker on
>> > > > > > > > >>>>> > > > >> > > > > > one machine. The metadata of the cluster, including FetchRequest,
>> > > > > > > > >>>>> > > > >> > > > > > ProduceResponse, MetadataRequest and so on, will all be 10X more. The
>> > > > > > > > >>>>> > > > >> > > > > > packet-per-second will be 10X
>> higher
>> > > which
>> > > > > may
>> > > > > > > > limit
>> > > > > > > > >>>>> > > > >> performance if
>> > > > > > > > >>>>> > > > >> > > pps is
>> > > > > > > > >>>>> > > > >> > > > > > the performance bottleneck. The
>> number
>> > > of
>> > > > > > socket
>> > > > > > > > on
>> > > > > > > > >>>>> the
>> > > > > > > > >>>>> > > > machine
>> > > > > > > > >>>>> > > > >> is
>> > > > > > > > >>>>> > > > >> > > 10X
>> > > > > > > > >>>>> > > > >> > > > > > higher. And the number of
>> replication
>> > > > thread
>> > > > > > > will
>> > > > > > > > >>>>> be 100X
>> > > > > > > > >>>>> > > > more.
>> > > > > > > > >>>>> > > > >> The
>> > > > > > > > >>>>> > > > >> > > impact
>> > > > > > > > >>>>> > > > >> > > > > > will be more significant with
>> > increasing
>> > > > > > number
>> > > > > > > of
>> > > > > > > > >>>>> disks
>> > > > > > > > >>>>> > per
>> > > > > > > > >>>>> > > > >> > > machine. Thus
>> > > > > > > > >>>>> > > > >> > > > > > it will limit Kafka's scalability in the long term.
>> > > > > > > > >>>>> > > > >> > > > > >
>> > > > > > > > >>>>> > > > >> > > > > > Thanks,
>> > > > > > > > >>>>> > > > >> > > > > > Dong
>> > > > > > > > >>>>> > > > >> > > > > >
>> > > > > > > > >>>>> > > > >> > > > > >
>> > > > > > > > >>>>> > > > >> > > > > > On Tue, Feb 7, 2017 at 1:51 AM, Eno
>> > > > > Thereska <
>> > > > > > > > >>>>> > > > >> eno.there...@gmail.com
>> > > > > > > > >>>>> > > > >> > > >
>> > > > > > > > >>>>> > > > >> > > > > > wrote:
>> > > > > > > > >>>>> > > > >> > > > > >
>> > > > > > > > >>>>> > > > >> > > > > >> Hi Dong,
>> > > > > > > > >>>>> > > > >> > > > > >>
>> > > > > > > > >>>>> > > > >> > > > > >> To simplify the discussion today,
>> on
>> > my
>> > > > > part
>> > > > > > > I'll
>> > > > > > > > >>>>> zoom
>> > > > > > > > >>>>> > into
>> > > > > > > > >>>>> > > > one
>> > > > > > > > >>>>> > > > >> > > thing
>> > > > > > > > >>>>> > > > >> > > > > >> only:
>> > > > > > > > >>>>> > > > >> > > > > >>
>> > > > > > > > >>>>> > > > >> > > > > >> - I'll discuss the options called
>> > > below :
>> > > > > > > > >>>>> > > > >> "one-broker-per-disk" or
>> > > > > > > > >>>>> > > > >> > > > > >> "one-broker-per-few-disks".
>> > > > > > > > >>>>> > > > >> > > > > >>
>> > > > > > > > >>>>> > > > >> > > > > >> - I completely buy the JBOD vs
>> RAID
>> > > > > arguments
>> > > > > > > so
>> > > > > > > > >>>>> there is
>> > > > > > > > >>>>> > > no
>> > > > > > > > >>>>> > > > >> need to
>> > > > > > > > >>>>> > > > >> > > > > >> discuss that part for me. I buy it
>> > that
>> > > > > JBODs
>> > > > > > > are
>> > > > > > > > >>>>> good.
>> > > > > > > > >>>>> > > > >> > > > > >>
>> > > > > > > > >>>>> > > > >> > > > > >> I find the terminology can be
>> > improved
>> > > a
>> > > > > bit.
>> > > > > > > > >>>>> Ideally
>> > > > > > > > >>>>> > we'd
>> > > > > > > > >>>>> > > be
>> > > > > > > > >>>>> > > > >> > > talking
>> > > > > > > > >>>>> > > > >> > > > > >> about volumes, not disks. Just to
>> > make
>> > > it
>> > > > > > clear
>> > > > > > > > >>>>> that
>> > > > > > > > >>>>> > Kafka
>> > > > > > > > >>>>> > > > >> > > understands
>> > > > > > > > >>>>> > > > >> > > > > >> volumes/directories, not
>> individual
>> > raw
>> > > > > > disks.
>> > > > > > > So
>> > > > > > > > >>>>> by
>> > > > > > > > >>>>> > > > >> > > > > >> "one-broker-per-few-disks" what I
>> > mean
>> > > is
>> > > > > > that
>> > > > > > > > the
>> > > > > > > > >>>>> admin
>> > > > > > > > >>>>> > > can
>> > > > > > > > >>>>> > > > >> pool a
>> > > > > > > > >>>>> > > > >> > > few
>> > > > > > > > >>>>> > > > >> > > > > >> disks together to create a
>> > > > volume/directory
>> > > > > > and
>> > > > > > > > >>>>> give that
>> > > > > > > > >>>>> > > to
>> > > > > > > > >>>>> > > > >> Kafka.
>> > > > > > > > >>>>> > > > >> > > > > >>
>> > > > > > > > >>>>> > > > >> > > > > >>
>> > > > > > > > >>>>> > > > >> > > > > >> The kernel of my question will be
>> > that
>> > > > the
>> > > > > > > admin
>> > > > > > > > >>>>> already
>> > > > > > > > >>>>> > > has
>> > > > > > > > >>>>> > > > >> tools
>> > > > > > > > >>>>> > > > >> > > to 1)
>> > > > > > > > >>>>> > > > >> > > > > >> create volumes/directories from a
>> > JBOD
>> > > > and
>> > > > > 2)
>> > > > > > > > >>>>> start a
>> > > > > > > > >>>>> > > broker
>> > > > > > > > >>>>> > > > >> on a
>> > > > > > > > >>>>> > > > >> > > desired
>> > > > > > > > >>>>> > > > >> > > > > >> machine and 3) assign a broker
>> > > resources
>> > > > > > like a
>> > > > > > > > >>>>> > directory.
>> > > > > > > > >>>>> > > I
>> > > > > > > > >>>>> > > > >> claim
>> > > > > > > > >>>>> > > > >> > > that
>> > > > > > > > >>>>> > > > >> > > > > >> those tools are sufficient to
>> > optimise
>> > > > > > resource
>> > > > > > > > >>>>> > allocation.
>> > > > > > > > >>>>> > > > I
>> > > > > > > > >>>>> > > > >> > > understand
>> > > > > > > > >>>>> > > > >> > > > > >> that a broker could manage point
>> 3)
>> > > > itself,
>> > > > > > ie
>> > > > > > > > >>>>> juggle the
>> > > > > > > > >>>>> > > > >> > > directories. My
>> > > > > > > > >>>>> > > > >> > > > > >> question is whether the complexity
>> > > added
>> > > > to
>> > > > > > > Kafka
>> > > > > > > > >>>>> is
>> > > > > > > > >>>>> > > > justified.
>> > > > > > > > >>>>> > > > >> > > > > >> Operationally it seems to me an
>> admin
>> > > > will
>> > > > > > > still
>> > > > > > > > >>>>> have to
>> > > > > > > > >>>>> > do
>> > > > > > > > >>>>> > > > >> all the
>> > > > > > > > >>>>> > > > >> > > three
>> > > > > > > > >>>>> > > > >> > > > > >> items above.
>> > > > > > > > >>>>> > > > >> > > > > >>
>> > > > > > > > >>>>> > > > >> > > > > >> Looking forward to the discussion
>> > > > > > > > >>>>> > > > >> > > > > >> Thanks
>> > > > > > > > >>>>> > > > >> > > > > >> Eno
>> > > > > > > > >>>>> > > > >> > > > > >>
>> > > > > > > > >>>>> > > > >> > > > > >>
>> > > > > > > > >>>>> > > > >> > > > > >> > On 1 Feb 2017, at 17:21, Dong
>> Lin <
>> > > > > > > > >>>>> lindon...@gmail.com
>> > > > > > > > >>>>> > >
>> > > > > > > > >>>>> > > > >> wrote:
>> > > > > > > > >>>>> > > > >> > > > > >> >
>> > > > > > > > >>>>> > > > >> > > > > >> > Hey Eno,
>> > > > > > > > >>>>> > > > >> > > > > >> >
>> > > > > > > > >>>>> > > > >> > > > > >> > Thanks much for the review.
>> > > > > > > > >>>>> > > > >> > > > > >> >
>> > > > > > > > >>>>> > > > >> > > > > >> > I think your suggestion is to
>> split
>> > > > disks
>> > > > > > of
>> > > > > > > a
>> > > > > > > > >>>>> machine
>> > > > > > > > >>>>> > > into
>> > > > > > > > >>>>> > > > >> > > multiple
>> > > > > > > > >>>>> > > > >> > > > > >> disk
>> > > > > > > > >>>>> > > > >> > > > > >> > sets and run one broker per disk
>> > set.
>> > > > > Yeah
>> > > > > > > this
>> > > > > > > > >>>>> is
>> > > > > > > > >>>>> > > similar
>> > > > > > > > >>>>> > > > to
>> > > > > > > > >>>>> > > > >> > > Colin's
>> > > > > > > > >>>>> > > > >> > > > > >> > suggestion of
>> one-broker-per-disk,
>> > > > which
>> > > > > we
>> > > > > > > > have
>> > > > > > > > >>>>> > > evaluated
>> > > > > > > > >>>>> > > > at
>> > > > > > > > >>>>> > > > >> > > LinkedIn
>> > > > > > > > >>>>> > > > >> > > > > >> and
>> > > > > > > > >>>>> > > > >> > > > > >> > considered it to be a good short
>> > term
>> > > > > > > approach.
>> > > > > > > > >>>>> > > > >> > > > > >> >
>> > > > > > > > >>>>> > > > >> > > > > >> > As of now I don't think any of these approaches is a better
>> > > > > > > > >>>>> > > > >> > > alternative in
>> > > > > > > > >>>>> > > > >> > > > > >> > the long term. I will summarize
>> > these
>> > > > > > here. I
>> > > > > > > > >>>>> have put
>> > > > > > > > >>>>> > > > these
>> > > > > > > > >>>>> > > > >> > > reasons in
>> > > > > > > > >>>>> > > > >> > > > > >> the
>> > > > > > > > >>>>> > > > >> > > > > >> > KIP's motivation section and
>> > rejected
>> > > > > > > > alternative
>> > > > > > > > >>>>> > > section.
>> > > > > > > > >>>>> > > > I
>> > > > > > > > >>>>> > > > >> am
>> > > > > > > > >>>>> > > > >> > > happy to
>> > > > > > > > >>>>> > > > >> > > > > >> > discuss more and I would
>> certainly
>> > > like
>> > > > > to
>> > > > > > > use
>> > > > > > > > an
>> > > > > > > > >>>>> > > > alternative
>> > > > > > > > >>>>> > > > >> > > solution
>> > > > > > > > >>>>> > > > >> > > > > >> that
>> > > > > > > > >>>>> > > > >> > > > > >> > is easier to do with better
>> > > > performance.
>> > > > > > > > >>>>> > > > >> > > > > >> >
>> > > > > > > > >>>>> > > > >> > > > > >> > - JBOD vs. RAID-10: if we switch from RAID-10 with replication-factor=2 to
>> > > > > > > > >>>>> > > > >> > > > > >> > JBOD with replication-factor=3, we get a 25% reduction in disk usage and
>> > > > > > > > >>>>> > > > >> > > > > >> > double the tolerance of broker failures before data unavailability, from 1
>> > > > > > > > >>>>> > > > >> > > > > >> > to 2. This is a pretty huge gain for any company that uses Kafka at large
>> > > > > > > > >>>>> > > > >> > > > > >> > scale.
>> > > > > > > > >>>>> > > > >> > > > > >> >
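>> > > > > > > > >>>>> > > > >> > > > > >> > (Back-of-the-envelope check of these numbers, just to spell out the
>> > > > > > > > >>>>> > > > >> > > > > >> > arithmetic:
>> > > > > > > > >>>>> > > > >> > > > > >> >
>> > > > > > > > >>>>> > > > >> > > > > >> >   copies per message, RAID-10 + RF=2:  2 replicas x 2 mirrors = 4
>> > > > > > > > >>>>> > > > >> > > > > >> >   copies per message, JBOD + RF=3:     3
>> > > > > > > > >>>>> > > > >> > > > > >> >   disk usage reduction:                (4 - 3) / 4 = 25%
>> > > > > > > > >>>>> > > > >> > > > > >> >   broker failures tolerated (RF - 1):  1 -> 2 )
>> > > > > > > > >>>>> > > > >> > > > > >> >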
>> > > > > > > > >>>>> > > > >> > > > > >> > - JBOD vs. one-broker-per-disk:
>> The
>> > > > > benefit
>> > > > > > > of
>> > > > > > > > >>>>> > > > >> > > one-broker-per-disk is
>> > > > > > > > >>>>> > > > >> > > > > >> that
>> > > > > > > > >>>>> > > > >> > > > > >> > no major code change is needed
>> in
>> > > > Kafka.
>> > > > > > > Among
>> > > > > > > > >>>>> the
>> > > > > > > > >>>>> > > > >> disadvantages of
>> > > > > > > > >>>>> > > > >> > > > > >> > one-broker-per-disk summarized
>> in
>> > the
>> > > > KIP
>> > > > > > and
>> > > > > > > > >>>>> previous
>> > > > > > > > >>>>> > > > email
>> > > > > > > > >>>>> > > > >> with
>> > > > > > > > >>>>> > > > >> > > Colin,
>> > > > > > > > >>>>> > > > >> > > > > >> > the biggest one is the 15%
>> > throughput
>> > > > > loss
>> > > > > > > > >>>>> compared to
>> > > > > > > > >>>>> > > JBOD
>> > > > > > > > >>>>> > > > >> and
>> > > > > > > > >>>>> > > > >> > > less
>> > > > > > > > >>>>> > > > >> > > > > >> > flexibility to balance across
>> > disks.
>> > > > > > Further,
>> > > > > > > > it
>> > > > > > > > >>>>> > probably
>> > > > > > > > >>>>> > > > >> requires
>> > > > > > > > >>>>> > > > >> > > > > >> change
>> > > > > > > > >>>>> > > > >> > > > > >> > to internal deployment tools at
>> > > various
>> > > > > > > > >>>>> companies to
>> > > > > > > > >>>>> > deal
>> > > > > > > > >>>>> > > > >> with
>> > > > > > > > >>>>> > > > >> > > > > >> > one-broker-per-disk setup.
>> > > > > > > > >>>>> > > > >> > > > > >> >
>> > > > > > > > >>>>> > > > >> > > > > >> > - JBOD vs. RAID-0: This is the setup used at Microsoft. The problem is that
>> > > > > > > > >>>>> > > > >> > > > > >> > a broker becomes unavailable if any disk fails. Suppose
>> > > > > > > > >>>>> > > > >> > > > > >> > replication-factor=2 and there are 10 disks per machine. Then the
>> > > > > > > > >>>>> > > > >> > > > > >> > probability of any message becoming unavailable due to disk failure with
>> > > > > > > > >>>>> > > > >> > > > > >> > RAID-0 is 100X higher than that with JBOD.
>> > > > > > > > >>>>> > > > >> > > > > >> >
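>> > > > > > > > >>>>> > > > >> > > > > >> > (Roughly where the 100X comes from, with p = probability that a given disk
>> > > > > > > > >>>>> > > > >> > > > > >> > fails:
>> > > > > > > > >>>>> > > > >> > > > > >> >
>> > > > > > > > >>>>> > > > >> > > > > >> >   RAID-0 broker with 10 disks lost:          ~ 10p
>> > > > > > > > >>>>> > > > >> > > > > >> >   message (2 replicas) unavailable, RAID-0:  ~ (10p)^2 = 100 p^2
>> > > > > > > > >>>>> > > > >> > > > > >> >   message (2 replicas) unavailable, JBOD:    ~ p^2 (only its own disks matter)
>> > > > > > > > >>>>> > > > >> > > > > >> >   ratio:                                     ~ 100X )
>> > > > > > > > >>>>> > > > >> > > > > >> >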
>> > > > > > > > >>>>> > > > >> > > > > >> > - JBOD vs. one-broker-per-few-disks: one-broker-per-few-disks is somewhere
>> > > > > > > > >>>>> > > > >> > > > > >> > between one-broker-per-disk and RAID-0, so it carries an average of the
>> > > > > > > > >>>>> > > > >> > > > > >> > disadvantages of these two approaches.
>> > > > > > > > >>>>> > > > >> > > > > >> >
>> > > > > > > > >>>>> > > > >> > > > > >> > To answer your question, I think it is reasonable to manage disks in Kafka.
>> > > > > > > > >>>>> > > > >> > > > > >> > By "managing disks" we mean the management of the assignment of replicas
>> > > > > > > > >>>>> > > > >> > > > > >> > across disks. Here are my reasons in more detail:
>> > > > > > > > >>>>> > > > >> > > > > >> >
>> > > > > > > > >>>>> > > > >> > > > > >> > - I don't think this KIP is a
>> big
>> > > step
>> > > > > > > change.
>> > > > > > > > By
>> > > > > > > > >>>>> > > allowing
>> > > > > > > > >>>>> > > > >> user to
>> > > > > > > > >>>>> > > > >> > > > > >> > configure Kafka to run multiple
>> log
>> > > > > > > directories
>> > > > > > > > >>>>> or
>> > > > > > > > >>>>> > disks
>> > > > > > > > >>>>> > > as
>> > > > > > > > >>>>> > > > >> of
>> > > > > > > > >>>>> > > > >> > > now, it
>> > > > > > > > >>>>> > > > >> > > > > >> is
>> > > > > > > > >>>>> > > > >> > > > > >> > implicit that Kafka manages
>> disks.
>> > It
>> > > > is
>> > > > > > just
>> > > > > > > > >>>>> not a
>> > > > > > > > >>>>> > > > complete
>> > > > > > > > >>>>> > > > >> > > feature.
>> > > > > > > > >>>>> > > > >> > > > > >> > Microsoft and probably other
>> > > companies
>> > > > > are
>> > > > > > > > using
>> > > > > > > > >>>>> this
>> > > > > > > > >>>>> > > > feature
>> > > > > > > > >>>>> > > > >> > > under the
>> > > > > > > > >>>>> > > > >> > > > > >> > undesirable effect that a broker will fail if any disk fails.
>> > > > > > > > >>>>> > > > >> > > It is
>> > > > > > > > >>>>> > > > >> > > > > >> good
>> > > > > > > > >>>>> > > > >> > > > > >> > to complete this feature.
>> > > > > > > > >>>>> > > > >> > > > > >> >
>> > > > > > > > >>>>> > > > >> > > > > >> > - I think it is reasonable to
>> > manage
>> > > > disk
>> > > > > > in
>> > > > > > > > >>>>> Kafka. One
>> > > > > > > > >>>>> > > of
>> > > > > > > > >>>>> > > > >> the
>> > > > > > > > >>>>> > > > >> > > most
>> > > > > > > > >>>>> > > > >> > > > > >> > important work that Kafka is
>> doing
>> > is
>> > > > to
>> > > > > > > > >>>>> determine the
>> > > > > > > > >>>>> > > > >> replica
>> > > > > > > > >>>>> > > > >> > > > > >> assignment
>> > > > > > > > >>>>> > > > >> > > > > >> > across brokers and make sure
>> enough
>> > > > > copies
>> > > > > > > of a
>> > > > > > > > >>>>> given
>> > > > > > > > >>>>> > > > >> replica is
>> > > > > > > > >>>>> > > > >> > > > > >> available.
>> > > > > > > > >>>>> > > > >> > > > > >> > I would argue that it is not
>> much
>> > > > > different
>> > > > > > > > than
>> > > > > > > > >>>>> > > > determining
>> > > > > > > > >>>>> > > > >> the
>> > > > > > > > >>>>> > > > >> > > replica
>> > > > > > > > >>>>> > > > >> > > > > >> > assignment across disks, conceptually.
>> > > > > > > > >>>>> > > > >> > > > > >> >
>> > > > > > > > >>>>> > > > >> > > > > >> > - I would agree that this KIP improves the performance of
>> > > > > > > > >>>>> > > > >> Kafka at
>> > > > > > > > >>>>> > > > >> > > the
>> > > > > > > > >>>>> > > > >> > > > > >> cost
>> > > > > > > > >>>>> > > > >> > > > > >> > of more complexity inside
>> Kafka, by
>> > > > > > switching
>> > > > > > > > >>>>> from
>> > > > > > > > >>>>> > > RAID-10
>> > > > > > > > >>>>> > > > to
>> > > > > > > > >>>>> > > > >> > > JBOD. I
>> > > > > > > > >>>>> > > > >> > > > > >> would
>> > > > > > > > >>>>> > > > >> > > > > >> > argue that this is a right
>> > direction.
>> > > > If
>> > > > > we
>> > > > > > > can
>> > > > > > > > >>>>> gain
>> > > > > > > > >>>>> > 20%+
>> > > > > > > > >>>>> > > > >> > > performance by
>> > > > > > > > >>>>> > > > >> > > > > >> > managing NIC in Kafka as
>> compared
>> > to
>> > > > > > existing
>> > > > > > > > >>>>> approach
>> > > > > > > > >>>>> > > and
>> > > > > > > > >>>>> > > > >> other
>> > > > > > > > >>>>> > > > >> > > > > >> > alternatives, I would say we
>> should
>> > > > just
>> > > > > do
>> > > > > > > it.
>> > > > > > > > >>>>> Such a
>> > > > > > > > >>>>> > > gain
>> > > > > > > > >>>>> > > > >> in
>> > > > > > > > >>>>> > > > >> > > > > >> performance,
>> > > > > > > > >>>>> > > > >> > > > > >> > or equivalently reduction in
>> cost,
>> > > can
>> > > > > save
>> > > > > > > > >>>>> millions of
>> > > > > > > > >>>>> > > > >> dollars
>> > > > > > > > >>>>> > > > >> > > per year
>> > > > > > > > >>>>> > > > >> > > > > >> > for any company running Kafka at
>> > > large
>> > > > > > scale.
>> > > > > > > > >>>>> > > > >> > > > > >> >
>> > > > > > > > >>>>> > > > >> > > > > >> > Thanks,
>> > > > > > > > >>>>> > > > >> > > > > >> > Dong
>> > > > > > > > >>>>> > > > >> > > > > >> >
>> > > > > > > > >>>>> > > > >> > > > > >> >
>> > > > > > > > >>>>> > > > >> > > > > >> > On Wed, Feb 1, 2017 at 5:41 AM,
>> Eno
>> > > > > > Thereska
>> > > > > > > <
>> > > > > > > > >>>>> > > > >> > > eno.there...@gmail.com>
>> > > > > > > > >>>>> > > > >> > > > > >> wrote:
>> > > > > > > > >>>>> > > > >> > > > > >> >
>> > > > > > > > >>>>> > > > >> > > > > >> >> I'm coming somewhat late to the
>> > > > > > discussion,
>> > > > > > > > >>>>> apologies
>> > > > > > > > >>>>> > > for
>> > > > > > > > >>>>> > > > >> that.
>> > > > > > > > >>>>> > > > >> > > > > >> >>
>> > > > > > > > >>>>> > > > >> > > > > >> >> I'm worried about this
>> proposal.
>> > > It's
>> > > > > > moving
>> > > > > > > > >>>>> Kafka to
>> > > > > > > > >>>>> > a
>> > > > > > > > >>>>> > > > >> world
>> > > > > > > > >>>>> > > > >> > > where it
>> > > > > > > > >>>>> > > > >> > > > > >> >> manages disks. So in a sense,
>> the
>> > > > scope
>> > > > > of
>> > > > > > > the
>> > > > > > > > >>>>> KIP is
>> > > > > > > > >>>>> > > > >> limited,
>> > > > > > > > >>>>> > > > >> > > but the
>> > > > > > > > >>>>> > > > >> > > > > >> >> direction it sets for Kafka is
>> > > quite a
>> > > > > big
>> > > > > > > > step
>> > > > > > > > >>>>> > change.
>> > > > > > > > >>>>> > > > >> > > Fundamentally
>> > > > > > > > >>>>> > > > >> > > > > >> this
>> > > > > > > > >>>>> > > > >> > > > > >> >> is about balancing resources
>> for a
>> > > > Kafka
>> > > > > > > > >>>>> broker. This
>> > > > > > > > >>>>> > > can
>> > > > > > > > >>>>> > > > be
>> > > > > > > > >>>>> > > > >> > > done by a
>> > > > > > > > >>>>> > > > >> > > > > >> >> tool, rather than by changing
>> > Kafka.
>> > > > > E.g.,
>> > > > > > > the
>> > > > > > > > >>>>> tool
>> > > > > > > > >>>>> > > would
>> > > > > > > > >>>>> > > > >> take a
>> > > > > > > > >>>>> > > > >> > > bunch
>> > > > > > > > >>>>> > > > >> > > > > >> of
>> > > > > > > > >>>>> > > > >> > > > > >> >> disks together, create a volume
>> > over
>> > > > > them
>> > > > > > > and
>> > > > > > > > >>>>> export
>> > > > > > > > >>>>> > > that
>> > > > > > > > >>>>> > > > >> to a
>> > > > > > > > >>>>> > > > >> > > Kafka
>> > > > > > > > >>>>> > > > >> > > > > >> broker
>> > > > > > > > >>>>> > > > >> > > > > >> >> (in addition to setting the
>> memory
>> > > > > limits
>> > > > > > > for
>> > > > > > > > >>>>> that
>> > > > > > > > >>>>> > > broker
>> > > > > > > > >>>>> > > > or
>> > > > > > > > >>>>> > > > >> > > limiting
>> > > > > > > > >>>>> > > > >> > > > > >> other
>> > > > > > > > >>>>> > > > >> > > > > >> >> resources). A different bunch
>> of
>> > > disks
>> > > > > can
>> > > > > > > > then
>> > > > > > > > >>>>> make
>> > > > > > > > >>>>> > up
>> > > > > > > > >>>>> > > a
>> > > > > > > > >>>>> > > > >> second
>> > > > > > > > >>>>> > > > >> > > > > >> volume,
>> > > > > > > > >>>>> > > > >> > > > > >> >> and be used by another Kafka
>> > broker.
>> > > > > This
>> > > > > > is
>> > > > > > > > >>>>> aligned
>> > > > > > > > >>>>> > > with
>> > > > > > > > >>>>> > > > >> what
>> > > > > > > > >>>>> > > > >> > > Colin is
>> > > > > > > > >>>>> > > > >> > > > > >> >> saying (as I understand it).
>> > > > > > > > >>>>> > > > >> > > > > >> >>
>> > > > > > > > >>>>> > > > >> > > > > >> >> Disks are not the only resource
>> > on a
>> > > > > > > machine,
>> > > > > > > > >>>>> there
>> > > > > > > > >>>>> > are
>> > > > > > > > >>>>> > > > >> several
>> > > > > > > > >>>>> > > > >> > > > > >> instances
>> > > > > > > > >>>>> > > > >> > > > > >> >> where multiple NICs are used
>> for
>> > > > > example.
>> > > > > > Do
>> > > > > > > > we
>> > > > > > > > >>>>> want
>> > > > > > > > >>>>> > > fine
>> > > > > > > > >>>>> > > > >> grained
>> > > > > > > > >>>>> > > > >> > > > > >> >> management of all these
>> resources?
>> > > I'd
>> > > > > > argue
>> > > > > > > > >>>>> that
>> > > > > > > > >>>>> > opens up
>> > > > > > > > >>>>> > > > >> the
>> > > > > > > > >>>>> > > > >> > > system
>> > > > > > > > >>>>> > > > >> > > > > >> to a
>> > > > > > > > >>>>> > > > >> > > > > >> >> lot of complexity.
>> > > > > > > > >>>>> > > > >> > > > > >> >>
>> > > > > > > > >>>>> > > > >> > > > > >> >> Thanks
>> > > > > > > > >>>>> > > > >> > > > > >> >> Eno
>> > > > > > > > >>>>> > > > >> > > > > >> >>
>> > > > > > > > >>>>> > > > >> > > > > >> >>
>> > > > > > > > >>>>> > > > >> > > > > >> >>> On 1 Feb 2017, at 01:53, Dong
>> > Lin <
>> > > > > > > > >>>>> > lindon...@gmail.com
>> > > > > > > > >>>>> > > >
>> > > > > > > > >>>>> > > > >> wrote:
>> > > > > > > > >>>>> > > > >> > > > > >> >>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>> Hi all,
>> > > > > > > > >>>>> > > > >> > > > > >> >>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>> I am going to initiate the
>> vote
>> > If
>> > > > > there
>> > > > > > is
>> > > > > > > > no
>> > > > > > > > >>>>> > further
>> > > > > > > > >>>>> > > > >> concern
>> > > > > > > > >>>>> > > > >> > > with
>> > > > > > > > >>>>> > > > >> > > > > >> the
>> > > > > > > > >>>>> > > > >> > > > > >> >> KIP.
>> > > > > > > > >>>>> > > > >> > > > > >> >>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>> Thanks,
>> > > > > > > > >>>>> > > > >> > > > > >> >>> Dong
>> > > > > > > > >>>>> > > > >> > > > > >> >>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>> On Fri, Jan 27, 2017 at 8:08
>> PM,
>> > > > radai
>> > > > > <
>> > > > > > > > >>>>> > > > >> > > radai.rosenbl...@gmail.com>
>> > > > > > > > >>>>> > > > >> > > > > >> >> wrote:
>> > > > > > > > >>>>> > > > >> > > > > >> >>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> a few extra points:
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> 1. broker per disk might also
>> > > incur
>> > > > > more
>> > > > > > > > >>>>> client <-->
>> > > > > > > > >>>>> > > > >> broker
>> > > > > > > > >>>>> > > > >> > > sockets:
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> suppose every producer /
>> > consumer
>> > > > > > "talks"
>> > > > > > > to
>> > > > > > > > >>>>> >1
>> > > > > > > > >>>>> > > > partition,
>> > > > > > > > >>>>> > > > >> > > there's a
>> > > > > > > > >>>>> > > > >> > > > > >> >> very
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> good chance that partitions
>> that
>> > > > were
>> > > > > > > > >>>>> co-located on
>> > > > > > > > >>>>> > a
>> > > > > > > > >>>>> > > > >> single
>> > > > > > > > >>>>> > > > >> > > 10-disk
>> > > > > > > > >>>>> > > > >> > > > > >> >> broker
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> would now be split between
>> > several
>> > > > > > > > single-disk
>> > > > > > > > >>>>> > broker
>> > > > > > > > >>>>> > > > >> > > processes on
>> > > > > > > > >>>>> > > > >> > > > > >> the
>> > > > > > > > >>>>> > > > >> > > > > >> >> same
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> machine. hard to put a
>> > multiplier
>> > > on
>> > > > > > this,
>> > > > > > > > but
>> > > > > > > > >>>>> > likely
>> > > > > > > > >>>>> > > > >x1.
>> > > > > > > > >>>>> > > > >> > > sockets
>> > > > > > > > >>>>> > > > >> > > > > >> are a
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> limited resource at the OS
>> level
>> > > and
>> > > > > > incur
>> > > > > > > > >>>>> some
>> > > > > > > > >>>>> > memory
>> > > > > > > > >>>>> > > > >> cost
>> > > > > > > > >>>>> > > > >> > > (kernel
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> buffers)
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> 2. there's a memory overhead
>> to
>> > > > > spinning
>> > > > > > > up
>> > > > > > > > a
>> > > > > > > > >>>>> JVM
>> > > > > > > > >>>>> > > > >> (compiled
>> > > > > > > > >>>>> > > > >> > > code and
>> > > > > > > > >>>>> > > > >> > > > > >> >> byte
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> code objects etc). if we
>> assume
>> > > this
>> > > > > > > > overhead
>> > > > > > > > >>>>> is
>> > > > > > > > >>>>> > ~300
>> > > > > > > > >>>>> > > MB
>> > > > > > > > >>>>> > > > >> > > (order of
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> magnitude, specifics vary)
>> than
>> > > > > spinning
>> > > > > > > up
>> > > > > > > > >>>>> 10 JVMs
>> > > > > > > > >>>>> > > > would
>> > > > > > > > >>>>> > > > >> lose
>> > > > > > > > >>>>> > > > >> > > you 3
>> > > > > > > > >>>>> > > > >> > > > > >> GB
>> > > > > > > > >>>>> > > > >> > > > > >> >> of
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> RAM. not a ton, but non
>> > > negligible.
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> 3. there would also be some
>> > > overhead
>> > > > > > > > >>>>> downstream of
>> > > > > > > > >>>>> > > kafka
>> > > > > > > > >>>>> > > > >> in any
>> > > > > > > > >>>>> > > > >> > > > > >> >> management
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> / monitoring / log
>> aggregation
>> > > > system.
>> > > > > > > > likely
>> > > > > > > > >>>>> less
>> > > > > > > > >>>>> > > than
>> > > > > > > > >>>>> > > > >> x10
>> > > > > > > > >>>>> > > > >> > > though.
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> 4. (related to above) - added
>> > > > > complexity
>> > > > > > > of
>> > > > > > > > >>>>> > > > administration
>> > > > > > > > >>>>> > > > >> > > with more
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> running instances.
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> is anyone running kafka with
>> > > > anywhere
>> > > > > > near
>> > > > > > > > >>>>> 100GB
>> > > > > > > > >>>>> > > heaps?
>> > > > > > > > >>>>> > > > i
>> > > > > > > > >>>>> > > > >> > > thought the
>> > > > > > > > >>>>> > > > >> > > > > >> >> point
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> was to rely on kernel page
>> cache
>> > > to
>> > > > do
>> > > > > > the
>> > > > > > > > >>>>> disk
>> > > > > > > > >>>>> > > > buffering
>> > > > > > > > >>>>> > > > >> ....
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> On Thu, Jan 26, 2017 at 11:00
>> > AM,
>> > > > Dong
>> > > > > > > Lin <
>> > > > > > > > >>>>> > > > >> > > lindon...@gmail.com>
>> > > > > > > > >>>>> > > > >> > > > > >> wrote:
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>> Hey Colin,
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>> Thanks much for the comment.
>> > > Please
>> > > > > see
>> > > > > > > me
>> > > > > > > > >>>>> comment
>> > > > > > > > >>>>> > > > >> inline.
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>> On Thu, Jan 26, 2017 at
>> 10:15
>> > AM,
>> > > > > Colin
>> > > > > > > > >>>>> McCabe <
>> > > > > > > > >>>>> > > > >> > > cmcc...@apache.org>
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> wrote:
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> On Wed, Jan 25, 2017, at
>> > 13:50,
>> > > > Dong
>> > > > > > Lin
>> > > > > > > > >>>>> wrote:
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> Hey Colin,
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> Good point! Yeah we have
>> > > actually
>> > > > > > > > >>>>> considered and
>> > > > > > > > >>>>> > > > >> tested this
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> solution,
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> which we call
>> > > > one-broker-per-disk.
>> > > > > It
>> > > > > > > > >>>>> would work
>> > > > > > > > >>>>> > > and
>> > > > > > > > >>>>> > > > >> should
>> > > > > > > > >>>>> > > > >> > > > > >> require
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> no
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> major change in Kafka as
>> > > compared
>> > > > > to
>> > > > > > > this
>> > > > > > > > >>>>> JBOD
>> > > > > > > > >>>>> > KIP.
>> > > > > > > > >>>>> > > > So
>> > > > > > > > >>>>> > > > >> it
>> > > > > > > > >>>>> > > > >> > > would
>> > > > > > > > >>>>> > > > >> > > > > >> be a
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>> good
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> short term solution.
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> But it has a few drawbacks which make it less desirable in the long
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> term.
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> Assume we have 10 disks
>> on a
>> > > > > machine.
>> > > > > > > > Here
>> > > > > > > > >>>>> are
>> > > > > > > > >>>>> > the
>> > > > > > > > >>>>> > > > >> problems:
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> Hi Dong,
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> Thanks for the thoughtful
>> > reply.
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> 1) Our stress test result
>> > shows
>> > > > > that
>> > > > > > > > >>>>> > > > >> one-broker-per-disk
>> > > > > > > > >>>>> > > > >> > > has 15%
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> lower
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> throughput
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> 2) Controller would need
>> to
>> > > send
>> > > > > 10X
>> > > > > > as
>> > > > > > > > >>>>> many
>> > > > > > > > >>>>> > > > >> > > LeaderAndIsrRequest,
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> MetadataUpdateRequest and
>> > > > > > > > >>>>> StopReplicaRequest.
>> > > > > > > > >>>>> > This
>> > > > > > > > >>>>> > > > >> > > increases the
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> burden
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> on
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> controller which can be
>> the
>> > > > > > performance
>> > > > > > > > >>>>> > bottleneck.
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> Maybe I'm misunderstanding
>> > > > > something,
>> > > > > > > but
>> > > > > > > > >>>>> there
>> > > > > > > > >>>>> > > would
>> > > > > > > > >>>>> > > > >> not be
>> > > > > > > > >>>>> > > > >> > > 10x as
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> many
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> StopReplicaRequest RPCs,
>> would
>> > > > > there?
>> > > > > > > The
>> > > > > > > > >>>>> other
>> > > > > > > > >>>>> > > > >> requests
>> > > > > > > > >>>>> > > > >> > > would
>> > > > > > > > >>>>> > > > >> > > > > >> >>>> increase
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> 10x, but from a pretty low
>> > base,
>> > > > > > right?
>> > > > > > > > We
>> > > > > > > > >>>>> are
>> > > > > > > > >>>>> > not
>> > > > > > > > >>>>> > > > >> > > reassigning
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> partitions all the time, I
>> > hope
>> > > > (or
>> > > > > > else
>> > > > > > > > we
>> > > > > > > > >>>>> have
>> > > > > > > > >>>>> > > > bigger
>> > > > > > > > >>>>> > > > >> > > > > >> problems...)
>> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>
>
> I think the controller will group StopReplicaRequest per broker and
> send only one StopReplicaRequest to a broker during controlled
> shutdown. Anyway, we don't have to worry about this if we agree that
> other requests will increase by 10X. One MetadataRequest is sent to
> each broker in the cluster every time there is a leadership change. I
> am not sure this is a real problem, but in theory this makes the
> overhead complexity O(number of brokers) and may be a concern in the
> future. Ideally we should avoid it.
>
>
>>
>>> 3) Less efficient use of physical resources on the machine. The
>>> number of sockets on each machine will increase by 10X, and the
>>> number of connections between any two machines will increase by
>>> 100X, since each of the 10 brokers on one machine would connect to
>>> each of the 10 brokers on the other.
>>>
>>> 4) Less efficient way to manage memory and quota.
>>>
>>> 5) Rebalance between disks/brokers on the same machine will be less
>>> efficient and less flexible. A broker has to read data from another
>>> broker on the same machine via a socket. It is also harder to do
>>> automatic load balancing between disks on the same machine in the
>>> future.
>>>
>>> I will put this and the explanation in the rejected alternatives
>>> section. I have a few questions:
>>>
>>> - Can you explain why this solution can help avoid the scalability
>>> bottleneck? I actually think it will exacerbate the scalability
>>> problem due to 2) above.
>>> - Why can we push more RPCs with this solution?
>>
>> To really answer this question we'd have to take a deep dive into the
>> locking of the broker and figure out how effectively it can
>> parallelize truly independent requests. Almost every multithreaded
>> process is going to have shared state, like shared queues or shared
>> sockets, that is going to make scaling less than linear when you add
>> disks or processors. (And clearly, another option is to improve that
>> scalability, rather than going multi-process!)
>>
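As a rough illustration of the "less than linear" point above (hypothetical
numbers, not a Kafka measurement), an Amdahl-style estimate shows how even a
small serialized fraction on shared queues/sockets caps the benefit of adding
disks or processors to a single multi-threaded broker:

    // Sketch only: the 5% serial fraction is made up for illustration.
    public class SharedStateScaling {
        static double speedup(double serialFraction, int workers) {
            return 1.0 / (serialFraction + (1.0 - serialFraction) / workers);
        }

        public static void main(String[] args) {
            // If 5% of request handling were serialized on shared state,
            // 10 disks/threads would give roughly 6.9x rather than 10x.
            System.out.printf("speedup with 10 workers: %.1fx%n",
                    speedup(0.05, 10));
        }
    }

Whether the broker's actual serialized fraction is anywhere near that is
exactly the deep dive mentioned above.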
>
> Yeah, I also think it is better to improve scalability inside the
> Kafka code if possible. I am not sure we currently have any
> scalability issue inside Kafka that cannot be removed without going
> multi-process.
>
>
>>
>>> - It is true that a garbage collection in one broker would not
>>> affect others. But that is after every broker only uses 1/10 of the
>>> memory. Can we be sure that this will actually help performance?
>>
>> The big question is, how much memory do Kafka brokers use now, and
>> how much will they use in the future? Our experience in HDFS was that
>> once you start getting more than 100-200GB Java heap sizes, full GCs
>> start taking minutes to finish when using the standard JVMs. That
>> alone is a good reason to go multi-process or consider storing more
>> things off the Java heap.
>>
>
> I see. Now I agree one-broker-per-disk should be more efficient in
> terms of GC since each broker probably needs less than 1/10 of the
> memory available on a typical machine nowadays. I will remove this
> from the reasons for rejection.
>
>>
>> Disk failure is the "easy" case. The "hard" case, which is
>> unfortunately also the much more common case, is disk misbehavior.
>> Towards the end of their lives, disks tend to start slowing down
>> unpredictably. Requests that would have completed immediately before
>> start taking 20, 100, 500 milliseconds. Some files may be readable
>> and other files may not be. System calls hang, sometimes forever, and
>> the Java process can't abort them, because the hang is in the kernel.
>> It is not fun when threads are stuck in "D state"
>> http://stackoverflow.com/questions/20423521/process-perminantly-stuck-on-d-state
>> . Even kill -9 cannot abort the thread then. Fortunately, this is
>> rare.
>>
>
> I agree it is a harder problem and it is rare. We probably don't have
> to worry about it in this KIP since this issue is orthogonal to
> whether or not we use JBOD.
>
>>
>> Another approach we should consider is for Kafka to implement its own
>> storage layer that would stripe across multiple disks. This wouldn't
>> have to be done at the block level, but could be done at the file
>> level. We could use consistent hashing to determine which disks a
>> file should end up on, for example.
>>
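For what it's worth, a minimal sketch of the file-level striping idea (my own
illustration, not part of the KIP; the directory names, virtual-node count and
CRC32-based hash are arbitrary choices) could look like this:

    import java.nio.charset.StandardCharsets;
    import java.util.SortedMap;
    import java.util.TreeMap;
    import java.util.zip.CRC32;

    // Consistent-hashing ring over log directories: each disk gets many
    // virtual nodes, and a file is assigned to the first virtual node at
    // or after the file's hash on the ring.
    public class DiskRing {
        private final TreeMap<Long, String> ring = new TreeMap<>();

        public DiskRing(String[] logDirs, int virtualNodesPerDisk) {
            for (String dir : logDirs) {
                for (int i = 0; i < virtualNodesPerDisk; i++) {
                    ring.put(hash(dir + "#" + i), dir);
                }
            }
        }

        /** Returns the log directory that should hold the given file. */
        public String dirFor(String fileName) {
            SortedMap<Long, String> tail = ring.tailMap(hash(fileName));
            return tail.isEmpty() ? ring.firstEntry().getValue()
                                  : tail.get(tail.firstKey());
        }

        private static long hash(String s) {
            CRC32 crc = new CRC32();
            crc.update(s.getBytes(StandardCharsets.UTF_8));
            return crc.getValue();
        }

        public static void main(String[] args) {
            DiskRing ring = new DiskRing(
                    new String[]{"/data1", "/data2", "/data3"}, 100);
            System.out.println(
                    ring.dirFor("my-topic-0/00000000000000000000.log"));
        }
    }

The appeal would be that adding or removing a disk only remaps roughly 1/N of
the files, but as noted below this is a much bigger design discussion than the
KIP itself.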
>
> Are you suggesting that we should distribute logs, or log segments,
> across the disks of brokers? I am not sure I fully understand this
> approach. My gut feeling is that this would be a drastic solution that
> would require non-trivial design. While this may be useful to Kafka, I
> would prefer not to discuss it in detail in this thread unless you
> believe it is strictly superior to the design in this KIP in terms of
> solving our use-case.
>
>> best,
>> Colin
>>
>>> Thanks,
>>> Dong
>>>
>>> On Wed, Jan 25, 2017 at 11:34 AM, Colin McCabe <cmcc...@apache.org>
>>> wrote:
>>>
>>>> Hi Dong,
>>>>
>>>> Thanks for the writeup! It's very interesting.
>>>>
>>>> I apologize in advance if this has been discussed somewhere else,
>>>> but I am curious if you have considered the solution of running
>>>> multiple brokers per node. Clearly there is a memory overhead with
>>>> this solution because of the fixed cost of starting multiple JVMs.
>>>> However, running multiple JVMs would help avoid scalability
>>>> bottlenecks. You could probably push more RPCs per second, for
>>>> example. A garbage collection in one broker would not affect the
>>>> others. It would be interesting to see this considered in the
>>>> "alternate designs" section, even if you end up deciding it's not
>>>> the way to go.
>>>>
>>>> best,
>>>> Colin
>>>>
>>>> On Thu, Jan 12, 2017, at 10:46, Dong Lin wrote:
>>>>> Hi all,
>>>>>
>>>>> We created KIP-112: Handle disk failure for JBOD. Please find the
>>>>> KIP wiki in the link
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD.
>>>>>
>>>>> This KIP is related to KIP-113
>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories>:
>>>>> Support replicas movement between log directories. They are needed
>>>>> in order to support JBOD in Kafka. Please help review the KIP. Your
>>>>> feedback is appreciated!
>>>>>
>>>>> Thanks,
>>>>> Dong
