Hey Jun,

Thanks for your reply! Let me first comment on the things that you listed as
advantages of B over A.

1) No change in LeaderAndIsrRequest protocol.

I agree with this.

2) Step 1. One less round of LeaderAndIsrRequest and no additional ZK
writes to record the created flag.

I don't think this is true. There will be one round of LeaderAndIsrRequest
in both A and B. In design A the controller needs to write to ZK once to
record the replica as created. In design B the broker needs to write to ZK
once to record the replica as created. So the number of LeaderAndIsrRequests
and ZK writes is the same.

The broker needs to record created replicas in design B so that, when it
bootstraps with a failed log directory, it can derive the offline replicas
as the difference between the created replicas and the replicas found on
good log directories (sketched below).
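
To make this concrete, here is a rough sketch of the derivation I have in
mind. The names are made up for illustration; this is not actual Kafka code:

  object OfflineReplicaDerivation {
    case class TopicPartition(topic: String, partition: Int)

    // Offline replicas = replicas recorded as created (read back from ZK)
    // minus replicas found on the good log directories at startup.
    def deriveOfflineReplicas(created: Set[TopicPartition],
                              foundOnGoodDirs: Set[TopicPartition]): Set[TopicPartition] =
      created -- foundOnGoodDirs
  }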

3) Step 2. One less round of LeaderAndIsrRequest and no additional logic to
handle LeaderAndIsrResponse.

While I agree there is one less round of LeaderAndIsrRequest in design B, I
don't think one additional LeaderAndIsrRequest to handle log directory
failure is a big deal, given that log directory failure doesn't happen
frequently.

Also, while there is no additional logic to handle LeaderAndIsrResponse in
design B, I actually think this is something the controller should do
anyway. Say a broker stops responding to any requests without removing
itself from ZooKeeper; the only way for the controller to realize this and
re-elect the leader is to send a request to this broker and handle the
response. It is a problem that we don't do this as of now.

4) Step 6. Additional ZK reads proportional to # of failed log directories,
instead of # of partitions.

If one znode is able to describe all topic partitions in a log directory,
then the existing znode /brokers/topics/[topic] should be able to describe
the created replicas in addition to the assigned replicas for every
partition of the topic (see the sketch below). In this case, design A
requires no additional ZK reads, whereas design B requires ZK reads
proportional to the # of failed log directories.
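
For reference, this is roughly the format I floated earlier in this thread
for /brokers/topics/[topic] (abridged; the exact field names are of course
up for discussion):

  {
    "version" : 2,
    "created" : {
      "1" : [1, 2, 3],
      ...
    },
    "partition" : {
      "1" : [1, 2, 3],
      ...
    }
  }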

5) Step 3. In design A, if a broker is restarted and the failed log
directory is unreadable, the broker doesn't know which replicas are on the
failed log directory. So, when the broker receives the LeaderAndIsrRequest
with created = false, it's a bit hard for the broker to decide whether it
should create the missing replica on other log directories. This is easier
in design B since the list of failed replicas is persisted in ZK.

I don't understand why it is hard for the broker to make this decision in
design A. With design A, if a broker is started with a failed log directory
and it receives a LeaderAndIsrRequest with created=false for a replica that
cannot be found on any good log directory, the broker will not create this
replica (a rough sketch follows). Is there any drawback to this approach?
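
In other words, the rule I am proposing for design A is roughly the
following. This is an illustrative sketch only, not actual broker code, and
it assumes the create/created flag carried in the LeaderAndIsrRequest:

  object ReplicaCreationRule {
    // create: the create/created flag in the LeaderAndIsrRequest
    // foundOnGoodDir: whether the replica already exists on a good log directory
    def shouldCreateReplica(create: Boolean, foundOnGoodDir: Boolean): Boolean =
      if (foundOnGoodDir) false // the replica already exists, nothing to create
      else create               // create=true: create it; create=false: assume it
                                // sits on the failed log directory, so do nothing
  }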


Here is my summary of pros and cons of design B as compared to design A.

pros:

1) No change to LeaderAndIsrRequest.
2) One less round of LeaderAndIsrRequest in case of log directory failure.

cons:

1) It is impossible for the broker to figure out which failed log directory
an offline replica belongs to (i.e. which failed-log-directory/[directory]
znode it should go under) if multiple log directories are unreadable when
the broker starts.

2) The znode size limit of failed-log-directory/[directory] essentially
limits the number of topic partitions that can exist on a log directory
(see the rough numbers after this list). It becomes more of a problem when a
broker is configured to use multiple log directories, each of which is a
large-capacity RAID-10 volume. While this may not be a problem in practice
with an additional requirement (e.g. don't use more than one log directory
if using RAID-10), ideally we want to avoid such a limit.

3) Extra ZK reads of failed-log-directory/[directory] when the broker starts.
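
To put a rough number on 2): using the estimate from your earlier email of
~100 bytes per partition, 5K partitions already take ~500KB, so the default
1MB znode limit caps a single failed-log-directory/[directory] znode at
roughly 10K partitions.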


My main concern with design B is the use of the znode
/brokers/ids/[brokerId]/failed-log-directory/[directory]. I don't really
think the other pros/cons of design B matter to us. Does my summary make
sense?

Thanks,
Dong


On Thu, Feb 23, 2017 at 2:20 PM, Jun Rao <j...@confluent.io> wrote:

> Hi, Dong,
>
> Just so that we are on the same page. Let me spec out the alternative
> design a bit more and then compare. Let's call the current design A and the
> alternative design B.
>
> Design B:
>
> New ZK path
> failed log directory path (persistent): This is created by a broker when a
> log directory fails and is potentially removed when the broker is
> restarted.
> /brokers/ids/[brokerId]/failed-log-directory/directory1 => { json of the
> replicas in the log directory }.
>
> *1. Topic gets created*
> - Works the same as before.
>
> *2. A log directory stops working on a broker during runtime*
>
> - The controller watches the path /failed-log-directory for the new znode.
>
> - The broker detects an offline log directory during runtime and marks
> affected replicas as offline in memory.
>
> - The broker writes the failed directory and all replicas in the failed
> directory under /failed-log-directory/directory1.
>
> - The controller reads /failed-log-directory/directory1 and stores in
> memory a list of failed replicas due to disk failures.
>
> - The controller moves those replicas due to disk failure to offline state
> and triggers the state change in replica state machine.
>
>
> *3. Broker is restarted*
>
> - The broker reads /brokers/ids/[brokerId]/failed-log-directory, if any.
>
> - For each failed log directory it reads from ZK: if the log directory
> exists in log.dirs and is accessible now, or if the log directory no longer
> exists in log.dirs, the broker removes that log directory from
> failed-log-directory. Otherwise, the broker loads the replicas in the failed
> log directory into memory as offline.
>
> - The controller handles the failed log directory change event, if needed
> (same as #2).
>
> - The controller handles the broker registration event.
>
>
> *6. Controller failover*
> - Controller reads all child paths under /failed-log-directory to rebuild
> the list of failed replicas due to disk failures. Those replicas will be
> transitioned to the offline state during controller initialization.
>
> Comparing this with design A, I think the following are the things that
> design B simplifies.
> * No change in LeaderAndIsrRequest protocol.
> * Step 1. One less round of LeaderAndIsrRequest and no additional ZK writes
> to record the created flag.
> * Step 2. One less round of LeaderAndIsrRequest and no additional logic to
> handle LeaderAndIsrResponse.
> * Step 6. Additional ZK reads proportional to # of failed log directories,
> instead of # of partitions.
> * Step 3. In design A, if a broker is restarted and the failed log
> directory is unreadable, the broker doesn't know which replicas are on the
> failed log directory. So, when the broker receives the LeaderAndIsrRequest
> with created = false, it's a bit hard for the broker to decide whether it
> should create the missing replica on other log directories. This is easier
> in design B since the list of failed replicas is persisted in ZK.
>
> Now, for some of the other things that you mentioned.
>
> * What happens if a log directory is renamed?
> I think this can be handled in the same way as non-existing log directory
> during broker restart.
>
> * What happens if replicas are moved manually across disks?
> Good point. Well, if all log directories are available, the failed log
> directory path will be cleared. In the rarer case that a log directory is
> still offline and one of the replicas registered in the failed log
> directory shows up in another available log directory, I am not quite sure.
> Perhaps the simplest approach is to just error out and let the admin fix
> things manually?
>
> Thanks,
>
> Jun
>
>
>
> On Wed, Feb 22, 2017 at 3:39 PM, Dong Lin <lindon...@gmail.com> wrote:
>
> > Hey Jun,
> >
> > Thanks much for the explanation. I have some questions about 21 but that
> is
> > less important than 20. 20 would require considerable change to the KIP
> and
> > probably requires weeks to discuss again. Thus I would like to be very
> sure
> > that we agree on the problems with the current design as you mentioned
> and
> > there is no foreseeable problem with the alternate design.
> >
> > Please see below for my detailed response. To summarize my points, I
> > couldn't figure out any non-trivial drawback of the current design as
> > compared to the alternative design; and I couldn't figure out a good way
> > to store offline replicas in the alternative design. Can you see if these
> > points make sense? Thanks in advance for your time!!
> >
> >
> > 1) The alternative design requires slightly more dependency on ZK. While
> > both solutions store created replicas in ZK, the alternative design would
> > also store offline replicas in ZK, whereas the current design doesn't.
> >
> > 2) I am not sure that we should store offline replicas in the znode
> > /brokers/ids/[brokerId]/failed-log-directory/[directory]. We probably
> > don't want to expose log directory paths in ZooKeeper, based on the
> > principle that we should only store logical information (e.g. topic,
> > brokerId) in ZooKeeper path names. More specifically, we probably don't
> > want to rename a path in ZooKeeper simply because a user renamed a log
> > directory. And we probably don't want to read/write these znodes just
> > because a user manually moved replicas between log directories.
> >
> > 3) I couldn't find a good way to store offline replicas in ZK in the
> > alternative design. We can store this information in one znode per topic,
> > per brokerId, or per brokerId-topic. All these choices have their own
> > problems. If we store it in a per-topic znode, then multiple brokers may
> > need to read/write offline replicas in the same znode, which is generally
> > bad. If we store it per brokerId, then we effectively limit the maximum
> > number of topic partitions that can be stored on a broker by the znode
> > size limit. This contradicts the idea of expanding a single broker's
> > capacity by throwing in more disks. If we store it per brokerId-topic,
> > then when the controller starts, it needs to read on the order of
> > brokerId * topic znodes, which may double the overall znode reads during
> > controller startup.
> >
> > 4) The alternative design is less efficient than the current design in
> case
> > of log directory failure. The alternative design requires extra znode
> reads
> > in order to read offline replicas from zk while the current design
> requires
> > only one pair of LeaderAndIsrRequest and LeaderAndIsrResponse. The extra
> > znode reads will be proportional to the number of topics on the broker if
> > we store offline replicas per-brokerId-topic.
> >
> > 5) While I agree that the failure reporting should be done where the
> > failure originated, I think the current design is consistent with what we
> > are already doing. With the current design, the broker sends a
> > notification via ZooKeeper and the controller sends a LeaderAndIsrRequest
> > to the broker. This is similar to how a broker sends an ISR change
> > notification and the controller reads the latest ISR from the broker. If
> > we do want the broker to report failure directly to the controller, we
> > should probably have the broker send an RPC directly to the controller,
> > as it does for ControlledShutdownRequest. I can do this as well.
> >
> > 6) I don't think the current design requires additional state management
> > in each piece of the existing state handling, such as topic creation or
> > controller failover. All this existing logic should stay exactly the same,
> > except that the controller should recognize offline replicas on live
> > brokers instead of assuming all replicas on live brokers are live. But
> > this additional change is required in both the current design and the
> > alternate design. Thus there should be no difference between the current
> > design and the alternate design with respect to this existing state
> > handling logic in the controller.
> >
> > 7) While I agree that the current design requires additional complexity
> > in the controller in order to handle LeaderAndIsrResponse and potentially
> > change partition and replica state to offline in the state machines, I
> > think such logic is necessary in a well-designed controller, either with
> > the alternate design or even without JBOD. The controller should be able
> > to handle errors (e.g. ClusterAuthorizationException) in
> > LeaderAndIsrResponse, and in every response in general. For example, if
> > the controller hasn't received a LeaderAndIsrResponse after a given period
> > of time, it probably means the broker has hung, and the controller should
> > consider this broker offline and re-elect leaders on other brokers. This
> > would actually fix a problem we have seen before at LinkedIn, where a
> > broker hangs due to RAID-controller failure. In other words, I think it is
> > a good idea for the controller to handle responses.
> >
> > 8) I am not sure that the additional state management to handle
> > LeaderAndIsrResponse causes new types of synchronization. It is true that
> > the logic is not handled by the ZK event handling thread. But the existing
> > ControlledShutdownRequest is also not handled by the ZK event handling
> > thread. The LeaderAndIsrResponse can be handled by the same thread that
> > currently handles ControlledShutdownRequest, so that we don't require a
> > new type of synchronization. Further, it should be rare to require
> > additional synchronization to handle LeaderAndIsrResponse, because we only
> > need synchronization when there are offline replicas.
> >
> >
> > Thanks,
> > Dong
> >
> > On Wed, Feb 22, 2017 at 10:36 AM, Jun Rao <j...@confluent.io> wrote:
> >
> > > Hi, Dong, Jiangjie,
> > >
> > > 20. (1) I agree that ideally we'd like to use direct RPC for
> > > broker-to-broker communication instead of ZK. However, in the
> alternative
> > > design, the failed log directory path also serves as the persistent
> state
> > > for remembering the offline partitions. This is similar to the
> > > controller_managed_state path in your design. The difference is that
> the
> > > alternative design stores the state in fewer ZK paths, which helps
> reduce
> > > the controller failover time. (2) I agree that we want the controller
> to
> > be
> > > the single place to make decisions. However, intuitively, the failure
> > > reporting should be done where the failure is originated. For example,
> > if a
> > > broker fails, the broker reports failure by de-registering from ZK. The
> > > failed log directory path is similar in that regard. (3) I am not
> worried
> > > about the additional load from extra LeaderAndIsrRequest. What I worry
> > > about is any unnecessary additional complexity in the controller. To
> me,
> > > the additional complexity in the current design is the additional state
> > > management in each of the existing state handling (e.g., topic
> creation,
> > > controller failover, etc), and the additional synchronization since the
> > > additional state management is not initiated from the ZK event handling
> > > thread.
> > >
> > > 21. One of the reasons that we need to send a StopReplicaRequest to an
> > > offline replica is to handle controlled shutdown. In that case, a broker
> > > is still alive, but indicates to the controller that it plans to shut
> > > down. Being able to stop the replica on the shutting-down broker reduces
> > > churn in the ISR. So, for simplicity, it's probably easier to always send
> > > a StopReplicaRequest to any offline replica.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > >
> > > On Tue, Feb 21, 2017 at 2:37 PM, Dong Lin <lindon...@gmail.com> wrote:
> > >
> > > > Hey Jun,
> > > >
> > > > Thanks much for your comments.
> > > >
> > > > I actually proposed the design to store both offline replicas and
> > created
> > > > replicas in per-broker znode before switching to the design in the
> > > current
> > > > KIP. The current design stores created replicas in per-partition
> znode
> > > and
> > > > transmits offline replicas via LeaderAndIsrResponse. The original
> > > solution
> > > > is roughly the same as what you suggested. The advantage of the
> current
> > > > solution is kind of philosophical: 1) we want to transmit data (e.g.
> > > > offline replicas) using RPC and reduce dependency on zookeeper; 2) we
> > > want
> > > > controller to be the only one that determines any state (e.g. offline
> > > > replicas) that will be exposed to user. The advantage of the solution
> > to
> > > > store offline replica in zookeeper is that we can save one roundtrip
> > time
> > > > for controller to handle log directory failure. However, this extra
> > > > roundtrip time should not be a big deal since the log directory
> failure
> > > is
> > > > rare and inefficiency of extra latency is less of a problem when
> there
> > is
> > > > log directory failure.
> > > >
> > > > Do you think the two philosophical advantages of the current KIP make
> > > > sense? If not, then I can switch to the original design that stores
> > > offline
> > > > replicas in zookeeper. It is actually written already. One
> disadvantage
> > > is
> > > > that we have to make non-trivial changes to the KIP (e.g. no create
> > > > flag in LeaderAndIsrRequest and no created flag in ZooKeeper) and
> > > > restart this KIP discussion.
> > > >
> > > > Regarding 21, it seems to me that LeaderAndIsrRequest/StopReplicaRequest
> > > > only makes sense when the broker can make the choice (e.g. whether to
> > > > fetch data for this replica or not). In the case that the log directory
> > > > of the replica is already offline, the broker has to stop fetching data
> > > > for this replica regardless of what the controller tells it to do. Thus
> > > > it seems cleaner for the broker to stop fetching data for this replica
> > > > immediately. The advantage of this solution is that the controller
> > > > logic is simpler, since it doesn't need to send a StopReplicaRequest in
> > > > case of log directory failure, and the log4j log is also cleaner. Is
> > > > there a specific advantage to having the controller tell the broker to
> > > > stop fetching data for offline replicas?
> > > >
> > > > Regarding 22, I agree with your observation that this can happen. I
> > > > will update the KIP and specify that the broker will exit with a proper
> > > > error message in the log, and that the user needs to manually remove
> > > > the partitions and restart the broker.
> > > >
> > > > Thanks!
> > > > Dong
> > > >
> > > >
> > > >
> > > > On Mon, Feb 20, 2017 at 10:17 PM, Jun Rao <j...@confluent.io> wrote:
> > > >
> > > > > Hi, Dong,
> > > > >
> > > > > Sorry for the delay. A few more comments.
> > > > >
> > > > > 20. One complexity that I found in the current KIP is that the way
> > the
> > > > > broker communicates failed replicas to the controller is
> inefficient.
> > > > When
> > > > > a log directory fails, the broker only sends an indication through
> ZK
> > > to
> > > > > the controller and the controller has to issue a
> LeaderAndIsrRequest
> > to
> > > > > discover which replicas are offline due to log directory failure.
> An
> > > > > alternative approach is that when a log directory fails, the broker
> > > just
> > > > > writes the failed directory and the corresponding topic
> > partitions
> > > > in a
> > > > > new failed log directory ZK path like the following.
> > > > >
> > > > > Failed log directory path:
> > > > > /brokers/ids/[brokerId]/failed-log-directory/directory1 => { json
> of
> > > the
> > > > > topic partitions in the log directory }.
> > > > >
> > > > > The controller just watches for child changes in
> > > > > /brokers/ids/[brokerId]/failed-log-directory.
> > > > > After reading this path, the broker knows the exact set of replicas
> > > that
> > > > > are offline and can trigger that replica state change accordingly.
> > This
> > > > > saves an extra round of LeaderAndIsrRequest handling.
> > > > >
> > > > > With this new ZK path, we can probably get rid of
> > > > > /broker/topics/[topic]/partitions/[partitionId]/controller_managed_state.
> > > > > The creation
> of a
> > > new
> > > > > replica is expected to always succeed unless all log directories
> > fail,
> > > in
> > > > > which case, the broker goes down anyway. Then, during controller
> > > > failover,
> > > > > the controller just needs to additionally read from ZK the extra
> > failed
> > > > log
> > > > > directory paths, which is many fewer than topics or partitions.
> > > > >
> > > > > On broker startup, if a log directory becomes available, the
> > > > corresponding
> > > > > log directory path in ZK will be removed.
> > > > >
> > > > > The downside of this approach is that the value of this new ZK path
> > can
> > > > be
> > > > > large. However, even with 5K partition per log directory and 100
> > bytes
> > > > per
> > > > > partition, the size of the value is 500KB, still less than the
> > default
> > > > 1MB
> > > > > znode limit in ZK.
> > > > >
> > > > > 21. "Broker will remove offline replica from its replica fetcher
> > > > threads."
> > > > > The proposal lets the broker remove the replica from the replica
> > > fetcher
> > > > > thread when it detects a directory failure. An alternative is to
> only
> > > do
> > > > > that once the broker receives the LeaderAndIsrRequest/
> > > > StopReplicaRequest.
> > > > > The benefit of this is that the controller is the only one who
> > decides
> > > > > which replica to be removed from the replica fetcher threads. The
> > > broker
> > > > > also doesn't need additional logic to remove the replica from
> replica
> > > > > fetcher threads. The downside is that in a small window, the
> replica
> > > > fetch
> > > > > thread will keep writing to the failed log directory and may
> pollute
> > > the
> > > > > log4j log.
> > > > >
> > > > > 22. In the current design, there is a potential corner case issue
> > that
> > > > the
> > > > > same partition may exist in more than one log directory at some
> > point.
> > > > > Consider the following steps: (1) a new topic t1 is created and the
> > > > > controller sends LeaderAndIsrRequest to a broker; (2) the broker
> > > creates
> > > > > partition t1-p1 in log dir1; (3) before the broker sends a
> response,
> > it
> > > > > goes down; (4) the broker is restarted with log dir1 unreadable;
> (5)
> > > the
> > > > > broker receives a new LeaderAndIsrRequest and creates partition
> t1-p1
> > > on
> > > > > log dir2; (6) at some point, the broker is restarted with log dir1
> > > fixed.
> > > > > Now partition t1-p1 is in two log dirs. The alternative approach
> > that I
> > > > > suggested above may suffer from a similar corner case issue. Since
> > this
> > > > is
> > > > > rare, if the broker detects this during broker startup, it can
> > probably
> > > > > just log an error and exit. The admin can remove the redundant
> > > partitions
> > > > > manually and then restart the broker.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Sat, Feb 18, 2017 at 9:31 PM, Dong Lin <lindon...@gmail.com>
> > wrote:
> > > > >
> > > > > > Hey Jun,
> > > > > >
> > > > > > Could you please let me know if the solutions above could address
> > > your
> > > > > > concern? I really want to move the discussion forward.
> > > > > >
> > > > > > Thanks,
> > > > > > Dong
> > > > > >
> > > > > >
> > > > > > On Tue, Feb 14, 2017 at 8:17 PM, Dong Lin <lindon...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > Hey Jun,
> > > > > > >
> > > > > > > Thanks for all your help and time to discuss this KIP. When you
> > get
> > > > the
> > > > > > > time, could you let me know if the previous answers address the
> > > > > concern?
> > > > > > >
> > > > > > > I think the more interesting question in your last email is
> where
> > > we
> > > > > > > should store the "created" flag in ZK. I proposed the solution
> > > that I
> > > > > > like
> > > > > > > most, i.e. store it together with the replica assignment data
> in
> > > the
> > > > > > /brokers/topics/[topic].
> > > > > > > In order to expedite discussion, let me provide another two
> ideas
> > > to
> > > > > > > address the concern just in case the first idea doesn't work:
> > > > > > >
> > > > > > > - We can avoid extra controller ZK read when there is no disk
> > > failure
> > > > > > > (95% of time?). When controller starts, it doesn't
> > > > > > > read controller_managed_state in ZK and sends
> LeaderAndIsrRequest
> > > > with
> > > > > > > "create = false". Only if LeaderAndIsrResponse shows failure
> for
> > > any
> > > > > > > replica, then controller will read controller_managed_state for
> > > this
> > > > > > > partition and re-send LeaderAndIsrRequset with "create=true" if
> > > this
> > > > > > > replica has not been created.
> > > > > > >
> > > > > > > - We can significantly reduce this ZK read time by making
> > > > > > > controller_managed_state a topic level information in ZK, e.g.
> > > > > > > /brokers/topics/[topic]/state. Given that most topic has 10+
> > > > partition,
> > > > > > > the extra ZK read time should be less than 10% of the existing
> > > total
> > > > zk
> > > > > > > read time during controller failover.
> > > > > > >
> > > > > > > Thanks!
> > > > > > > Dong
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Feb 14, 2017 at 7:30 AM, Dong Lin <lindon...@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > >> Hey Jun,
> > > > > > >>
> > > > > > >> I just realized that you may be suggesting that a tool for
> > listing
> > > > > > >> offline directories is necessary for KIP-112 by asking whether
> > > > KIP-112
> > > > > > and
> > > > > > >> KIP-113 will be in the same release. I think such a tool is
> > useful
> > > > but
> > > > > > >> doesn't have to be included in KIP-112. This is because as of
> > now
> > > > > admin
> > > > > > >> needs to log into broker machine and check broker log to
> figure
> > > out
> > > > > the
> > > > > > >> cause of broker failure and the bad log directory in case of
> > disk
> > > > > > failure.
> > > > > > >> The KIP-112 won't make it harder since admin can still figure
> > out
> > > > the
> > > > > > bad
> > > > > > >> log directory by doing the same thing. Thus it is probably OK
> to
> > > > just
> > > > > > >> include this script in KIP-113. Regardless, my hope is to
> finish
> > > > both
> > > > > > KIPs
> > > > > > >> ASAP and make them in the same release since both KIPs are
> > needed
> > > > for
> > > > > > the
> > > > > > >> JBOD setup.
> > > > > > >>
> > > > > > >> Thanks,
> > > > > > >> Dong
> > > > > > >>
> > > > > > >> On Mon, Feb 13, 2017 at 5:52 PM, Dong Lin <
> lindon...@gmail.com>
> > > > > wrote:
> > > > > > >>
> > > > > > >>> And the test plan has also been updated to simulate disk
> > failure
> > > by
> > > > > > >>> changing log directory permission to 000.
> > > > > > >>>
> > > > > > >>> On Mon, Feb 13, 2017 at 5:50 PM, Dong Lin <
> lindon...@gmail.com
> > >
> > > > > wrote:
> > > > > > >>>
> > > > > > >>>> Hi Jun,
> > > > > > >>>>
> > > > > > >>>> Thanks for the reply. These comments are very helpful. Let
> me
> > > > answer
> > > > > > >>>> them inline.
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> On Mon, Feb 13, 2017 at 3:25 PM, Jun Rao <j...@confluent.io>
> > > > wrote:
> > > > > > >>>>
> > > > > > >>>>> Hi, Dong,
> > > > > > >>>>>
> > > > > > >>>>> Thanks for the reply. A few more replies and new comments
> > > below.
> > > > > > >>>>>
> > > > > > >>>>> On Fri, Feb 10, 2017 at 4:27 PM, Dong Lin <
> > lindon...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >>>>>
> > > > > > >>>>> > Hi Jun,
> > > > > > >>>>> >
> > > > > > >>>>> > Thanks for the detailed comments. Please see answers
> > inline:
> > > > > > >>>>> >
> > > > > > >>>>> > On Fri, Feb 10, 2017 at 3:08 PM, Jun Rao <
> j...@confluent.io
> > >
> > > > > wrote:
> > > > > > >>>>> >
> > > > > > >>>>> > > Hi, Dong,
> > > > > > >>>>> > >
> > > > > > >>>>> > > Thanks for the updated wiki. A few comments below.
> > > > > > >>>>> > >
> > > > > > >>>>> > > 1. Topics get created
> > > > > > >>>>> > > 1.1 Instead of storing successfully created replicas in
> > ZK,
> > > > > could
> > > > > > >>>>> we
> > > > > > >>>>> > store
> > > > > > >>>>> > > unsuccessfully created replicas in ZK? Since the latter
> > is
> > > > less
> > > > > > >>>>> common,
> > > > > > >>>>> > it
> > > > > > >>>>> > > probably reduces the load on ZK.
> > > > > > >>>>> > >
> > > > > > >>>>> >
> > > > > > >>>>> > We can store unsuccessfully created replicas in ZK. But I
> > am
> > > > not
> > > > > > >>>>> sure if
> > > > > > >>>>> > that can reduce write load on ZK.
> > > > > > >>>>> >
> > > > > > >>>>> > If we want to reduce write load on ZK using by store
> > > > > unsuccessfully
> > > > > > >>>>> created
> > > > > > >>>>> > replicas in ZK, then broker should not write to ZK if all
> > > > > replicas
> > > > > > >>>>> are
> > > > > > >>>>> > successfully created. It means that if
> > > > > > /broker/topics/[topic]/partiti
> > > > > > >>>>> > ons/[partitionId]/controller_managed_state doesn't exist
> > in
> > > ZK
> > > > > for
> > > > > > >>>>> a given
> > > > > > >>>>> > partition, we have to assume all replicas of this
> partition
> > > > have
> > > > > > been
> > > > > > >>>>> > successfully created and send LeaderAndIsrRequest with
> > > create =
> > > > > > >>>>> false. This
> > > > > > >>>>> > becomes a problem if controller crashes before receiving
> > > > > > >>>>> > LeaderAndIsrResponse to validate whether a replica has
> been
> > > > > > created.
> > > > > > >>>>> >
> > > > > > >>>>> > I think this approach and reduce the number of bytes
> stored
> > > in
> > > > > ZK.
> > > > > > >>>>> But I am
> > > > > > >>>>> > not sure if this is a concern.
> > > > > > >>>>> >
> > > > > > >>>>> >
> > > > > > >>>>> >
> > > > > > >>>>> I was mostly concerned about the controller failover time.
> > > > > Currently,
> > > > > > >>>>> the
> > > > > > >>>>> controller failover is likely dominated by the cost of
> > reading
> > > > > > >>>>> topic/partition level information from ZK. If we add
> another
> > > > > > partition
> > > > > > >>>>> level path in ZK, it probably will double the controller
> > > failover
> > > > > > >>>>> time. If
> > > > > > >>>>> the approach of representing the non-created replicas
> doesn't
> > > > work,
> > > > > > >>>>> have
> > > > > > >>>>> you considered just adding the created flag in the
> > leaderAndIsr
> > > > > path
> > > > > > >>>>> in ZK?
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>> Yes, I have considered adding the created flag in the
> > > leaderAndIsr
> > > > > > path
> > > > > > >>>> in ZK. If we were to add created flag per replica in the
> > > > > > >>>> LeaderAndIsrRequest, then it requires a lot of change in the
> > > code
> > > > > > base.
> > > > > > >>>>
> > > > > > >>>> If we don't add created flag per replica in the
> > > > LeaderAndIsrRequest,
> > > > > > >>>> then the information in leaderAndIsr path in ZK and
> > > > > > LeaderAndIsrRequest
> > > > > > >>>> would be different. Further, the procedure for broker to
> > update
> > > > ISR
> > > > > > in ZK
> > > > > > >>>> will be a bit complicated. When leader updates leaderAndIsr
> > path
> > > > in
> > > > > > ZK, it
> > > > > > >>>> will have to first read created flags from ZK, change isr,
> and
> > > > write
> > > > > > >>>> leaderAndIsr back to ZK. And it needs to check znode version
> > and
> > > > > > re-try
> > > > > > >>>> write operation in ZK if controller has updated ZK during
> this
> > > > > > period. This
> > > > > > >>>> is in contrast to the current implementation where the
> leader
> > > > either
> > > > > > gets
> > > > > > >>>> all the information from LeaderAndIsrRequest sent by
> > controller,
> > > > or
> > > > > > >>>> determine the infromation by itself (e.g. ISR), before
> writing
> > > to
> > > > > > >>>> leaderAndIsr path in ZK.
> > > > > > >>>>
> > > > > > >>>> It seems to me that the above solution is a bit complicated
> > and
> > > > not
> > > > > > >>>> clean. Thus I come up with the design in this KIP to store
> > this
> > > > > > created
> > > > > > >>>> flag in a separate zk path. The path is named
> > > > > > controller_managed_state to
> > > > > > >>>> indicate that we can store in this znode all information
> that
> > is
> > > > > > managed by
> > > > > > >>>> controller only, as opposed to ISR.
> > > > > > >>>>
> > > > > > >>>> I agree with your concern of increased ZK read time during
> > > > > controller
> > > > > > >>>> failover. How about we store the "created" information in
> the
> > > > > > >>>> znode /brokers/topics/[topic]? We can change that znode to
> > have
> > > > the
> > > > > > >>>> following data format:
> > > > > > >>>>
> > > > > > >>>> {
> > > > > > >>>>   "version" : 2,
> > > > > > >>>>   "created" : {
> > > > > > >>>>     "1" : [1, 2, 3],
> > > > > > >>>>     ...
> > > > > > >>>>   }
> > > > > > >>>>   "partition" : {
> > > > > > >>>>     "1" : [1, 2, 3],
> > > > > > >>>>     ...
> > > > > > >>>>   }
> > > > > > >>>> }
> > > > > > >>>>
> > > > > > >>>> We won't have extra zk read using this solution. It also
> seems
> > > > > > >>>> reasonable to put the partition assignment information
> > together
> > > > with
> > > > > > >>>> replica creation information. The latter is only changed
> once
> > > > after
> > > > > > the
> > > > > > >>>> partition is created or re-assigned.
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>> >
> > > > > > >>>>> > > 1.2 If an error is received for a follower, does the
> > > > controller
> > > > > > >>>>> eagerly
> > > > > > >>>>> > > remove it from ISR or do we just let the leader removes
> > it
> > > > > after
> > > > > > >>>>> timeout?
> > > > > > >>>>> > >
> > > > > > >>>>> >
> > > > > > >>>>> > No, Controller will not actively remove it from ISR. But
> > > > > controller
> > > > > > >>>>> will
> > > > > > >>>>> > recognize it as offline replica and propagate this
> > > information
> > > > to
> > > > > > all
> > > > > > >>>>> > brokers via UpdateMetadataRequest. Each leader can use
> this
> > > > > > >>>>> information to
> > > > > > >>>>> > actively remove offline replica from ISR set. I have
> > updated
> > > to
> > > > > > wiki
> > > > > > >>>>> to
> > > > > > >>>>> > clarify it.
> > > > > > >>>>> >
> > > > > > >>>>> >
> > > > > > >>>>>
> > > > > > >>>>> That seems inconsistent with how the controller deals with
> > > > offline
> > > > > > >>>>> replicas
> > > > > > >>>>> due to broker failures. When that happens, the broker will
> > (1)
> > > > > select
> > > > > > >>>>> a new
> > > > > > >>>>> leader if the offline replica is the leader; (2) remove the
> > > > replica
> > > > > > >>>>> from
> > > > > > >>>>> ISR if the offline replica is the follower. So,
> intuitively,
> > it
> > > > > seems
> > > > > > >>>>> that
> > > > > > >>>>> we should be doing the same thing when dealing with offline
> > > > > replicas
> > > > > > >>>>> due to
> > > > > > >>>>> disk failure.
> > > > > > >>>>>
> > > > > > >>>>
> > > > > > >>>> My bad. I misunderstand how the controller currently handles
> > > > broker
> > > > > > >>>> failure and ISR change. Yes we should do the same thing when
> > > > dealing
> > > > > > with
> > > > > > >>>> offline replicas here. I have updated the KIP to specify
> that,
> > > > when
> > > > > an
> > > > > > >>>> offline replica is discovered by controller, the controller
> > > > removes
> > > > > > offline
> > > > > > >>>> replicas from ISR in the ZK and sends LeaderAndIsrRequest
> with
> > > > > > updated ISR
> > > > > > >>>> to be used by partition leaders.
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>> >
> > > > > > >>>>> > > 1.3 Similar, if an error is received for a leader,
> should
> > > the
> > > > > > >>>>> controller
> > > > > > >>>>> > > trigger leader election again?
> > > > > > >>>>> > >
> > > > > > >>>>> >
> > > > > > >>>>> > Yes, controller will trigger leader election if leader
> > > replica
> > > > is
> > > > > > >>>>> offline.
> > > > > > >>>>> > I have updated the wiki to clarify it.
> > > > > > >>>>> >
> > > > > > >>>>> >
> > > > > > >>>>> > >
> > > > > > >>>>> > > 2. A log directory stops working on a broker during
> > > runtime:
> > > > > > >>>>> > > 2.1 It seems the broker remembers the failed directory
> > > after
> > > > > > >>>>> hitting an
> > > > > > >>>>> > > IOException and the failed directory won't be used for
> > > > creating
> > > > > > new
> > > > > > >>>>> > > partitions until the broker is restarted? If so, could
> > you
> > > > add
> > > > > > >>>>> that to
> > > > > > >>>>> > the
> > > > > > >>>>> > > wiki.
> > > > > > >>>>> > >
> > > > > > >>>>> >
> > > > > > >>>>> > Right, broker assumes a log directory to be good after it
> > > > starts,
> > > > > > >>>>> and mark
> > > > > > >>>>> > log directory as bad once there is IOException when
> broker
> > > > > attempts
> > > > > > >>>>> to
> > > > > > >>>>> > access the log directory. New replicas will only be
> created
> > > on
> > > > > good
> > > > > > >>>>> log
> > > > > > >>>>> > directory. I just added this to the KIP.
> > > > > > >>>>> >
> > > > > > >>>>> >
> > > > > > >>>>> > > 2.2 Could you be a bit more specific on how and during
> > > which
> > > > > > >>>>> operation
> > > > > > >>>>> > the
> > > > > > >>>>> > > broker detects directory failure? Is it when the broker
> > > hits
> > > > an
> > > > > > >>>>> > IOException
> > > > > > >>>>> > > during writes, or both reads and writes?  For example,
> > > during
> > > > > > >>>>> broker
> > > > > > >>>>> > > startup, it only reads from each of the log
> directories,
> > if
> > > > it
> > > > > > >>>>> hits an
> > > > > > >>>>> > > IOException there, does the broker immediately mark the
> > > > > directory
> > > > > > >>>>> as
> > > > > > >>>>> > > offline?
> > > > > > >>>>> > >
> > > > > > >>>>> >
> > > > > > >>>>> > Broker marks log directory as bad once there is
> IOException
> > > > when
> > > > > > >>>>> broker
> > > > > > >>>>> > attempts to access the log directory. This includes read
> > and
> > > > > write.
> > > > > > >>>>> These
> > > > > > >>>>> > operations include log append, log read, log cleaning,
> > > > watermark
> > > > > > >>>>> checkpoint
> > > > > > >>>>> > etc. If broker hits IOException when it reads from each
> of
> > > the
> > > > > log
> > > > > > >>>>> > directory during startup, it immediately mark the
> directory
> > > as
> > > > > > >>>>> offline.
> > > > > > >>>>> >
> > > > > > >>>>> > I just updated the KIP to clarify it.
> > > > > > >>>>> >
> > > > > > >>>>> >
> > > > > > >>>>> > > 3. Partition reassignment: If we know a replica is
> > offline,
> > > > do
> > > > > we
> > > > > > >>>>> still
> > > > > > >>>>> > > want to send StopReplicaRequest to it?
> > > > > > >>>>> > >
> > > > > > >>>>> >
> > > > > > >>>>> > No, controller doesn't send StopReplicaRequest for an
> > offline
> > > > > > >>>>> replica.
> > > > > > >>>>> > Controller treats this scenario in the same way that
> > exiting
> > > > > Kafka
> > > > > > >>>>> > implementation does when the broker of this replica is
> > > offline.
> > > > > > >>>>> >
> > > > > > >>>>> >
> > > > > > >>>>> > >
> > > > > > >>>>> > > 4. UpdateMetadataRequestPartitionState: For
> > > > offline_replicas,
> > > > > do
> > > > > > >>>>> they
> > > > > > >>>>> > only
> > > > > > >>>>> > > include offline replicas due to log directory failures
> or
> > > do
> > > > > they
> > > > > > >>>>> also
> > > > > > >>>>> > > include offline replicas due to broker failure?
> > > > > > >>>>> > >
> > > > > > >>>>> >
> > > > > > >>>>> > UpdateMetadataRequestPartitionState's offline_replicas
> > > include
> > > > > > >>>>> offline
> > > > > > >>>>> > replicas due to both log directory failure and broker
> > > failure.
> > > > > This
> > > > > > >>>>> is to
> > > > > > >>>>> > make the semantics of this field easier to understand.
> > Broker
> > > > can
> > > > > > >>>>> > distinguish whether a replica is offline due to broker
> > > failure
> > > > or
> > > > > > >>>>> disk
> > > > > > >>>>> > failure by checking whether a broker is live in the
> > > > > > >>>>> UpdateMetadataRequest.
> > > > > > >>>>> >
> > > > > > >>>>> >
> > > > > > >>>>> > >
> > > > > > >>>>> > > 5. Tools: Could we add some kind of support in the tool
> > to
> > > > list
> > > > > > >>>>> offline
> > > > > > >>>>> > > directories?
> > > > > > >>>>> > >
> > > > > > >>>>> >
> > > > > > >>>>> > In KIP-112 we don't have tools to list offline
> directories
> > > > > because
> > > > > > >>>>> we have
> > > > > > >>>>> > intentionally avoided exposing log directory information
> > > (e.g.
> > > > > log
> > > > > > >>>>> > directory path) to user or other brokers. I think we can
> > add
> > > > this
> > > > > > >>>>> feature
> > > > > > >>>>> > in KIP-113, in which we will have DescribeDirsRequest to
> > list
> > > > log
> > > > > > >>>>> directory
> > > > > > >>>>> > information (e.g. partition assignment, path, size)
> needed
> > > for
> > > > > > >>>>> rebalance.
> > > > > > >>>>> >
> > > > > > >>>>> >
> > > > > > >>>>> Since we are introducing a new failure mode, if a replica
> > > becomes
> > > > > > >>>>> offline
> > > > > > >>>>> due to failure in log directories, the first thing an admin
> > > wants
> > > > > to
> > > > > > >>>>> know
> > > > > > >>>>> is which log directories are offline from the broker's
> > > > perspective.
> > > > > > >>>>> So,
> > > > > > >>>>> including such a tool will be useful. Do you plan to do
> > KIP-112
> > > > and
> > > > > > >>>>> KIP-113
> > > > > > >>>>>  in the same release?
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>> Yes, I agree that including such a tool is using. This is
> > > probably
> > > > > > >>>> better to be added in KIP-113 because we need
> > > DescribeDirsRequest
> > > > to
> > > > > > get
> > > > > > >>>> this information. I will update KIP-113 to include this
> tool.
> > > > > > >>>>
> > > > > > >>>> I plan to do KIP-112 and KIP-113 separately to make each KIP
> > and
> > > > > their
> > > > > > >>>> patch easier to review. I don't have any plan about which
> > > release
> > > > to
> > > > > > have
> > > > > > >>>> these KIPs. My plan is to both of them ASAP. Is there
> > particular
> > > > > > timeline
> > > > > > >>>> you prefer for code of these two KIPs to checked-in?
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>>> >
> > > > > > >>>>> > >
> > > > > > >>>>> > > 6. Metrics: Could we add some metrics to show offline
> > > > > > directories?
> > > > > > >>>>> > >
> > > > > > >>>>> >
> > > > > > >>>>> > Sure. I think it makes sense to have each broker report
> its
> > > > > number
> > > > > > of
> > > > > > >>>>> > offline replicas and offline log directories. The
> previous
> > > > metric
> > > > > > >>>>> was put
> > > > > > >>>>> > in KIP-113. I just added both metrics in KIP-112.
> > > > > > >>>>> >
> > > > > > >>>>> >
> > > > > > >>>>> > >
> > > > > > >>>>> > > 7. There are still references to kafka-log-dirs.sh. Are
> > > they
> > > > > > valid?
> > > > > > >>>>> > >
> > > > > > >>>>> >
> > > > > > >>>>> > My bad. I just removed this from "Changes in Operational
> > > > > > Procedures"
> > > > > > >>>>> and
> > > > > > >>>>> > "Test Plan" in the KIP.
> > > > > > >>>>> >
> > > > > > >>>>> >
> > > > > > >>>>> > >
> > > > > > >>>>> > > 8. Do you think KIP-113 is ready for review? One thing
> > that
> > > > > > KIP-113
> > > > > > >>>>> > > mentions during partition reassignment is to first send
> > > > > > >>>>> > > LeaderAndIsrRequest, followed by
> ChangeReplicaDirRequest.
> > > It
> > > > > > seems
> > > > > > >>>>> it's
> > > > > > >>>>> > > better if the replicas are created in the right log
> > > directory
> > > > > in
> > > > > > >>>>> the
> > > > > > >>>>> > first
> > > > > > >>>>> > > place? The reason that I brought it up here is because
> it
> > > may
> > > > > > >>>>> affect the
> > > > > > >>>>> > > protocol of LeaderAndIsrRequest.
> > > > > > >>>>> > >
> > > > > > >>>>> >
> > > > > > >>>>> > Yes, KIP-113 is ready for review. The advantage of the
> > > current
> > > > > > >>>>> design is
> > > > > > >>>>> > that we can keep LeaderAndIsrRequest
> > log-direcotry-agnostic.
> > > > The
> > > > > > >>>>> > implementation would be much easier to read if all log
> > > related
> > > > > > logic
> > > > > > >>>>> (e.g.
> > > > > > >>>>> > various errors) are put in ChangeReplicadIRrequest and
> the
> > > code
> > > > > > path
> > > > > > >>>>> of
> > > > > > >>>>> > handling replica movement is separated from leadership
> > > > handling.
> > > > > > >>>>> >
> > > > > > >>>>> > In other words, I think Kafka may be easier to develop in
> > the
> > > > > long
> > > > > > >>>>> term if
> > > > > > >>>>> > we separate these two requests.
> > > > > > >>>>> >
> > > > > > >>>>> > I agree that ideally we want to create replicas in the
> > right
> > > > log
> > > > > > >>>>> directory
> > > > > > >>>>> > in the first place. But I am not sure if there is any
> > > > performance
> > > > > > or
> > > > > > >>>>> > correctness concern with the existing way of moving it
> > after
> > > it
> > > > > is
> > > > > > >>>>> created.
> > > > > > >>>>> > Besides, does this decision affect the change proposed in
> > > > > KIP-112?
> > > > > > >>>>> >
> > > > > > >>>>> >
> > > > > > >>>>> I am just wondering if you have considered including the
> log
> > > > > > directory
> > > > > > >>>>> for
> > > > > > >>>>> the replicas in the LeaderAndIsrRequest.
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>> Yeah I have thought about this idea, but only briefly. I
> > > rejected
> > > > > this
> > > > > > >>>> idea because log directory is broker's local information
> and I
> > > > > prefer
> > > > > > not
> > > > > > >>>> to expose local config information to the cluster through
> > > > > > >>>> LeaderAndIsrRequest.
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>>> 9. Could you describe when the offline replicas due to log
> > > > > directory
> > > > > > >>>>> failure are removed from the replica fetch threads?
> > > > > > >>>>>
> > > > > > >>>>
> > > > > > >>>> Yes. If the offline replica was a leader, either a new
> leader
> > is
> > > > > > >>>> elected or all follower brokers will stop fetching for this
> > > > > > partition. If
> > > > > > >>>> the offline replica is a follower, the broker will stop
> > fetching
> > > > for
> > > > > > this
> > > > > > >>>> replica immediately. A broker stops fetching data for a
> > replica
> > > by
> > > > > > removing
> > > > > > >>>> the replica from the replica fetch threads. I have updated
> the
> > > KIP
> > > > > to
> > > > > > >>>> clarify it.
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>>>
> > > > > > >>>>> 10. The wiki mentioned changing the log directory to a file
> > for
> > > > > > >>>>> simulating
> > > > > > >>>>> disk failure in system tests. Could we just change the
> > > permission
> > > > > of
> > > > > > >>>>> the
> > > > > > >>>>> log directory to 000 to simulate that?
> > > > > > >>>>>
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> Sure,
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>>>
> > > > > > >>>>> Thanks,
> > > > > > >>>>>
> > > > > > >>>>> Jun
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>> > > Jun
> > > > > > >>>>> > >
> > > > > > >>>>> > > On Fri, Feb 10, 2017 at 9:53 AM, Dong Lin <
> > > > lindon...@gmail.com
> > > > > >
> > > > > > >>>>> wrote:
> > > > > > >>>>> > >
> > > > > > >>>>> > > > Hi Jun,
> > > > > > >>>>> > > >
> > > > > > >>>>> > > > Can I replace zookeeper access with direct RPC for
> both
> > > ISR
> > > > > > >>>>> > notification
> > > > > > >>>>> > > > and disk failure notification in a future KIP, or do
> > you
> > > > feel
> > > > > > we
> > > > > > >>>>> should
> > > > > > >>>>> > > do
> > > > > > >>>>> > > > it in this KIP?
> > > > > > >>>>> > > >
> > > > > > >>>>> > > > Hi Eno, Grant and everyone,
> > > > > > >>>>> > > >
> > > > > > >>>>> > > > Is there further improvement you would like to see
> with
> > > > this
> > > > > > KIP?
> > > > > > >>>>> > > >
> > > > > > >>>>> > > > Thanks you all for the comments,
> > > > > > >>>>> > > >
> > > > > > >>>>> > > > Dong
> > > > > > >>>>> > > >
> > > > > > >>>>> > > >
> > > > > > >>>>> > > >
> > > > > > >>>>> > > > On Thu, Feb 9, 2017 at 4:45 PM, Dong Lin <
> > > > > lindon...@gmail.com>
> > > > > > >>>>> wrote:
> > > > > > >>>>> > > >
> > > > > > >>>>> > > > >
> > > > > > >>>>> > > > >
> > > > > > >>>>> > > > > On Thu, Feb 9, 2017 at 3:37 PM, Colin McCabe <
> > > > > > >>>>> cmcc...@apache.org>
> > > > > > >>>>> > > wrote:
> > > > > > >>>>> > > > >
> > > > > > >>>>> > > > >> On Thu, Feb 9, 2017, at 11:40, Dong Lin wrote:
> > > > > > >>>>> > > > >> > Thanks for all the comments Colin!
> > > > > > >>>>> > > > >> >
> > > > > > >>>>> > > > >> > To answer your questions:
> > > > > > >>>>> > > > >> > - Yes, a broker will shutdown if all its log
> > > > directories
> > > > > > >>>>> are bad.
> > > > > > >>>>> > > > >>
> > > > > > >>>>> > > > >> That makes sense.  Can you add this to the
> writeup?
> > > > > > >>>>> > > > >>
> > > > > > >>>>> > > > >
> > > > > > >>>>> > > > > Sure. This has already been added. You can find it
> > here
> > > > > > >>>>> > > > > <https://cwiki.apache.org/confluence/pages/
> > > > > diffpagesbyversio
> > > > > > >>>>> n.action
> > > > > > >>>>> > ?
> > > > > > >>>>> > > > pageId=67638402&selectedPageVersions=9&
> > > > selectedPageVersions=
> > > > > > 10>
> > > > > > >>>>> > > > > .
> > > > > > >>>>> > > > >
> > > > > > >>>>> > > > >
> > > > > > >>>>> > > > >>
> > > > > > >>>>> > > > >> > - I updated the KIP to explicitly state that a
> log
> > > > > > >>>>> directory will
> > > > > > >>>>> > be
> > > > > > >>>>> > > > >> > assumed to be good until broker sees IOException
> > > when
> > > > it
> > > > > > >>>>> tries to
> > > > > > >>>>> > > > access
> > > > > > >>>>> > > > >> > the log directory.
> > > > > > >>>>> > > > >>
> > > > > > >>>>> > > > >> Thanks.
> > > > > > >>>>> > > > >>
> > > > > > >>>>> > > > >> > - Controller doesn't explicitly know whether
> there
> > > is
> > > > > new
> > > > > > >>>>> log
> > > > > > >>>>> > > > directory
> > > > > > >>>>> > > > >> > or
> > > > > > >>>>> > > > >> > not. All controller knows is whether replicas
> are
> > > > online
> > > > > > or
> > > > > > >>>>> > offline
> > > > > > >>>>> > > > >> based
> > > > > > >>>>> > > > >> > on LeaderAndIsrResponse. According to the
> existing
> > > > Kafka
> > > > > > >>>>> > > > implementation,
> > > > > > >>>>> > > > >> > controller will always send LeaderAndIsrRequest
> > to a
> > > > > > broker
> > > > > > >>>>> after
> > > > > > >>>>> > it
> > > > > > >>>>> > > > >> > bounces.
> > > > > > >>>>> > > > >>
> > > > > > >>>>> > > > >> I thought so.  It's good to clarify, though.  Do
> you
> > > > think
> > > > > > >>>>> it's
> > > > > > >>>>> > worth
> > > > > > >>>>> > > > >> adding a quick discussion of this on the wiki?
> > > > > > >>>>> > > > >>
> > > > > > >>>>> > > > >
> > > > > > >>>>> > > > > Personally I don't think it is needed. If broker
> > starts
> > > > > with
> > > > > > >>>>> no bad
> > > > > > >>>>> > log
> > > > > > >>>>> > > > > directory, everything should work it is and we
> should
> > > not
> > > > > > need
> > > > > > >>>>> to
> > > > > > >>>>> > > clarify
> > > > > > >>>>> > > > > it. The KIP has already covered the scenario when a
> > > > broker
> > > > > > >>>>> starts
> > > > > > >>>>> > with
> > > > > > >>>>> > > > bad
> > > > > > >>>>> > > > > log directory. Also, the KIP doesn't claim or hint
> > that
> > > > we
> > > > > > >>>>> support
> > > > > > >>>>> > > > dynamic
> > > > > > >>>>> > > > > addition of new log directories. I think we are
> good.
> > > > > > >>>>> > > > >
> > > > > > >>>>> > > > >
> > > > > > >>>>> > > > >> best,
> > > > > > >>>>> > > > >> Colin
> > > > > > >>>>> > > > >>
> > > > > > >>>>> > > > >> >
> > > > > > >>>>> > > > >> > Please see this
> > > > > > >>>>> > > > >> > <https://cwiki.apache.org/conf
> > > > > > >>>>> luence/pages/diffpagesbyversio
> > > > > > >>>>> > > > >> n.action?pageId=67638402&selectedPageVersions=9&
> > > > > > >>>>> > > > selectedPageVersions=10>
> > > > > > >>>>> > > > >> > for the change of the KIP.
> > > > > > >>>>> > > > >> >
> > > > > > >>>>> > > > >> > On Thu, Feb 9, 2017 at 11:04 AM, Colin McCabe <
> > > > > > >>>>> cmcc...@apache.org
> > > > > > >>>>> > >
> > > > > > >>>>> > > > >> wrote:
> > > > > > >>>>> > > > >> >
> > > > > > >>>>> > > > >> > > On Thu, Feb 9, 2017, at 11:03, Colin McCabe
> > wrote:
> > > > > > >>>>> > > > >> > > > Thanks, Dong L.
> > > > > > >>>>> > > > >> > > >
> > > > > > >>>>> > > > >> > > > Do we plan on bringing down the broker
> process
> > > > when
> > > > > > all
> > > > > > >>>>> log
> > > > > > >>>>> > > > >> directories
> > > > > > >>>>> > > > >> > > > are offline?
> > > > > > >>>>> > > > >> > > >
> > > > > > >>>>> > > > >> > > > Can you explicitly state on the KIP that the
> > log
> > > > > dirs
> > > > > > >>>>> are all
> > > > > > >>>>> > > > >> considered
> > > > > > >>>>> > > > >> > > > good after the broker process is bounced?
> It
> > > > seems
> > > > > > >>>>> like an
> > > > > > >>>>> > > > >> important
> > > > > > >>>>> > > > >> > > > thing to be clear about.  Also, perhaps
> > discuss
> > > > how
> > > > > > the
> > > > > > >>>>> > > controller
> > > > > > >>>>> > > > >> > > > becomes aware of the newly good log
> > directories
> > > > > after
> > > > > > a
> > > > > > >>>>> broker
> > > > > > >>>>> > > > >> bounce
> > > > > > >>>>> > > > >> > > > (and whether this triggers re-election).
> > > > > > >>>>> > > > >> > >
> > > > > > >>>>> > > > >> > > I meant to write, all the log dirs where the
> > > broker
> > > > > can
> > > > > > >>>>> still
> > > > > > >>>>> > read
> > > > > > >>>>> > > > the
> > > > > > >>>>> > > > >> > > index and some other files.  Clearly, log dirs
> > > that
> > > > > are
> > > > > > >>>>> > completely
> > > > > > >>>>> > > > >> > > inaccessible will still be considered bad
> after
> > a
> > > > > broker
> > > > > > >>>>> process
> > > > > > >>>>> > > > >> bounce.
> > > > > > >>>>> > > > >> > >
> > > > > > >>>>> > > > >> > > best,
> > > > > > >>>>> > > > >> > > Colin
> > > > > > >>>>> > > > >> > >
> > > > > > >>>>> > > > >> > > >
> > > > > > >>>>> > > > >> > > > +1 (non-binding) aside from that
> > > > > > >>>>> > > > >> > > >
> > > > > > >>>>> > > > >> > > >
> > > > > > >>>>> > > > >> > > >
> > > > > > >>>>> > > > >> > > > On Wed, Feb 8, 2017, at 00:47, Dong Lin wrote:
> > > > > > >>>>> > > > >> > > > > Hi all,
> > > > > > >>>>> > > > >> > > > >
> > > > > > >>>>> > > > >> > > > > Thank you all for the helpful suggestions. I have updated the KIP to
> > > > > > >>>>> > > > >> > > > > address the comments received so far. See here
> > > > > > >>>>> > > > >> > > > > <https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=67638402&selectedPageVersions=8&selectedPageVersions=9>
> > > > > > >>>>> > > > >> > > > > to read the changes of the KIP. Here is a summary of the changes:
> > > > > > >>>>> > > > >> > > > >
> > > > > > >>>>> > > > >> > > > > - Updated the Proposed Change section to change the recovery steps.
> > > > > > >>>>> > > > >> > > > > After this change, the broker will also create replicas as long as all
> > > > > > >>>>> > > > >> > > > > log directories are working.
> > > > > > >>>>> > > > >> > > > > - Removed kafka-log-dirs.sh from this KIP since the user no longer
> > > > > > >>>>> > > > >> > > > > needs it for recovery from bad disks.
> > > > > > >>>>> > > > >> > > > > - Explained how the znode controller_managed_state is managed in the
> > > > > > >>>>> > > > >> > > > > Public interface section.
> > > > > > >>>>> > > > >> > > > > - Explained what happens during controller failover, partition
> > > > > > >>>>> > > > >> > > > > reassignment and topic deletion in the Proposed Change section.
> > > > > > >>>>> > > > >> > > > > - Updated the Future Work section to include the following potential
> > > > > > >>>>> > > > >> > > > > improvements:
> > > > > > >>>>> > > > >> > > > >   - Let the broker notify the controller of ISR change and disk state
> > > > > > >>>>> > > > >> > > > >     change via RPC instead of using zookeeper.
> > > > > > >>>>> > > > >> > > > >   - Handle various failure scenarios (e.g. slow disk) on a case-by-case
> > > > > > >>>>> > > > >> > > > >     basis. For example, we may want to detect a slow disk and consider
> > > > > > >>>>> > > > >> > > > >     it offline.
> > > > > > >>>>> > > > >> > > > >   - Allow an admin to mark a directory as bad so that it will not be used.
> > > > > > >>>>> > > > >> > > > >
> > > > > > >>>>> > > > >> > > > > Thanks,
> > > > > > >>>>> > > > >> > > > > Dong
> > > > > > >>>>> > > > >> > > > >
> > > > > > >>>>> > > > >> > > > >
> > > > > > >>>>> > > > >> > > > >
> > > > > > >>>>> > > > >> > > > > On Tue, Feb 7, 2017 at 5:23 PM, Dong Lin <lindon...@gmail.com> wrote:
> > > > > > >>>>> > > > >> > > > >
> > > > > > >>>>> > > > >> > > > > > Hey Eno,
> > > > > > >>>>> > > > >> > > > > >
> > > > > > >>>>> > > > >> > > > > > Thanks much for the comment!
> > > > > > >>>>> > > > >> > > > > >
> > > > > > >>>>> > > > >> > > > > > I still think the complexity added to
> > Kafka
> > > is
> > > > > > >>>>> justified
> > > > > > >>>>> > by
> > > > > > >>>>> > > > its
> > > > > > >>>>> > > > >> > > benefit.
> > > > > > >>>>> > > > >> > > > > > Let me provide my reasons below.
> > > > > > >>>>> > > > >> > > > > >
> > > > > > >>>>> > > > >> > > > > > 1) The additional logic is easy to understand and thus its complexity
> > > > > > >>>>> > > > >> > > > > > should be reasonable.
> > > > > > >>>>> > > > >> > > > > >
> > > > > > >>>>> > > > >> > > > > > On the broker side, it needs to catch the exception when accessing a log
> > > > > > >>>>> > > > >> > > > > > directory, mark the log directory and all its replicas as offline, notify
> > > > > > >>>>> > > > >> > > > > > the controller by writing the zookeeper notification path, and specify
> > > > > > >>>>> > > > >> > > > > > the error in LeaderAndIsrResponse. On the controller side, it will listen
> > > > > > >>>>> > > > >> > > > > > to zookeeper for disk failure notifications, learn about offline replicas
> > > > > > >>>>> > > > >> > > > > > in the LeaderAndIsrResponse, and take offline replicas into consideration
> > > > > > >>>>> > > > >> > > > > > when electing leaders. It also marks replicas as created in zookeeper and
> > > > > > >>>>> > > > >> > > > > > uses that to determine whether a replica has been created.
> > > > > > >>>>> > > > >> > > > > >
> > > > > > >>>>> > > > >> > > > > > That is all the logic we need to add in Kafka. I personally feel this is
> > > > > > >>>>> > > > >> > > > > > easy to reason about.
> > > > > > >>>>> > > > >> > > > > >
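
To make the broker-side flow described above concrete, here is a rough sketch in
Java (my own illustration, not code from the KIP, the prototype, or Kafka itself;
the class, interface and notification-path names are hypothetical):

    import java.io.IOException;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative sketch only. On an I/O error the broker marks the log directory
    // and the replicas on it as offline and queues a notification for the controller.
    public class LogDirFailureHandler {

        // Stand-in for whatever mechanism actually notifies the controller
        // (the KIP describes a zookeeper notification path; this interface is made up).
        public interface ControllerNotifier {
            void notifyLogDirFailure(int brokerId);
        }

        private final int brokerId;
        private final ControllerNotifier notifier;
        // log directory -> replicas hosted on it, recorded once the directory has failed
        private final Map<String, Set<String>> offlineDirs = new ConcurrentHashMap<>();

        public LogDirFailureHandler(int brokerId, ControllerNotifier notifier) {
            this.brokerId = brokerId;
            this.notifier = notifier;
        }

        // Called wherever the broker catches an IOException while accessing a log dir.
        public void handleLogDirFailure(String logDir, Set<String> replicasOnDir, IOException cause) {
            offlineDirs.put(logDir, replicasOnDir);   // mark the dir and its replicas offline
            notifier.notifyLogDirFailure(brokerId);   // e.g. create a znode the controller watches
            // LeaderAndIsrResponses for these replicas would then carry a storage error code,
            // which is how the controller learns exactly which replicas are offline.
        }

        public boolean isOffline(String logDir) {
            return offlineDirs.containsKey(logDir);
        }
    }

The controller side would be the mirror image: react to the notification, treat the
reported replicas as offline, and take them into account when electing leaders.
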
> > > > > > >>>>> > > > >> > > > > > 2) The additional code is not much.
> > > > > > >>>>> > > > >> > > > > >
> > > > > > >>>>> > > > >> > > > > > I expect the code for KIP-112 to be around 1100 lines of new code.
> > > > > > >>>>> > > > >> > > > > > Previously I implemented a prototype of a slightly different design (see
> > > > > > >>>>> > > > >> > > > > > <https://docs.google.com/document/d/1Izza0SBmZMVUBUt9s_-Dqi3D8e0KGJQYW8xgEdRsgAI/edit>)
> > > > > > >>>>> > > > >> > > > > > and uploaded it to github (see <https://github.com/lindong28/kafka/tree/JBOD>).
> > > > > > >>>>> > > > >> > > > > > The patch changed 33 files, added 1185 lines and deleted 183 lines. The
> > > > > > >>>>> > > > >> > > > > > size of the prototype patch is actually smaller than the patch for
> > > > > > >>>>> > > > >> > > > > > KIP-107 (see <https://github.com/apache/kafka/pull/2476>), which is
> > > > > > >>>>> > > > >> > > > > > already accepted. The KIP-107 patch changed 49 files, added 1349 lines
> > > > > > >>>>> > > > >> > > > > > and deleted 141 lines.
> > > > > > >>>>> > > > >> > > > > >
> > > > > > >>>>> > > > >> > > > > > 3) Comparison with one-broker-per-multiple-volumes
> > > > > > >>>>> > > > >> > > > > >
> > > > > > >>>>> > > > >> > > > > > This KIP can improve the availability of Kafka in this case such that one
> > > > > > >>>>> > > > >> > > > > > failed volume doesn't bring down the entire broker.
> > > > > > >>>>> > > > >> > > > > >
> > > > > > >>>>> > > > >> > > > > > 4) Comparison with one-broker-per-volume
> > > > > > >>>>> > > > >> > > > > >
> > > > > > >>>>> > > > >> > > > > > If each volume maps to multiple disks, then we still have a similar
> > > > > > >>>>> > > > >> > > > > > problem such that the broker will fail if any disk of the volume fails.
> > > > > > >>>>> > > > >> > > > > >
> > > > > > >>>>> > > > >> > > > > > If each volume maps to one disk, it means that we need to deploy 10
> > > > > > >>>>> > > > >> > > > > > brokers on a machine if the machine has 10 disks. I will explain the
> > > > > > >>>>> > > > >> > > > > > concerns with this approach in order of importance.
> > > > > > >>>>> > > > >> > > > > >
> > > > > > >>>>> > > > >> > > > > > - It is weird if we were to tell Kafka users to deploy 50 brokers on a
> > > > > > >>>>> > > > >> > > > > > machine with 50 disks.
> > > > > > >>>>> > > > >> > > > > >
> > > > > > >>>>> > > > >> > > > > > - Whether the user deploys Kafka on a commercial cloud platform or runs
> > > > > > >>>>> > > > >> > > > > > their own cluster, the size of the largest disk is usually limited. There
> > > > > > >>>>> > > > >> > > > > > will be scenarios where users want to increase broker capacity by having
> > > > > > >>>>> > > > >> > > > > > multiple disks per broker. This JBOD KIP makes that feasible without
> > > > > > >>>>> > > > >> > > > > > hurting availability due to single disk failure.
> > > > > > >>>>> > > > >> > > > > >
> > > > > > >>>>> > > > >> > > > > > - Automatic load rebalance across disks will be easier and more flexible
> > > > > > >>>>> > > > >> > > > > > if one broker has multiple disks. This can be future work.
> > > > > > >>>>> > > > >> > > > > >
> > > > > > >>>>> > > > >> > > > > > - There is a performance concern when you deploy 10 brokers vs. 1 broker
> > > > > > >>>>> > > > >> > > > > > on one machine. The metadata of the cluster, and requests such as
> > > > > > >>>>> > > > >> > > > > > FetchRequest, ProduceResponse and MetadataRequest, will all be 10X more.
> > > > > > >>>>> > > > >> > > > > > The packets-per-second will be 10X higher, which may limit performance if
> > > > > > >>>>> > > > >> > > > > > pps is the performance bottleneck. The number of sockets on the machine
> > > > > > >>>>> > > > >> > > > > > is 10X higher. And the number of replication threads will be 100X more.
> > > > > > >>>>> > > > >> > > > > > The impact will be more significant with an increasing number of disks
> > > > > > >>>>> > > > >> > > > > > per machine. Thus it will limit Kafka's scalability in the long term.
> > > > > > >>>>> > > > >> > > > > >
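
As a rough back-of-the-envelope reading of those multipliers (my estimate, not a
measurement): with one broker per machine, each machine runs replica fetchers toward
on the order of N other brokers; with 10 brokers per machine there are 10X as many
brokers in the cluster, so each machine runs roughly 10 brokers x ~10X fetch targets
= ~100X the replication threads and broker-to-broker connections, while per-broker
traffic such as metadata updates grows by roughly 10X.
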
> > > > > > >>>>> > > > >> > > > > > Thanks,
> > > > > > >>>>> > > > >> > > > > > Dong
> > > > > > >>>>> > > > >> > > > > >
> > > > > > >>>>> > > > >> > > > > >
> > > > > > >>>>> > > > >> > > > > > On Tue, Feb 7, 2017 at 1:51 AM, Eno
> > > Thereska <
> > > > > > >>>>> > > > >> eno.there...@gmail.com
> > > > > > >>>>> > > > >> > > >
> > > > > > >>>>> > > > >> > > > > > wrote:
> > > > > > >>>>> > > > >> > > > > >
> > > > > > >>>>> > > > >> > > > > >> Hi Dong,
> > > > > > >>>>> > > > >> > > > > >>
> > > > > > >>>>> > > > >> > > > > >> To simplify the discussion today, on my
> > > part
> > > > > I'll
> > > > > > >>>>> zoom
> > > > > > >>>>> > into
> > > > > > >>>>> > > > one
> > > > > > >>>>> > > > >> > > thing
> > > > > > >>>>> > > > >> > > > > >> only:
> > > > > > >>>>> > > > >> > > > > >>
> > > > > > >>>>> > > > >> > > > > >> - I'll discuss the options called
> below :
> > > > > > >>>>> > > > >> "one-broker-per-disk" or
> > > > > > >>>>> > > > >> > > > > >> "one-broker-per-few-disks".
> > > > > > >>>>> > > > >> > > > > >>
> > > > > > >>>>> > > > >> > > > > >> - I completely buy the JBOD vs RAID
> > > arguments
> > > > > so
> > > > > > >>>>> there is
> > > > > > >>>>> > > no
> > > > > > >>>>> > > > >> need to
> > > > > > >>>>> > > > >> > > > > >> discuss that part for me. I buy it that
> > > JBODs
> > > > > are
> > > > > > >>>>> good.
> > > > > > >>>>> > > > >> > > > > >>
> > > > > > >>>>> > > > >> > > > > >> I find the terminology can be improved
> a
> > > bit.
> > > > > > >>>>> Ideally
> > > > > > >>>>> > we'd
> > > > > > >>>>> > > be
> > > > > > >>>>> > > > >> > > talking
> > > > > > >>>>> > > > >> > > > > >> about volumes, not disks. Just to make
> it
> > > > clear
> > > > > > >>>>> that
> > > > > > >>>>> > Kafka
> > > > > > >>>>> > > > >> > > understand
> > > > > > >>>>> > > > >> > > > > >> volumes/directories, not individual raw
> > > > disks.
> > > > > So
> > > > > > >>>>> by
> > > > > > >>>>> > > > >> > > > > >> "one-broker-per-few-disks" what I mean
> is
> > > > that
> > > > > > the
> > > > > > >>>>> admin
> > > > > > >>>>> > > can
> > > > > > >>>>> > > > >> pool a
> > > > > > >>>>> > > > >> > > few
> > > > > > >>>>> > > > >> > > > > >> disks together to create a
> > volume/directory
> > > > and
> > > > > > >>>>> give that
> > > > > > >>>>> > > to
> > > > > > >>>>> > > > >> Kafka.
> > > > > > >>>>> > > > >> > > > > >>
> > > > > > >>>>> > > > >> > > > > >>
> > > > > > >>>>> > > > >> > > > > >> The kernel of my question will be that
> > the
> > > > > admin
> > > > > > >>>>> already
> > > > > > >>>>> > > has
> > > > > > >>>>> > > > >> tools
> > > > > > >>>>> > > > >> > > to 1)
> > > > > > >>>>> > > > >> > > > > >> create volumes/directories from a JBOD
> > and
> > > 2)
> > > > > > >>>>> start a
> > > > > > >>>>> > > broker
> > > > > > >>>>> > > > >> on a
> > > > > > >>>>> > > > >> > > desired
> > > > > > >>>>> > > > >> > > > > >> machine and 3) assign a broker
> resources
> > > > like a
> > > > > > >>>>> > directory.
> > > > > > >>>>> > > I
> > > > > > >>>>> > > > >> claim
> > > > > > >>>>> > > > >> > > that
> > > > > > >>>>> > > > >> > > > > >> those tools are sufficient to optimise
> > > > resource
> > > > > > >>>>> > allocation.
> > > > > > >>>>> > > > I
> > > > > > >>>>> > > > >> > > understand
> > > > > > >>>>> > > > >> > > > > >> that a broker could manage point 3)
> > itself,
> > > > ie
> > > > > > >>>>> juggle the
> > > > > > >>>>> > > > >> > > directories. My
> > > > > > >>>>> > > > >> > > > > >> question is whether the complexity
> added
> > to
> > > > > Kafka
> > > > > > >>>>> is
> > > > > > >>>>> > > > justified.
> > > > > > >>>>> > > > >> > > > > >> Operationally it seems to me an admin
> > will
> > > > > still
> > > > > > >>>>> have to
> > > > > > >>>>> > do
> > > > > > >>>>> > > > >> all the
> > > > > > >>>>> > > > >> > > three
> > > > > > >>>>> > > > >> > > > > >> items above.
> > > > > > >>>>> > > > >> > > > > >>
> > > > > > >>>>> > > > >> > > > > >> Looking forward to the discussion
> > > > > > >>>>> > > > >> > > > > >> Thanks
> > > > > > >>>>> > > > >> > > > > >> Eno
> > > > > > >>>>> > > > >> > > > > >>
> > > > > > >>>>> > > > >> > > > > >>
> > > > > > >>>>> > > > >> > > > > >> > On 1 Feb 2017, at 17:21, Dong Lin <
> > > > > > >>>>> lindon...@gmail.com
> > > > > > >>>>> > >
> > > > > > >>>>> > > > >> wrote:
> > > > > > >>>>> > > > >> > > > > >> >
> > > > > > >>>>> > > > >> > > > > >> > Hey Eno,
> > > > > > >>>>> > > > >> > > > > >> >
> > > > > > >>>>> > > > >> > > > > >> > Thanks much for the review.
> > > > > > >>>>> > > > >> > > > > >> >
> > > > > > >>>>> > > > >> > > > > >> > I think your suggestion is to split the disks of a machine into multiple
> > > > > > >>>>> > > > >> > > > > >> > disk sets and run one broker per disk set. Yeah, this is similar to
> > > > > > >>>>> > > > >> > > > > >> > Colin's suggestion of one-broker-per-disk, which we have evaluated at
> > > > > > >>>>> > > > >> > > > > >> > LinkedIn and consider to be a good short term approach.
> > > > > > >>>>> > > > >> > > > > >> >
> > > > > > >>>>> > > > >> > > > > >> > As of now I don't think any of these approaches is a better alternative
> > > > > > >>>>> > > > >> > > > > >> > in the long term. I will summarize them here. I have put these reasons
> > > > > > >>>>> > > > >> > > > > >> > in the KIP's motivation section and rejected alternatives section. I am
> > > > > > >>>>> > > > >> > > > > >> > happy to discuss more and I would certainly like to use an alternative
> > > > > > >>>>> > > > >> > > > > >> > solution that is easier to implement with better performance.
> > > > > > >>>>> > > > >> > > > > >> >
> > > > > > >>>>> > > > >> > > > > >> > - JBOD vs. RAID-10: if we switch from RAID-10 with replication-factor=2
> > > > > > >>>>> > > > >> > > > > >> > to JBOD with replication-factor=3, we get a 25% reduction in disk usage
> > > > > > >>>>> > > > >> > > > > >> > and double the tolerance of broker failures before data unavailability,
> > > > > > >>>>> > > > >> > > > > >> > from 1 to 2. This is a pretty huge gain for any company that uses Kafka
> > > > > > >>>>> > > > >> > > > > >> > at large scale.
> > > > > > >>>>> > > > >> > > > > >> >
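
To spell out the 25% figure under those assumptions: RAID-10 mirrors every byte, so
replication-factor=2 on RAID-10 stores 2 x 2 = 4 physical copies of each byte, while
JBOD with replication-factor=3 stores 3 copies, i.e. 3/4 of the raw disk, a 25%
reduction. And going from replication-factor=2 to 3 raises the number of broker
failures that can be tolerated before a partition becomes unavailable from 1 to 2.
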
> > > > > > >>>>> > > > >> > > > > >> > - JBOD vs. one-broker-per-disk: The benefit of one-broker-per-disk is
> > > > > > >>>>> > > > >> > > > > >> > that no major code change is needed in Kafka. Among the disadvantages of
> > > > > > >>>>> > > > >> > > > > >> > one-broker-per-disk summarized in the KIP and the previous email with
> > > > > > >>>>> > > > >> > > > > >> > Colin, the biggest one is the 15% throughput loss compared to JBOD and
> > > > > > >>>>> > > > >> > > > > >> > less flexibility to balance across disks. Further, it probably requires
> > > > > > >>>>> > > > >> > > > > >> > changes to internal deployment tools at various companies to deal with
> > > > > > >>>>> > > > >> > > > > >> > the one-broker-per-disk setup.
> > > > > > >>>>> > > > >> > > > > >> >
> > > > > > >>>>> > > > >> > > > > >> > - JBOD vs. RAID-0: This is the setup used at Microsoft. The problem is
> > > > > > >>>>> > > > >> > > > > >> > that a broker becomes unavailable if any disk fails. Suppose
> > > > > > >>>>> > > > >> > > > > >> > replication-factor=2 and there are 10 disks per machine. Then the
> > > > > > >>>>> > > > >> > > > > >> > probability of any message becoming unavailable due to disk failure with
> > > > > > >>>>> > > > >> > > > > >> > RAID-0 is 100X higher than that with JBOD.
> > > > > > >>>>> > > > >> > > > > >> >
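
A rough way to see the 100X factor (assuming each of the 10 disks fails independently
with some small probability p): with RAID-0, one failed disk takes down the whole
broker, so a message with 2 replicas becomes unavailable when both replica brokers
have any failed disk, roughly (10p)^2 = 100p^2; with JBOD it becomes unavailable only
when the two specific disks holding its replicas both fail, roughly p^2. The ratio is
about 100.
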
> > > > > > >>>>> > > > >> > > > > >> > - JBOD vs. one-broker-per-few-disks: one-broker-per-few-disks is
> > > > > > >>>>> > > > >> > > > > >> > somewhere between one-broker-per-disk and RAID-0, so it carries an
> > > > > > >>>>> > > > >> > > > > >> > average of the disadvantages of these two approaches.
> > > > > > >>>>> > > > >> > > > > >> >
> > > > > > >>>>> > > > >> > > > > >> > To answer your question: I think it is reasonable to manage disks in
> > > > > > >>>>> > > > >> > > > > >> > Kafka. By "managing disks" we mean the management of the assignment of
> > > > > > >>>>> > > > >> > > > > >> > replicas across disks. Here are my reasons in more detail:
> > > > > > >>>>> > > > >> > > > > >> >
> > > > > > >>>>> > > > >> > > > > >> > - I don't think this KIP is a big step change. Since Kafka already
> > > > > > >>>>> > > > >> > > > > >> > allows users to configure it with multiple log directories or disks, it
> > > > > > >>>>> > > > >> > > > > >> > is implicit that Kafka manages disks. It is just not a complete feature.
> > > > > > >>>>> > > > >> > > > > >> > Microsoft and probably other companies are using this feature under the
> > > > > > >>>>> > > > >> > > > > >> > undesirable effect that a broker will fail if any disk fails. It is good
> > > > > > >>>>> > > > >> > > > > >> > to complete this feature.
> > > > > > >>>>> > > > >> > > > > >> >
> > > > > > >>>>> > > > >> > > > > >> > - I think it is reasonable to manage disks in Kafka. One of the most
> > > > > > >>>>> > > > >> > > > > >> > important things Kafka does is to determine the replica assignment
> > > > > > >>>>> > > > >> > > > > >> > across brokers and make sure enough copies of a given replica are
> > > > > > >>>>> > > > >> > > > > >> > available. I would argue that, conceptually, this is not much different
> > > > > > >>>>> > > > >> > > > > >> > from determining the replica assignment across disks.
> > > > > > >>>>> > > > >> > > > > >> >
> > > > > > >>>>> > > > >> > > > > >> > - I would agree that this KIP improves the performance of Kafka at the
> > > > > > >>>>> > > > >> > > > > >> > cost of more complexity inside Kafka, by switching from RAID-10 to JBOD.
> > > > > > >>>>> > > > >> > > > > >> > I would argue that this is the right direction. If we can gain 20%+
> > > > > > >>>>> > > > >> > > > > >> > performance by managing the NIC in Kafka as compared to the existing
> > > > > > >>>>> > > > >> > > > > >> > approach and other alternatives, I would say we should just do it. Such
> > > > > > >>>>> > > > >> > > > > >> > a gain in performance, or equivalently reduction in cost, can save
> > > > > > >>>>> > > > >> > > > > >> > millions of dollars per year for any company running Kafka at large
> > > > > > >>>>> > > > >> > > > > >> > scale.
> > > > > > >>>>> > > > >> > > > > >> >
> > > > > > >>>>> > > > >> > > > > >> > Thanks,
> > > > > > >>>>> > > > >> > > > > >> > Dong
> > > > > > >>>>> > > > >> > > > > >> >
> > > > > > >>>>> > > > >> > > > > >> >
> > > > > > >>>>> > > > >> > > > > >> > On Wed, Feb 1, 2017 at 5:41 AM, Eno
> > > > Thereska
> > > > > <
> > > > > > >>>>> > > > >> > > eno.there...@gmail.com>
> > > > > > >>>>> > > > >> > > > > >> wrote:
> > > > > > >>>>> > > > >> > > > > >> >
> > > > > > >>>>> > > > >> > > > > >> >> I'm coming somewhat late to the
> > > > discussion,
> > > > > > >>>>> apologies
> > > > > > >>>>> > > for
> > > > > > >>>>> > > > >> that.
> > > > > > >>>>> > > > >> > > > > >> >>
> > > > > > >>>>> > > > >> > > > > >> >> I'm worried about this proposal.
> It's
> > > > moving
> > > > > > >>>>> Kafka to
> > > > > > >>>>> > a
> > > > > > >>>>> > > > >> world
> > > > > > >>>>> > > > >> > > where it
> > > > > > >>>>> > > > >> > > > > >> >> manages disks. So in a sense, the
> > scope
> > > of
> > > > > the
> > > > > > >>>>> KIP is
> > > > > > >>>>> > > > >> limited,
> > > > > > >>>>> > > > >> > > but the
> > > > > > >>>>> > > > >> > > > > >> >> direction it sets for Kafka is
> quite a
> > > big
> > > > > > step
> > > > > > >>>>> > change.
> > > > > > >>>>> > > > >> > > Fundamentally
> > > > > > >>>>> > > > >> > > > > >> this
> > > > > > >>>>> > > > >> > > > > >> >> is about balancing resources for a
> > Kafka
> > > > > > >>>>> broker. This
> > > > > > >>>>> > > can
> > > > > > >>>>> > > > be
> > > > > > >>>>> > > > >> > > done by a
> > > > > > >>>>> > > > >> > > > > >> >> tool, rather than by changing Kafka.
> > > E.g.,
> > > > > the
> > > > > > >>>>> tool
> > > > > > >>>>> > > would
> > > > > > >>>>> > > > >> take a
> > > > > > >>>>> > > > >> > > bunch
> > > > > > >>>>> > > > >> > > > > >> of
> > > > > > >>>>> > > > >> > > > > >> >> disks together, create a volume over
> > > them
> > > > > and
> > > > > > >>>>> export
> > > > > > >>>>> > > that
> > > > > > >>>>> > > > >> to a
> > > > > > >>>>> > > > >> > > Kafka
> > > > > > >>>>> > > > >> > > > > >> broker
> > > > > > >>>>> > > > >> > > > > >> >> (in addition to setting the memory
> > > limits
> > > > > for
> > > > > > >>>>> that
> > > > > > >>>>> > > broker
> > > > > > >>>>> > > > or
> > > > > > >>>>> > > > >> > > limiting
> > > > > > >>>>> > > > >> > > > > >> other
> > > > > > >>>>> > > > >> > > > > >> >> resources). A different bunch of
> disks
> > > can
> > > > > > then
> > > > > > >>>>> make
> > > > > > >>>>> > up
> > > > > > >>>>> > > a
> > > > > > >>>>> > > > >> second
> > > > > > >>>>> > > > >> > > > > >> volume,
> > > > > > >>>>> > > > >> > > > > >> >> and be used by another Kafka broker.
> > > This
> > > > is
> > > > > > >>>>> aligned
> > > > > > >>>>> > > with
> > > > > > >>>>> > > > >> what
> > > > > > >>>>> > > > >> > > Colin is
> > > > > > >>>>> > > > >> > > > > >> >> saying (as I understand it).
> > > > > > >>>>> > > > >> > > > > >> >>
> > > > > > >>>>> > > > >> > > > > >> >> Disks are not the only resource on a
> > > > > machine,
> > > > > > >>>>> there
> > > > > > >>>>> > are
> > > > > > >>>>> > > > >> several
> > > > > > >>>>> > > > >> > > > > >> instances
> > > > > > >>>>> > > > >> > > > > >> >> where multiple NICs are used for
> > > example.
> > > > Do
> > > > > > we
> > > > > > >>>>> want
> > > > > > >>>>> > > fine
> > > > > > >>>>> > > > >> grained
> > > > > > >>>>> > > > >> > > > > >> >> management of all these resources?
> I'd
> > > > argue
> > > > > > >>>>> that
> > > > > > >>>>> > opens
> > > > > > >>>>> > > us
> > > > > > >>>>> > > > >> the
> > > > > > >>>>> > > > >> > > system
> > > > > > >>>>> > > > >> > > > > >> to a
> > > > > > >>>>> > > > >> > > > > >> >> lot of complexity.
> > > > > > >>>>> > > > >> > > > > >> >>
> > > > > > >>>>> > > > >> > > > > >> >> Thanks
> > > > > > >>>>> > > > >> > > > > >> >> Eno
> > > > > > >>>>> > > > >> > > > > >> >>
> > > > > > >>>>> > > > >> > > > > >> >>
> > > > > > >>>>> > > > >> > > > > >> >>> On 1 Feb 2017, at 01:53, Dong Lin <
> > > > > > >>>>> > lindon...@gmail.com
> > > > > > >>>>> > > >
> > > > > > >>>>> > > > >> wrote:
> > > > > > >>>>> > > > >> > > > > >> >>>
> > > > > > >>>>> > > > >> > > > > >> >>> Hi all,
> > > > > > >>>>> > > > >> > > > > >> >>>
> > > > > > >>>>> > > > >> > > > > >> >>> I am going to initiate the vote If
> > > there
> > > > is
> > > > > > no
> > > > > > >>>>> > further
> > > > > > >>>>> > > > >> concern
> > > > > > >>>>> > > > >> > > with
> > > > > > >>>>> > > > >> > > > > >> the
> > > > > > >>>>> > > > >> > > > > >> >> KIP.
> > > > > > >>>>> > > > >> > > > > >> >>>
> > > > > > >>>>> > > > >> > > > > >> >>> Thanks,
> > > > > > >>>>> > > > >> > > > > >> >>> Dong
> > > > > > >>>>> > > > >> > > > > >> >>>
> > > > > > >>>>> > > > >> > > > > >> >>>
> > > > > > >>>>> > > > >> > > > > >> >>> On Fri, Jan 27, 2017 at 8:08 PM,
> > radai
> > > <
> > > > > > >>>>> > > > >> > > radai.rosenbl...@gmail.com>
> > > > > > >>>>> > > > >> > > > > >> >> wrote:
> > > > > > >>>>> > > > >> > > > > >> >>>
> > > > > > >>>>> > > > >> > > > > >> >>>> a few extra points:
> > > > > > >>>>> > > > >> > > > > >> >>>>
> > > > > > >>>>> > > > >> > > > > >> >>>> 1. broker per disk might also
> incur
> > > more
> > > > > > >>>>> client <-->
> > > > > > >>>>> > > > >> broker
> > > > > > >>>>> > > > >> > > sockets:
> > > > > > >>>>> > > > >> > > > > >> >>>> suppose every producer / consumer
> > > > "talks"
> > > > > to
> > > > > > >>>>> >1
> > > > > > >>>>> > > > partition,
> > > > > > >>>>> > > > >> > > there's a
> > > > > > >>>>> > > > >> > > > > >> >> very
> > > > > > >>>>> > > > >> > > > > >> >>>> good chance that partitions that
> > were
> > > > > > >>>>> co-located on
> > > > > > >>>>> > a
> > > > > > >>>>> > > > >> single
> > > > > > >>>>> > > > >> > > 10-disk
> > > > > > >>>>> > > > >> > > > > >> >> broker
> > > > > > >>>>> > > > >> > > > > >> >>>> would now be split between several
> > > > > > single-disk
> > > > > > >>>>> > broker
> > > > > > >>>>> > > > >> > > processes on
> > > > > > >>>>> > > > >> > > > > >> the
> > > > > > >>>>> > > > >> > > > > >> >> same
> > > > > > >>>>> > > > >> > > > > >> >>>> machine. hard to put a multiplier
> on
> > > > this,
> > > > > > but
> > > > > > >>>>> > likely
> > > > > > >>>>> > > > >x1.
> > > > > > >>>>> > > > >> > > sockets
> > > > > > >>>>> > > > >> > > > > >> are a
> > > > > > >>>>> > > > >> > > > > >> >>>> limited resource at the OS level
> and
> > > > incur
> > > > > > >>>>> some
> > > > > > >>>>> > memory
> > > > > > >>>>> > > > >> cost
> > > > > > >>>>> > > > >> > > (kernel
> > > > > > >>>>> > > > >> > > > > >> >>>> buffers)
> > > > > > >>>>> > > > >> > > > > >> >>>>
> > > > > > >>>>> > > > >> > > > > >> >>>> 2. there's a memory overhead to
> > > spinning
> > > > > up
> > > > > > a
> > > > > > >>>>> JVM
> > > > > > >>>>> > > > >> (compiled
> > > > > > >>>>> > > > >> > > code and
> > > > > > >>>>> > > > >> > > > > >> >> byte
> > > > > > >>>>> > > > >> > > > > >> >>>> code objects etc). if we assume
> this
> > > > > > overhead
> > > > > > >>>>> is
> > > > > > >>>>> > ~300
> > > > > > >>>>> > > MB
> > > > > > >>>>> > > > >> > > (order of
> > > > > > >>>>> > > > >> > > > > >> >>>> magnitude, specifics vary) than
> > > spinning
> > > > > up
> > > > > > >>>>> 10 JVMs
> > > > > > >>>>> > > > would
> > > > > > >>>>> > > > >> lose
> > > > > > >>>>> > > > >> > > you 3
> > > > > > >>>>> > > > >> > > > > >> GB
> > > > > > >>>>> > > > >> > > > > >> >> of
> > > > > > >>>>> > > > >> > > > > >> >>>> RAM. not a ton, but non
> negligible.
> > > > > > >>>>> > > > >> > > > > >> >>>>
> > > > > > >>>>> > > > >> > > > > >> >>>> 3. there would also be some
> overhead
> > > > > > >>>>> downstream of
> > > > > > >>>>> > > kafka
> > > > > > >>>>> > > > >> in any
> > > > > > >>>>> > > > >> > > > > >> >> management
> > > > > > >>>>> > > > >> > > > > >> >>>> / monitoring / log aggregation
> > system.
> > > > > > likely
> > > > > > >>>>> less
> > > > > > >>>>> > > than
> > > > > > >>>>> > > > >> x10
> > > > > > >>>>> > > > >> > > though.
> > > > > > >>>>> > > > >> > > > > >> >>>>
> > > > > > >>>>> > > > >> > > > > >> >>>> 4. (related to above) - added
> > > complexity
> > > > > of
> > > > > > >>>>> > > > administration
> > > > > > >>>>> > > > >> > > with more
> > > > > > >>>>> > > > >> > > > > >> >>>> running instances.
> > > > > > >>>>> > > > >> > > > > >> >>>>
> > > > > > >>>>> > > > >> > > > > >> >>>> is anyone running kafka with
> > anywhere
> > > > near
> > > > > > >>>>> 100GB
> > > > > > >>>>> > > heaps?
> > > > > > >>>>> > > > i
> > > > > > >>>>> > > > >> > > thought the
> > > > > > >>>>> > > > >> > > > > >> >> point
> > > > > > >>>>> > > > >> > > > > >> >>>> was to rely on kernel page cache
> to
> > do
> > > > the
> > > > > > >>>>> disk
> > > > > > >>>>> > > > buffering
> > > > > > >>>>> > > > >> ....
> > > > > > >>>>> > > > >> > > > > >> >>>>
> > > > > > >>>>> > > > >> > > > > >> >>>> On Thu, Jan 26, 2017 at 11:00 AM,
> > Dong
> > > > > Lin <
> > > > > > >>>>> > > > >> > > lindon...@gmail.com>
> > > > > > >>>>> > > > >> > > > > >> wrote:
> > > > > > >>>>> > > > >> > > > > >> >>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>> Hey Colin,
> > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>> Thanks much for the comment.
> Please
> > > see
> > > > > me
> > > > > > >>>>> comment
> > > > > > >>>>> > > > >> inline.
> > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>> On Thu, Jan 26, 2017 at 10:15 AM,
> > > Colin
> > > > > > >>>>> McCabe <
> > > > > > >>>>> > > > >> > > cmcc...@apache.org>
> > > > > > >>>>> > > > >> > > > > >> >>>> wrote:
> > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>> On Wed, Jan 25, 2017, at 13:50,
> > Dong
> > > > Lin
> > > > > > >>>>> wrote:
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> Hey Colin,
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> Good point! Yeah we have
> actually
> > > > > > >>>>> considered and
> > > > > > >>>>> > > > >> tested this
> > > > > > >>>>> > > > >> > > > > >> >>>> solution,
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> which we call
> > one-broker-per-disk.
> > > It
> > > > > > >>>>> would work
> > > > > > >>>>> > > and
> > > > > > >>>>> > > > >> should
> > > > > > >>>>> > > > >> > > > > >> require
> > > > > > >>>>> > > > >> > > > > >> >>>> no
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> major change in Kafka as
> compared
> > > to
> > > > > this
> > > > > > >>>>> JBOD
> > > > > > >>>>> > KIP.
> > > > > > >>>>> > > > So
> > > > > > >>>>> > > > >> it
> > > > > > >>>>> > > > >> > > would
> > > > > > >>>>> > > > >> > > > > >> be a
> > > > > > >>>>> > > > >> > > > > >> >>>>> good
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> short term solution.
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> But it has a few drawbacks
> which
> > > > makes
> > > > > it
> > > > > > >>>>> less
> > > > > > >>>>> > > > >> desirable in
> > > > > > >>>>> > > > >> > > the
> > > > > > >>>>> > > > >> > > > > >> long
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> term.
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> Assume we have 10 disks on a
> > > machine.
> > > > > > Here
> > > > > > >>>>> are
> > > > > > >>>>> > the
> > > > > > >>>>> > > > >> problems:
> > > > > > >>>>> > > > >> > > > > >> >>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>> Hi Dong,
> > > > > > >>>>> > > > >> > > > > >> >>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>> Thanks for the thoughtful reply.
> > > > > > >>>>> > > > >> > > > > >> >>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> 1) Our stress test result shows
> > > that
> > > > > > >>>>> > > > >> one-broker-per-disk
> > > > > > >>>>> > > > >> > > has 15%
> > > > > > >>>>> > > > >> > > > > >> >>>> lower
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> throughput
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> 2) Controller would need to
> send
> > > 10X
> > > > as
> > > > > > >>>>> many
> > > > > > >>>>> > > > >> > > LeaderAndIsrRequest,
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> MetadataUpdateRequest and
> > > > > > >>>>> StopReplicaRequest.
> > > > > > >>>>> > This
> > > > > > >>>>> > > > >> > > increases the
> > > > > > >>>>> > > > >> > > > > >> >>>> burden
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> on
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> controller which can be the
> > > > performance
> > > > > > >>>>> > bottleneck.
> > > > > > >>>>> > > > >> > > > > >> >>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>> Maybe I'm misunderstanding
> > > something,
> > > > > but
> > > > > > >>>>> there
> > > > > > >>>>> > > would
> > > > > > >>>>> > > > >> not be
> > > > > > >>>>> > > > >> > > 10x as
> > > > > > >>>>> > > > >> > > > > >> >>>> many
> > > > > > >>>>> > > > >> > > > > >> >>>>>> StopReplicaRequest RPCs, would
> > > there?
> > > > > The
> > > > > > >>>>> other
> > > > > > >>>>> > > > >> requests
> > > > > > >>>>> > > > >> > > would
> > > > > > >>>>> > > > >> > > > > >> >>>> increase
> > > > > > >>>>> > > > >> > > > > >> >>>>>> 10x, but from a pretty low base,
> > > > right?
> > > > > > We
> > > > > > >>>>> are
> > > > > > >>>>> > not
> > > > > > >>>>> > > > >> > > reassigning
> > > > > > >>>>> > > > >> > > > > >> >>>>>> partitions all the time, I hope
> > (or
> > > > else
> > > > > > we
> > > > > > >>>>> have
> > > > > > >>>>> > > > bigger
> > > > > > >>>>> > > > >> > > > > >> problems...)
> > > > > > >>>>> > > > >> > > > > >> >>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>> I think the controller will group StopReplicaRequest per broker and send
> > > > > > >>>>> > > > >> > > > > >> >>>>> only one StopReplicaRequest to a broker during controlled shutdown.
> > > > > > >>>>> > > > >> > > > > >> >>>>> Anyway, we don't have to worry about this if we agree that the other
> > > > > > >>>>> > > > >> > > > > >> >>>>> requests will increase by 10X. One MetadataRequest is sent to each
> > > > > > >>>>> > > > >> > > > > >> >>>>> broker in the cluster every time there is a leadership change. I am not
> > > > > > >>>>> > > > >> > > > > >> >>>>> sure this is a real problem. But in theory this makes the overhead
> > > > > > >>>>> > > > >> > > > > >> >>>>> complexity O(number of brokers) and may be a concern in the future.
> > > > > > >>>>> > > > >> > > > > >> >>>>> Ideally we should avoid it.
> > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> 3) Less efficient use of physical resources on the machine. The number
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> of sockets on each machine will increase by 10X. The number of
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> connections between any two machines will increase by 100X.
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> 4) Less efficient way to manage memory and quota.
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> 5) Rebalance between disks/brokers on the same machine will be less
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> efficient and less flexible. A broker has to read data from another
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> broker on the same machine via a socket. It is also harder to do
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> automatic load balancing between disks on the same machine in the
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> future.
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> I will put this and the
> > explanation
> > > > in
> > > > > > the
> > > > > > >>>>> > rejected
> > > > > > >>>>> > > > >> > > alternative
> > > > > > >>>>> > > > >> > > > > >> >>>>> section.
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> I
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> have a few questions:
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> - Can you explain why this solution can help avoid the scalability
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> bottleneck? I actually think it will exacerbate the scalability problem
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> due to 2) above.
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> - Why can we push more RPCs with this solution?
> > > > > > >>>>> > > > >> > > > > >> >>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>> To really answer this question
> > we'd
> > > > have
> > > > > > to
> > > > > > >>>>> take a
> > > > > > >>>>> > > > deep
> > > > > > >>>>> > > > >> dive
> > > > > > >>>>> > > > >> > > into
> > > > > > >>>>> > > > >> > > > > >> the
> > > > > > >>>>> > > > >> > > > > >> >>>>>> locking of the broker and figure
> > out
> > > > how
> > > > > > >>>>> > effectively
> > > > > > >>>>> > > > it
> > > > > > >>>>> > > > >> can
> > > > > > >>>>> > > > >> > > > > >> >> parallelize
> > > > > > >>>>> > > > >> > > > > >> >>>>>> truly independent requests.
> > Almost
> > > > > every
> > > > > > >>>>> > > > multithreaded
> > > > > > >>>>> > > > >> > > process is
> > > > > > >>>>> > > > >> > > > > >> >>>> going
> > > > > > >>>>> > > > >> > > > > >> >>>>>> to have shared state, like
> shared
> > > > queues
> > > > > > or
> > > > > > >>>>> shared
> > > > > > >>>>> > > > >> sockets,
> > > > > > >>>>> > > > >> > > that is
> > > > > > >>>>> > > > >> > > > > >> >>>>>> going to make scaling less than
> > > linear
> > > > > > when
> > > > > > >>>>> you
> > > > > > >>>>> > add
> > > > > > >>>>> > > > >> disks or
> > > > > > >>>>> > > > >> > > > > >> >>>> processors.
> > > > > > >>>>> > > > >> > > > > >> >>>>>> (And clearly, another option is
> to
> > > > > improve
> > > > > > >>>>> that
> > > > > > >>>>> > > > >> scalability,
> > > > > > >>>>> > > > >> > > rather
> > > > > > >>>>> > > > >> > > > > >> >>>>>> than going multi-process!)
> > > > > > >>>>> > > > >> > > > > >> >>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>> Yeah I also think it is better to
> > > > improve
> > > > > > >>>>> > scalability
> > > > > > >>>>> > > > >> inside
> > > > > > >>>>> > > > >> > > kafka
> > > > > > >>>>> > > > >> > > > > >> code
> > > > > > >>>>> > > > >> > > > > >> >>>> if
> > > > > > >>>>> > > > >> > > > > >> >>>>> possible. I am not sure we
> > currently
> > > > have
> > > > > > any
> > > > > > >>>>> > > > scalability
> > > > > > >>>>> > > > >> > > issue
> > > > > > >>>>> > > > >> > > > > >> inside
> > > > > > >>>>> > > > >> > > > > >> >>>>> Kafka that can not be removed
> > without
> > > > > using
> > > > > > >>>>> > > > >> multi-process.
> > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> - It is true that a garbage
> > > > collection
> > > > > in
> > > > > > >>>>> one
> > > > > > >>>>> > > broker
> > > > > > >>>>> > > > >> would
> > > > > > >>>>> > > > >> > > not
> > > > > > >>>>> > > > >> > > > > >> affect
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> others. But that is after every
> > > > broker
> > > > > > >>>>> only uses
> > > > > > >>>>> > > 1/10
> > > > > > >>>>> > > > >> of the
> > > > > > >>>>> > > > >> > > > > >> memory.
> > > > > > >>>>> > > > >> > > > > >> >>>>> Can
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> we be sure that this will
> > actually
> > > > help
> > > > > > >>>>> > > performance?
> > > > > > >>>>> > > > >> > > > > >> >>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>> The big question is, how much
> > memory
> > > > do
> > > > > > >>>>> Kafka
> > > > > > >>>>> > > brokers
> > > > > > >>>>> > > > >> use
> > > > > > >>>>> > > > >> > > now, and
> > > > > > >>>>> > > > >> > > > > >> how
> > > > > > >>>>> > > > >> > > > > >> >>>>>> much will they use in the
> future?
> > > Our
> > > > > > >>>>> experience
> > > > > > >>>>> > in
> > > > > > >>>>> > > > >> HDFS
> > > > > > >>>>> > > > >> > > was that
> > > > > > >>>>> > > > >> > > > > >> >> once
> > > > > > >>>>> > > > >> > > > > >> >>>>>> you start getting more than
> > > 100-200GB
> > > > > Java
> > > > > > >>>>> heap
> > > > > > >>>>> > > sizes,
> > > > > > >>>>> > > > >> full
> > > > > > >>>>> > > > >> > > GCs
> > > > > > >>>>> > > > >> > > > > >> start
> > > > > > >>>>> > > > >> > > > > >> >>>>>> taking minutes to finish when
> > using
> > > > the
> > > > > > >>>>> standard
> > > > > > >>>>> > > JVMs.
> > > > > > >>>>> > > > >> That
> > > > > > >>>>> > > > >> > > alone
> > > > > > >>>>> > > > >> > > > > >> is
> > > > > > >>>>> > > > >> > > > > >> >> a
> > > > > > >>>>> > > > >> > > > > >> >>>>>> good reason to go multi-process
> or
> > > > > > consider
> > > > > > >>>>> > storing
> > > > > > >>>>> > > > more
> > > > > > >>>>> > > > >> > > things off
> > > > > > >>>>> > > > >> > > > > >> >> the
> > > > > > >>>>> > > > >> > > > > >> >>>>>> Java heap.
> > > > > > >>>>> > > > >> > > > > >> >>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>> I see. Now I agree
> > > one-broker-per-disk
> > > > > > >>>>> should be
> > > > > > >>>>> > more
> > > > > > >>>>> > > > >> > > efficient in
> > > > > > >>>>> > > > >> > > > > >> >> terms
> > > > > > >>>>> > > > >> > > > > >> >>>> of
> > > > > > >>>>> > > > >> > > > > >> >>>>> GC since each broker probably
> needs
> > > > less
> > > > > > >>>>> than 1/10
> > > > > > >>>>> > of
> > > > > > >>>>> > > > the
> > > > > > >>>>> > > > >> > > memory
> > > > > > >>>>> > > > >> > > > > >> >>>> available
> > > > > > >>>>> > > > >> > > > > >> >>>>> on a typical machine nowadays. I
> > will
> > > > > > remove
> > > > > > >>>>> this
> > > > > > >>>>> > > from
> > > > > > >>>>> > > > >> the
> > > > > > >>>>> > > > >> > > reason of
> > > > > > >>>>> > > > >> > > > > >> >>>>> rejection.
> > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>> Disk failure is the "easy" case.
> > > The
> > > > > > >>>>> "hard" case,
> > > > > > >>>>> > > > >> which is
> > > > > > >>>>> > > > >> > > > > >> >>>>>> unfortunately also the much more
> > > > common
> > > > > > >>>>> case, is
> > > > > > >>>>> > > disk
> > > > > > >>>>> > > > >> > > misbehavior.
> > > > > > >>>>> > > > >> > > > > >> >>>>>> Towards the end of their lives,
> > > disks
> > > > > tend
> > > > > > >>>>> to
> > > > > > >>>>> > start
> > > > > > >>>>> > > > >> slowing
> > > > > > >>>>> > > > >> > > down
> > > > > > >>>>> > > > >> > > > > >> >>>>>> unpredictably.  Requests that
> > would
> > > > have
> > > > > > >>>>> completed
> > > > > > >>>>> > > > >> > > immediately
> > > > > > >>>>> > > > >> > > > > >> before
> > > > > > >>>>> > > > >> > > > > >> >>>>>> start taking 20, 100 500
> > > milliseconds.
> > > > > > >>>>> Some files
> > > > > > >>>>> > > may
> > > > > > >>>>> > > > >> be
> > > > > > >>>>> > > > >> > > readable
> > > > > > >>>>> > > > >> > > > > >> and
> > > > > > >>>>> > > > >> > > > > >> >>>>>> other files may not be.  System
> > > calls
> > > > > > hang,
> > > > > > >>>>> > > sometimes
> > > > > > >>>>> > > > >> > > forever, and
> > > > > > >>>>> > > > >> > > > > >> the
> > > > > > >>>>> > > > >> > > > > >> >>>>>> Java process can't abort them,
> > > because
> > > > > the
> > > > > > >>>>> hang is
> > > > > > >>>>> > > in
> > > > > > >>>>> > > > >> the
> > > > > > >>>>> > > > >> > > kernel.
> > > > > > >>>>> > > > >> > > > > >> It
> > > > > > >>>>> > > > >> > > > > >> >>>> is
> > > > > > >>>>> > > > >> > > > > >> >>>>>> not fun when threads are stuck
> in
> > "D
> > > > > > state"
> > > > > > >>>>> > > > >> > > > > >> >>>>>> http://stackoverflow.com/quest
> > > > > > >>>>> > > > >> ions/20423521/process-perminan
> > > > > > >>>>> > > > >> > > > > >> >>>>>> tly-stuck-on-d-state
> > > > > > >>>>> > > > >> > > > > >> >>>>>> .  Even kill -9 cannot abort the
> > > > thread
> > > > > > >>>>> then.
> > > > > > >>>>> > > > >> Fortunately,
> > > > > > >>>>> > > > >> > > this is
> > > > > > >>>>> > > > >> > > > > >> >>>>>> rare.
> > > > > > >>>>> > > > >> > > > > >> >>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>> I agree it is a harder problem
> and
> > it
> > > > is
> > > > > > >>>>> rare. We
> > > > > > >>>>> > > > >> probably
> > > > > > >>>>> > > > >> > > don't
> > > > > > >>>>> > > > >> > > > > >> have
> > > > > > >>>>> > > > >> > > > > >> >> to
> > > > > > >>>>> > > > >> > > > > >> >>>>> worry about it in this KIP since
> > this
> > > > > issue
> > > > > > >>>>> is
> > > > > > >>>>> > > > >> orthogonal to
> > > > > > >>>>> > > > >> > > > > >> whether or
> > > > > > >>>>> > > > >> > > > > >> >>>> not
> > > > > > >>>>> > > > >> > > > > >> >>>>> we use JBOD.
> > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>> Another approach we should consider is for Kafka to implement its own
> > > > > > >>>>> > > > >> > > > > >> >>>>>> storage layer that would stripe across multiple disks.  This wouldn't
> > > > > > >>>>> > > > >> > > > > >> >>>>>> have to be done at the block level, but could be done at the file level.
> > > > > > >>>>> > > > >> > > > > >> >>>>>> We could use consistent hashing to determine which disks a file should
> > > > > > >>>>> > > > >> > > > > >> >>>>>> end up on, for example.
> > > > > > >>>>> > > > >> > > > > >> >>>>>>
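
Purely to illustrate the consistent-hashing idea mentioned here (a toy sketch, not
anything from Kafka or the KIP; all names are made up), a file-to-log-dir mapping
could look like this:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.List;
    import java.util.SortedMap;
    import java.util.TreeMap;

    // Map each file name onto a hash ring of log dirs, so adding or removing a dir
    // only remaps a small fraction of the files.
    public class LogDirRing {
        private final SortedMap<Long, String> ring = new TreeMap<>();

        public LogDirRing(List<String> logDirs, int virtualNodesPerDir) {
            for (String dir : logDirs) {
                for (int i = 0; i < virtualNodesPerDir; i++) {
                    ring.put(hash(dir + "#" + i), dir);  // several points per dir smooths the distribution
                }
            }
        }

        // Pick the log dir for a given file (e.g. a segment file name).
        public String dirForFile(String fileName) {
            long h = hash(fileName);
            SortedMap<Long, String> tail = ring.tailMap(h);
            return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
        }

        private static long hash(String key) {
            try {
                byte[] d = MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
                long h = 0;
                for (int i = 0; i < 8; i++) {            // first 8 digest bytes as the ring position
                    h = (h << 8) | (d[i] & 0xffL);
                }
                return h;
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException(e);
            }
        }
    }

A broker would construct this with its configured log.dirs and place each new segment
file on dirForFile(name).
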
> > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>> Are you suggesting that we should distribute logs, or log segments,
> > > > > > >>>>> > > > >> > > > > >> >>>>> across the disks of brokers? I am not sure I fully understand this
> > > > > > >>>>> > > > >> > > > > >> >>>>> approach. My gut feeling is that this would be a drastic solution that
> > > > > > >>>>> > > > >> > > > > >> >>>>> would require non-trivial design. While this may be useful to Kafka, I
> > > > > > >>>>> > > > >> > > > > >> >>>>> would prefer not to discuss it in detail in this thread unless you
> > > > > > >>>>> > > > >> > > > > >> >>>>> believe it is strictly superior to the design in this KIP in terms of
> > > > > > >>>>> > > > >> > > > > >> >>>>> solving our use case.
> > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>> best,
> > > > > > >>>>> > > > >> > > > > >> >>>>>> Colin
> > > > > > >>>>> > > > >> > > > > >> >>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> Thanks,
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> Dong
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> On Wed, Jan 25, 2017 at 11:34
> AM,
> > > > Colin
> > > > > > >>>>> McCabe <
> > > > > > >>>>> > > > >> > > > > >> cmcc...@apache.org>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>> wrote:
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>> Hi Dong,
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>> Thanks for the writeup!  It's
> > very
> > > > > > >>>>> interesting.
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>> I apologize in advance if this
> > has
> > > > > been
> > > > > > >>>>> > discussed
> > > > > > >>>>> > > > >> > > somewhere else.
> > > > > > >>>>> > > > >> > > > > >> >>>>> But
> > > > > > >>>>> > > > >> > > > > >> >>>>>> I
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>> am curious if you have
> > considered
> > > > the
> > > > > > >>>>> solution
> > > > > > >>>>> > of
> > > > > > >>>>> > > > >> running
> > > > > > >>>>> > > > >> > > > > >> multiple
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>> brokers per node.  Clearly
> there
> > > is
> > > > a
> > > > > > >>>>> memory
> > > > > > >>>>> > > > overhead
> > > > > > >>>>> > > > >> with
> > > > > > >>>>> > > > >> > > this
> > > > > > >>>>> > > > >> > > > > >> >>>>>> solution
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>> because of the fixed cost of
> > > > starting
> > > > > > >>>>> multiple
> > > > > > >>>>> > > JVMs.
> > > > > > >>>>> > > > >> > > However,
> > > > > > >>>>> > > > >> > > > > >> >>>>> running
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>> multiple JVMs would help avoid
> > > > > > scalability
> > > > > > >>>>> > > > >> bottlenecks.
> > > > > > >>>>> > > > >> > > You
> > > > > > >>>>> > > > >> > > > > >> could
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>> probably push more RPCs per
> > > second,
> > > > > for
> > > > > > >>>>> example.
> > > > > > >>>>> > > A
> > > > > > >>>>> > > > >> garbage
> > > > > > >>>>> > > > >> > > > > >> >>>>> collection
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>> in one broker would not affect
> > the
> > > > > > >>>>> others.  It
> > > > > > >>>>> > > would
> > > > > > >>>>> > > > >> be
> > > > > > >>>>> > > > >> > > > > >> interesting
> > > > > > >>>>> > > > >> > > > > >> >>>>> to
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>> see this considered in the
> > > > "alternate
> > > > > > >>>>> designs"
> > > > > > >>>>> > > > design,
> > > > > > >>>>> > > > >> > > even if
> > > > > > >>>>> > > > >> > > > > >> you
> > > > > > >>>>> > > > >> > > > > >> >>>>> end
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>> up deciding it's not the way
> to
> > > go.
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>> best,
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>> Colin
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>> On Thu, Jan 12, 2017, at
> 10:46,
> > > Dong
> > > > > Lin
> > > > > > >>>>> wrote:
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>>> Hi all,
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>>> We created KIP-112: Handle
> disk
> > > > > failure
> > > > > > >>>>> for
> > > > > > >>>>> > JBOD.
> > > > > > >>>>> > > > >> Please
> > > > > > >>>>> > > > >> > > find
> > > > > > >>>>> > > > >> > > > > >> the
> > > > > > >>>>> > > > >> > > > > >> >>>>> KIP
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>>> wiki
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>>> in the link
> > > > > > >>>>> https://cwiki.apache.org/confl
> > > > > > >>>>> > > > >> > > > > >> >>>> uence/display/KAFKA/KIP-
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>>> 112%3A+Handle+disk+failure+
> > > > for+JBOD.
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>>> This KIP is related to
> KIP-113
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>>> <
> https://cwiki.apache.org/conf
> > > > > > >>>>> > > > >> luence/display/KAFKA/KIP-
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>> 113%3A+Support+replicas+moveme
> > > > > > >>>>> > > > >> nt+between+log+directories>:
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>>> Support replicas movement
> > between
> > > > log
> > > > > > >>>>> > > directories.
> > > > > > >>>>> > > > >> They
> > > > > > >>>>> > > > >> > > are
> > > > > > >>>>> > > > >> > > > > >> >>>> needed
> > > > > > >>>>> > > > >> > > > > >> >>>>> in
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>>> order
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>>> to support JBOD in Kafka.
> > Please
> > > > help
> > > > > > >>>>> review
> > > > > > >>>>> > the
> > > > > > >>>>> > > > >> KIP. You
> > > > > > >>>>> > > > >> > > > > >> >>>> feedback
> > > > > > >>>>> > > > >> > > > > >> >>>>> is
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>>> appreciated!
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>>> Thanks,
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>>> Dong
> > > > > > >>>>> > > > >> > > > > >> >>>>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > >>>>> > > > >> > > > > >> >>>>
> > > > > > >>>>> > > > >> > > > > >> >>
> > > > > > >>>>> > > > >> > > > > >> >>
> > > > > > >>>>> > > > >> > > > > >>
> > > > > > >>>>> > > > >> > > > > >>
> > > > > > >>>>> > > > >> > > > > >
> > > > > > >>>>> > > > >> > >
> > > > > > >>>>> > > > >>
> > > > > > >>>>> > > > >
> > > > > > >>>>> > > > >
> > > > > > >>>>> > > >
> > > > > > >>>>> > >
> > > > > > >>>>> >
> > > > > > >>>>>
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
