[
https://issues.apache.org/jira/browse/KAFKA-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635283#comment-14635283
]
Flavio Junqueira commented on KAFKA-2188:
-----------------------------------------
Hey Tim, I had a look at the proposal, and I have some feedback, mostly
questions at this point. I like this improvement, and in general, I've found
that we can improve exception handling in Kafka quite a bit. This is clearly
one such great effort. Specifically, here are more concrete points:
# In the exception handler section, I'd say that the best approach is to be
conservative and remove the drive in the case of an error. Let's not optimize
too much trying to get the exact partitions that are affected by an error and
such. If there is an error, then let an operator check it out and reinsert the
drive when fixed. As part of this comment, I'd say that it'd be a good feature
to allow drives to be inserted (manually).
# In the notifying controller discussion, could you be more specific about the
race you're concerned about? I can tell that you're pointing to a potential
race, but I'm not sure what it is.
# Open question 1: disk availability. It's kind of hard to detect exactly what
happened with a faulty disk. It could be that the disk is full, that the drive
is bad, or even just some annoying data corruption. I don't think it is worth
spending tons of time and effort trying to make a great check. If we spot an
error, then remove the drive and log it. I don't know if there is any typical
mechanism to notify operators in Kafka.
# Open question 2: log read. I think I know the problem you're referring to,
and I'll have a look to see if I can suggest some decent alternative, but we
might need to make it a bit less efficient to be able to handle IO errors
properly.
# Open question 3: restart partition. This is about the race I asked about above.
# Open question 4: operation retries. What would be a situation in which it is
worth retrying?
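To make the first point and open question 1 concrete, here is a rough sketch of the conservative policy I have in mind: any IO error takes the whole drive out of service and logs it, with no attempt to classify the error or find the affected partitions, and an operator inserts the drive again by hand once it is fixed. The class and method names (DriveRegistry, onIoError, insert) are made up for illustration and are not existing Kafka code.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Kafka: tracks which drives (log directories) are in use.
class DriveRegistry {
    private final Set<String> online = new LinkedHashSet<>();
    private final Set<String> offline = new LinkedHashSet<>();

    DriveRegistry(List<String> logDirs) {
        online.addAll(logDirs);
    }

    // Conservative policy: any IO error removes the whole drive.
    // We do not try to tell "disk full" from "bad drive" from data corruption.
    // Returns true if the drive was online and has now been removed.
    synchronized boolean onIoError(String logDir, IOException cause) {
        if (!online.remove(logDir)) {
            return false; // unknown drive, or already removed
        }
        offline.add(logDir);
        System.err.println("Removing drive " + logDir + " after IO error: " + cause.getMessage());
        return true;
    }

    // Manual insertion by an operator, either a repaired drive or a new one.
    // Returns true if the drive was not already online.
    synchronized boolean insert(String logDir) {
        offline.remove(logDir);
        return online.add(logDir);
    }

    synchronized boolean isOnline(String logDir) {
        return online.contains(logDir);
    }

    // Snapshot of the drives currently in service, in insertion order.
    synchronized List<String> onlineDrives() {
        return new ArrayList<>(online);
    }
}
```

In this sketch the only state transitions are "removed on any error" and "inserted by an operator", which is what keeps the handler simple. How the broker would notify the controller about the partitions on a removed drive is deliberately left out.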
I was actually wondering if some users would be interested in leaving a
fraction of the drives unused so that they can replace faulty drives over time.
The advantage is being able to maintain the capacity of a broker despite faulty
drives; the cost is that some IO capacity in the broker sits unused.
> JBOD Support
> ------------
>
> Key: KAFKA-2188
> URL: https://issues.apache.org/jira/browse/KAFKA-2188
> Project: Kafka
> Issue Type: Bug
> Reporter: Andrii Biletskyi
> Assignee: Andrii Biletskyi
> Attachments: KAFKA-2188.patch, KAFKA-2188.patch, KAFKA-2188.patch
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-18+-+JBOD+Support