[ https://issues.apache.org/jira/browse/KAFKA-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Maxime Brugidou updated KAFKA-691:
----------------------------------

    Attachment: KAFKA-691-v1.patch

Here is a first draft (v1) patch.

1. Added the property "producer.metadata.refresh.interval.ms", which defaults to 600000 (10 min).
2. The metadata is refreshed every 10 min (only if a message is sent), and the set of topics to refresh is tracked in the topicMetadataToRefresh Set (cleared after every refresh). I think the added value of refreshing regardless of partition availability is to detect new partitions.
3. The good news is that I didn't touch the Partitioner API. I only changed the code to use available partitions if the key is null (as suggested by Jun). It will also throw an UnknownTopicOrPartitionException("No leader for any partition") if no partition is available at all.

Let me know what you think about this patch. I ran a producer with that code successfully and tested with a broker down.

I now have some concerns about the consumer: the refresh.leader.backoff.ms config could help me (if I increase it to, say, 10 min), BUT the rebalance fails in any case since there is no leader for some partitions. I don't have a good workaround for that yet; any help/suggestion appreciated.

> Fault tolerance broken with replication factor 1
> ------------------------------------------------
>
>                 Key: KAFKA-691
>                 URL: https://issues.apache.org/jira/browse/KAFKA-691
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Jay Kreps
>         Attachments: KAFKA-691-v1.patch
>
>
> In 0.7 if a partition was down we would just send the message elsewhere. This
> meant that the partitioning was really more of a "stickiness" than a hard
> guarantee. This made it impossible to depend on it for partitioned, stateful
> processing.
> In 0.8 when running with replication this should not be a problem generally
> as the partitions are now highly available and fail over to other replicas.
> However, with replication factor = 1 this no longer really works for most
> cases, as a dead broker will now give errors for that broker.
> I am not sure of the best fix. Intuitively I think this is something that
> should be handled by the Partitioner interface. However, currently the
> partitioner has no knowledge of which nodes are available. So you could use a
> random partitioner, but that would keep going back to the down node.
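
To make the null-key behaviour from point 3 of the comment concrete, here is a rough sketch of the idea. It is not the code in KAFKA-691-v1.patch. The class AvailablePartitionSketch, the method choosePartition, and the local NoLeaderException (a stand-in for UnknownTopicOrPartitionException) are all hypothetical names, and representing the metadata as a list of leader ids (null meaning "no leader") is an assumption made for illustration. Hashing keyed messages over all partitions is likewise an assumption about how the unchanged Partitioner path behaves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative sketch only; all names here are hypothetical. */
final class AvailablePartitionSketch {

    /** Local stand-in for Kafka's UnknownTopicOrPartitionException. */
    static final class NoLeaderException extends RuntimeException {
        NoLeaderException(String message) {
            super(message);
        }
    }

    /**
     * @param leaders index = partition id, value = leader broker id,
     *                or null when the partition currently has no leader
     * @param key     message key, may be null
     * @return the partition id to send to
     */
    static int choosePartition(List<Integer> leaders, Object key, Random random) {
        if (leaders.isEmpty()) {
            throw new NoLeaderException("No partitions known for topic");
        }
        if (key != null) {
            // Keyed message: hash over ALL partitions so the key-to-partition
            // mapping stays stable even while some partitions are down.
            return Math.floorMod(key.hashCode(), leaders.size());
        }
        // Null key: pick at random among partitions that have a leader.
        List<Integer> available = new ArrayList<>();
        for (int partition = 0; partition < leaders.size(); partition++) {
            if (leaders.get(partition) != null) {
                available.add(partition);
            }
        }
        if (available.isEmpty()) {
            throw new NoLeaderException("No leader for any partition");
        }
        return available.get(random.nextInt(available.size()));
    }
}
```

With leaders given as Arrays.asList(0, null, 2), a null-key message can only land on partition 0 or 2, while a keyed message may still map to partition 1 and fail. That keeps the hard partitioning guarantee for keyed data that the issue description asks for.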