Ewen Cheslack-Postava created KAFKA-1843:
--------------------------------------------

             Summary: Metadata fetch/refresh in new producer should handle all 
node connection states gracefully
                 Key: KAFKA-1843
                 URL: https://issues.apache.org/jira/browse/KAFKA-1843
             Project: Kafka
          Issue Type: Bug
          Components: clients
    Affects Versions: 0.8.2
            Reporter: Ewen Cheslack-Postava


KAFKA-1642 resolved some issues with the handling of broker connection states 
to avoid high CPU usage, but made the minimal fix rather than the ideal one. 
The code for handling the metadata fetch is difficult to get right because it 
has to handle a lot of possible connectivity states and failure modes across 
all the known nodes. It also needs to integrate correctly with the surrounding 
event loop, providing poll() timeouts that both avoid busy looping and ensure 
the client wakes up to try new nodes after connection and request failures.
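
To make the event loop interaction concrete: the timeout handed to poll() ends 
up being a combination of the metadata refresh interval, any retry backoff from 
a failed fetch, and (once KAFKA-1842's connection timeouts exist) the connect 
deadline of whichever node was picked. A rough sketch of that computation, with 
hypothetical names rather than the actual NetworkClient code:

{code:java}
// Illustrative only: these names are made up, not the real NetworkClient fields.
public final class MetadataPollTimeout {

    /**
     * How long poll() may block before the metadata logic must run again.
     *
     * @param now               current time in ms
     * @param metadataExpiryMs  when the cached metadata is due for a refresh
     * @param lastFailureMs     time of the last failed fetch attempt (or Long.MIN_VALUE)
     * @param retryBackoffMs    backoff applied after a failed attempt
     * @param nodeConnecting    true if the node chosen for the fetch is still connecting
     * @param connectDeadlineMs when that connection attempt times out
     */
    public static long pollTimeoutMs(long now, long metadataExpiryMs, long lastFailureMs,
                                     long retryBackoffMs, boolean nodeConnecting,
                                     long connectDeadlineMs) {
        // Don't refetch before the refresh interval or the retry backoff has elapsed;
        // returning 0 here over and over is exactly the busy looping we want to avoid.
        long untilRefresh = Math.max(0, metadataExpiryMs - now);
        long untilBackoffDone = Math.max(0, lastFailureMs + retryBackoffMs - now);
        long timeout = Math.max(untilRefresh, untilBackoffDone);

        if (nodeConnecting) {
            // If we are waiting on a connection attempt, wake up no later than its
            // timeout so a hung connect can't stall the refresh (issue 1 below).
            timeout = Math.min(timeout, Math.max(0, connectDeadlineMs - now));
        }
        return timeout;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // Node still connecting, connect deadline 500 ms away, refresh due in 5 minutes.
        System.out.println(pollTimeoutMs(now, now + 300_000, Long.MIN_VALUE, 100, true, now + 500));
    }
}
{code}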

A patch here should address a few issues:
1. Make sure connection timeouts, as implemented in KAFKA-1842, are cleanly 
integrated. This mostly means that when a connecting node is selected to fetch 
metadata from, the code notices this and sets the next poll() timeout based on 
the connection timeout rather than some other backoff.
2. Rethink the logic and naming of NetworkClient.leastLoadedNode. That method 
actually takes into account a) the current connectivity of each node, b) 
whether the node had a recent connection failure, c) the "load" in terms of in 
flight requests. It also needs to ensure that different clients don't use the 
same ordering across multiple calls (which is already addressed in the current 
code by nodeIndexOffset) and that we always eventually try all nodes in the 
face of connection failures (which isn't currently handled by leastLoadedNode 
and probably cannot be without tracking additional state). This method also has 
to work for new consumer use cases even though it is currently only used by the 
new producer's metadata fetch. Finally, it has to behave correctly when other 
code calls initiateConnect(), since the normal path for sending messages also 
initiates connections.

We can already say that there is an order of preference for a single call (as 
follows), but making this work across multiple calls, when some initial choices 
fail to connect or return metadata *and* connection states may be changing, is 
much more difficult (a sketch of this ranking appears after the issue list 
below).
 * Connected, zero in flight requests - the request can be sent immediately
 * Connecting node - it will hopefully be connected very soon and by definition 
has no in flight requests
 * Disconnected - same reasoning as for a connecting node
 * Connected, > 0 in flight requests - we treat any number of in flight 
requests as a big enough backlog to delay the request significantly.
We could use an approach that accounts for the actual number of in flight 
requests rather than reducing it to a boolean, but that probably introduces 
more complexity than it is worth.

3. The most difficult case to handle so far has been when leastLoadedNode 
returns a disconnected node to maybeUpdateMetadata as its best option. Properly 
handling the two resulting cases (initiateConnect fails immediately vs. taking 
some time to possibly establish the connection) is tricky; a sketch of this 
branch appears at the end of this description.
4. Consider optimizing for the failure cases. The most common cases are that 
you already have an active connection and can fetch the metadata immediately, 
or that you need to establish a connection but the connect and metadata 
request/response complete very quickly. Metadata refreshes are infrequent 
enough (every 5 minutes by default) that establishing an extra connection in 
those cases isn't a big deal as long as it's eventually cleaned up. The edge 
cases, like network partitions where some subset of nodes becomes unreachable 
for a long period, are harder to reason about, but we should be sure we can 
always recover from them gracefully.
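
To make the preference order above concrete, a minimal sketch of a 
leastLoadedNode-style ranking follows. It is illustrative only: Node and 
ConnectionState here are simplified stand-ins rather than the real client 
types, and it deliberately ignores the multi-call and backoff concerns raised 
in issue 2.

{code:java}
import java.util.Arrays;
import java.util.List;

// Illustrative only: simplified stand-ins, not the real NetworkClient types.
public final class NodePreference {

    enum ConnectionState { CONNECTED, CONNECTING, DISCONNECTED }

    static final class Node {
        final String id;
        final ConnectionState state;
        final int inFlightRequests;
        final boolean recentConnectionFailure; // failed within the reconnect backoff

        Node(String id, ConnectionState state, int inFlightRequests,
             boolean recentConnectionFailure) {
            this.id = id;
            this.state = state;
            this.inFlightRequests = inFlightRequests;
            this.recentConnectionFailure = recentConnectionFailure;
        }
    }

    /** Lower rank is more preferred, following the order listed above. */
    static int rank(Node n) {
        if (n.recentConnectionFailure)
            return 4; // recently failed: only as a last resort
        if (n.state == ConnectionState.CONNECTED && n.inFlightRequests == 0)
            return 0; // request can be sent immediately
        if (n.state == ConnectionState.CONNECTING)
            return 1; // should be usable soon, nothing queued on it
        if (n.state == ConnectionState.DISCONNECTED)
            return 2; // same reasoning, but a connection must be initiated first
        return 3;     // connected but already has a backlog of in flight requests
    }

    /**
     * Pick the best node, starting the scan at a per-client offset (the role
     * nodeIndexOffset plays today) so different clients don't all pick the
     * same broker.
     */
    static Node pickNode(List<Node> nodes, int nodeIndexOffset) {
        Node best = null;
        for (int i = 0; i < nodes.size(); i++) {
            Node candidate = nodes.get((i + nodeIndexOffset) % nodes.size());
            if (best == null || rank(candidate) < rank(best))
                best = candidate;
        }
        return best;
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(
                new Node("0", ConnectionState.CONNECTED, 3, false),
                new Node("1", ConnectionState.DISCONNECTED, 0, false),
                new Node("2", ConnectionState.CONNECTING, 0, false));
        System.out.println("picked node " + pickNode(nodes, 1).id); // prints "picked node 2"
    }
}
{code}

A static ranking like this is exactly what breaks down across repeated calls: 
without remembering which candidates have already failed, the same bad node can 
keep winning, which is why additional state probably has to be tracked.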

KAFKA-1642 enumerated the possible outcomes of a single call to 
maybeUpdateMetadata. A good fix for this would consider all of those outcomes 
for repeated calls to maybeUpdateMetadata.
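
For issue 3 in particular, here is a rough sketch of how the disconnected-node 
branch and the poll() timeout returned for each outcome could be kept straight. 
Everything here is hypothetical and simplified (the names, constants, and 
helper methods are made up); it is not the actual maybeUpdateMetadata 
implementation.

{code:java}
// Hypothetical, simplified sketch of the disconnected-node branch in the
// metadata update path; the abstract methods stand in for NetworkClient state.
public abstract class MetadataUpdateSketch {

    protected final long retryBackoffMs = 100;
    protected final long reconnectBackoffMs = 50;
    protected final long connectionTimeoutMs = 10_000;
    protected final long requestTimeoutMs = 30_000;

    protected abstract String leastLoadedNode(long now);
    protected abstract boolean isConnected(String node);
    protected abstract boolean isDisconnected(String node);
    protected abstract boolean canSendMoreRequests(String node);
    protected abstract boolean initiateConnect(String node, long now); // false if it fails immediately
    protected abstract void markConnectionFailed(String node, long now);
    protected abstract void sendMetadataRequest(String node, long now);

    /** Returns how long the caller's poll() may block before we must run again. */
    public long maybeUpdateMetadata(long now) {
        String node = leastLoadedNode(now);
        if (node == null)
            return retryBackoffMs; // no candidate at all: back off and retry

        if (isConnected(node) && canSendMoreRequests(node)) {
            sendMetadataRequest(node, now); // common case: fetch immediately
            return requestTimeoutMs;        // wake up by the time the response is due
        }

        if (isDisconnected(node)) {
            if (!initiateConnect(node, now)) {
                // Connect failed synchronously (e.g. unresolvable address):
                // record the failure so the next call prefers another node,
                // and back off so the event loop doesn't spin on this node.
                markConnectionFailed(node, now);
                return reconnectBackoffMs;
            }
            // Connect is in flight: wake up no later than its timeout so a
            // hung connection attempt cannot stall the refresh indefinitely.
            return connectionTimeoutMs;
        }

        // Node is still connecting or has a request backlog: let poll() run
        // until either the connection completes or the backoff elapses.
        return Math.min(connectionTimeoutMs, retryBackoffMs);
    }
}
{code}

Across repeated calls the important invariants are that a synchronous connect 
failure marks the node so leastLoadedNode deprioritizes it, and that the 
returned timeout never lets the event loop sleep past the next deadline for the 
chosen node.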



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
