[
https://issues.apache.org/jira/browse/NIFI-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17615205#comment-17615205
]
ASF subversion and git services commented on NIFI-9878:
-------------------------------------------------------
Commit 9a4ce2607dfbf6e9a9731a19536aa7a4f5552ffd in nifi's branch
refs/heads/main from Jon Shoemaker
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=9a4ce2607d ]
NIFI-9878 Added timeout handling for Cache Client handshaking
This closes #6414
Co-authored-by: Nissim Shiman <[email protected]>
Co-authored-by: Jon Shoemaker <[email protected]>
Signed-off-by: David Handermann <[email protected]>
> DistributedCacheMap Handshake failure, processor hang indefinitely.
> -------------------------------------------------------------------
>
> Key: NIFI-9878
> URL: https://issues.apache.org/jira/browse/NIFI-9878
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Affects Versions: 1.15.3, 1.17.0, 1.16.3
> Reporter: Aaron Rich
> Assignee: Jon Shoemaker
> Priority: Major
> Labels: Handshake, distributed_cache
> Attachments:
> 0001-NIFI-9878-fix-for-hanging-client-thread-with-handsha.patch,
> image-2022-04-05-21-54-31-002.png, image-2022-04-05-21-55-16-221.png
>
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> When a DistributedCacheMapClient attempts to connect to a
> DistributedCacheMapServer, but the handshake response is never received by
> the client, the PutDistributedCacheMap processor will hang indefinitely. The
> handshake never times out.
> A situation like this can arise if a proxy allows the TCP connection to be
> established between client and server but fails to deliver handshake data
> to/from the DistributedCacheMapServer (for example, an unstable Istio
> service mesh between the two). It could also happen if a client was
> accidentally misconfigured to point to the wrong TCP endpoint (one that
> wasn't hosting a DistributedCacheMapServer).
> Steps to recreate:
> 1) Set up a PutDistributedCacheMap processor with a
> DistributedMapCacheClientService
> 2) Configure DistributedMapCacheClientService to point to a TCP server that
> is not a DistributedCacheMapServer (nc -lk 127.0.0.1 4457). This simulates
> a situation where the socket connection can be made but there is no handshake
> response from the server (for example, the server is in a bad state and
> unable to respond, a proxy is misbehaving, etc.).
> 3) Use GenerateFlowFile to trigger the PutDistributedCacheMap processor.
> 4) The processor will hang with no failure or success. It will have to be
> force terminated.
> !image-2022-04-05-21-54-31-002.png!
> !image-2022-04-05-21-55-16-221.png!
> The hang occurs at:
> CacheClientRequestHandler.java:92: handshakeHandler.waitHandshakeComplete();
>
> Currently, the "connection timeout" parameter is only used to time out the
> establishment of the TCP socket connection, not the full application-layer
> connection.
> Suggestion:
> The handshake should also have a timeout, so that the client is robust to a
> network outage in which the TCP connection can be created but the handshake
> data can't be exchanged. Because the processor hangs, a dataflow currently
> has no way to handle this error.
>
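
For illustration only, the suggestion quoted above (bounding the wait for the
handshake) can be sketched in plain Java. This is not the code from the commit
referenced in this message and does not use NiFi's classes: HandshakeGate,
markComplete and awaitComplete are hypothetical names, and a CountDownLatch is
assumed as the signalling mechanism between the I/O thread and the waiting
client thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HandshakeTimeoutSketch {

    /** Hypothetical stand-in for a handshake handler; not NiFi's actual class. */
    static final class HandshakeGate {
        private final CountDownLatch completed = new CountDownLatch(1);

        /** Would be called by the I/O thread once the server's handshake response arrives. */
        void markComplete() {
            completed.countDown();
        }

        /** Blocks until the handshake completes, or throws once the timeout elapses. */
        void awaitComplete(long timeout, TimeUnit unit)
                throws TimeoutException, InterruptedException {
            if (!completed.await(timeout, unit)) {
                throw new TimeoutException(
                        "Handshake not completed within " + timeout + " " + unit);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Case 1: the peer accepted the TCP connection but never answers,
        // so markComplete() is never called and the wait must give up.
        HandshakeGate silentPeer = new HandshakeGate();
        try {
            silentPeer.awaitComplete(200, TimeUnit.MILLISECONDS);
            System.out.println("unexpected: handshake completed");
        } catch (TimeoutException e) {
            System.out.println("gave up waiting: " + e.getMessage());
        }

        // Case 2: the handshake response arrives, so the wait returns normally.
        HandshakeGate respondingPeer = new HandshakeGate();
        respondingPeer.markComplete();
        respondingPeer.awaitComplete(200, TimeUnit.MILLISECONDS);
        System.out.println("handshake completed");
    }
}
```

With a bounded wait like this, the caller receives an exception instead of
blocking forever, which a processor could then surface as a failure that the
dataflow is able to handle.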
--
This message was sent by Atlassian Jira
(v8.20.10#820010)