[
https://issues.apache.org/jira/browse/NIFI-16011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Pierre Villard resolved NIFI-16011.
-----------------------------------
Fix Version/s: 2.10.0
Resolution: Fixed
> Repeated system test failures caused by LoadBalanceIT
> -----------------------------------------------------
>
> Key: NIFI-16011
> URL: https://issues.apache.org/jira/browse/NIFI-16011
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Reporter: Mark Payne
> Assignee: Mark Payne
> Priority: Major
> Fix For: 2.10.0
>
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> We are consistently seeing system test failures. Looking at the logs from
> GitHub Actions, it appears that LoadBalanceIT is always the first test to
> fail, with the failure then cascading to later tests. It seems that the end
> of the LoadBalanceIT.testPartitionByAttribute test performs a queue listing
> for each of the 100 expected FlowFiles, and each listing request then gets
> replicated across the cluster.
> This, in turn, causes connection pool exhaustion, resulting in
> {code:java}
> IOException: RST_STREAM received {code}
> This comes back to the caller as an HTTP 500 error.
> That test can be tightened up by producing 20 FlowFiles instead of 100. This
> will reduce the number of requests by 5x, giving us much more breathing room.
>
> After digging in, the reduction from 100 FlowFiles to 20 did not provide the
> resilience I was looking for. The issue appears to stem from changes made in
> the latest version of Jetty, which appears to have explicitly and
> intentionally changed how RST_STREAM resets are handled. Reverting the recent
> Jetty version change confirmed that system tests pass. Restoring the latest
> version brought the failures back. It is important to keep current with
> Jetty, however, and these issues do not appear to affect production
> instances. They affect system tests because system tests constantly restart
> containers while also firing off huge numbers of HTTP requests in very quick
> succession.
> To this end, the approach that I will take is to make the HTTP version used
> for intra-cluster communications configurable. We will default to HTTP_2,
> remaining backward compatible, but system tests can use HTTP 1.1 in order to
> avoid these failures. Running all system tests over HTTP 1.1 is not meant to
> be a permanent solution, but it is more desirable than the constant system
> test failures that we see currently.
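
The configurable-version approach described above can be illustrated with a minimal sketch. It is not the NiFi change itself: it uses the JDK's java.net.http.HttpClient purely for illustration, and the property name example.cluster.http.version and the class ClusterHttpVersionSketch are hypothetical names invented for this example. The only behavior taken from the description is that the default stays HTTP_2 and HTTP 1.1 can be selected explicitly.

```java
import java.net.http.HttpClient;
import java.time.Duration;
import java.util.Locale;
import java.util.Properties;

public class ClusterHttpVersionSketch {
    // Hypothetical property name, used only for this illustration.
    static final String VERSION_PROPERTY = "example.cluster.http.version";

    // Resolve the protocol version, defaulting to HTTP_2 when nothing is configured
    // so that existing configurations keep their current behavior.
    static HttpClient.Version resolveVersion(Properties props) {
        String configured = props.getProperty(VERSION_PROPERTY);
        if (configured == null || configured.isBlank()) {
            return HttpClient.Version.HTTP_2;
        }
        // Throws IllegalArgumentException for values that are not enum constants.
        return HttpClient.Version.valueOf(configured.trim().toUpperCase(Locale.ROOT));
    }

    // Build a client that prefers the resolved protocol version.
    static HttpClient buildClient(Properties props) {
        return HttpClient.newBuilder()
                .version(resolveVersion(props))
                .connectTimeout(Duration.ofSeconds(5))
                .build();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // No property set: the default applies.
        System.out.println(buildClient(props).version());

        // A test harness could opt in to HTTP 1.1 explicitly.
        props.setProperty(VERSION_PROPERTY, "HTTP_1_1");
        System.out.println(buildClient(props).version());
    }
}
```

Keeping the default in one resolver method means only the test configuration has to change, which matches the stated goal of remaining backward compatible.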
--
This message was sent by Atlassian Jira
(v8.20.10#820010)