[ 
https://issues.apache.org/jira/browse/NIFI-16011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Payne updated NIFI-16011:
------------------------------
    Description: 
We are consistently seeing system test failures. Looking at the logs from
GitHub Actions, it appears that LoadBalanceIT is always the first test to fail,
with the failures then cascading. It seems that the end of the
LoadBalanceIT.testPartitionByAttribute test performs a queue listing for
each of the 100 expected FlowFiles, and each of those requests is then
replicated across the cluster.

This, in turn, causes connection pool exhaustion, resulting in
{code:java}
IOException: RST_STREAM received {code}
which comes back as an HTTP 500 error.

That test can be tightened up by producing 20 FlowFiles instead of 100. This
will reduce the number of requests by a factor of five, giving us much more
breathing room.

 

After digging in, the reduction from 100 FlowFiles to 20 did not provide the
resilience I was looking for. The issue appears to stem from changes made in
the latest version of Jetty, which appears to have explicitly and intentionally
changed how RST_STREAM resets are handled. Reverting the recent Jetty version
change confirmed that the system tests pass, and restoring the latest version
confirmed that the failures return. It is important to keep current with Jetty,
however, and these issues do not appear to affect production instances. They
affect system tests because system tests constantly restart containers while
also firing off huge numbers of HTTP requests in very quick succession.

To this end, the approach that I will take is to make the HTTP version used
for intra-cluster communications configurable. We will default to HTTP_2,
remaining backward compatible, but system tests can use HTTP 1.1 in order to
avoid these failures. Running all system tests over HTTP 1.1 is not intended
as a permanent solution, but it is preferable to the constant system test
failures that we see currently.
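
As a rough illustration of the configurable-version idea described above, the
following is a minimal sketch only, not the actual NiFi change. It assumes the
JDK's java.net.http.HttpClient as a stand-in for whatever client is used for
intra-cluster requests, and the property name example.cluster.http.version is
hypothetical. An unset property falls back to HTTP_2, matching the stated
default.

```java
import java.net.http.HttpClient;
import java.util.Locale;
import java.util.Properties;

final class ClusterHttpVersion {
    // Hypothetical property name, used purely for illustration.
    static final String PROPERTY = "example.cluster.http.version";

    // Resolve the configured version; an unset or blank value keeps the HTTP_2 default.
    static HttpClient.Version resolve(Properties props) {
        String raw = props.getProperty(PROPERTY);
        if (raw == null || raw.isBlank()) {
            return HttpClient.Version.HTTP_2;
        }
        // Accepts the enum names HTTP_1_1 and HTTP_2, case-insensitively;
        // any other value throws IllegalArgumentException.
        return HttpClient.Version.valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }

    // Build a client that prefers the resolved protocol version.
    static HttpClient newClient(Properties props) {
        return HttpClient.newBuilder()
                .version(resolve(props))
                .build();
    }
}
```

Under these assumptions, a system test environment would set the property to
HTTP_1_1, while a deployment that leaves it unset would keep HTTP_2.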

  was:
We are consistently seeing system test failures. Looking at the logs from 
GitHub Actions, it appears that LoadBalanceIT is always the first one to fail, 
with the issue then cascading. It seems that the end of the 
LoadBalanceIT.testPartitionByAttribute test is performing a queue listing for 
each of the 100 expected FlowFiles, and this then gets replicated across the 
cluster.

This, in turn, causes connection pool exhaustion, resulting in
{code:java}
IOException: RST_STREAM received {code}
Which comes back as an HTTP 500 error.

That test can be tightened up by producing 20 FlowFiles instead of 100. This 
will reduce the number of requests by 5x, giving us much more breathing room.

 

After digging in, the reduction from 100 FlowFiles to 20 did not provide the 
resilience I was looking for. The issue appears to stem from changes made in 
the latest version of Jetty. It appears that they explicitly 


> Repeated system test failures caused by LoadBalanceIT
> -----------------------------------------------------
>
>                 Key: NIFI-16011
>                 URL: https://issues.apache.org/jira/browse/NIFI-16011
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> We are consistently seeing system test failures. Looking at the logs from
> GitHub Actions, it appears that LoadBalanceIT is always the first test to
> fail, with the failures then cascading. It seems that the end of the
> LoadBalanceIT.testPartitionByAttribute test performs a queue listing for
> each of the 100 expected FlowFiles, and each of those requests is then
> replicated across the cluster.
> This, in turn, causes connection pool exhaustion, resulting in
> {code:java}
> IOException: RST_STREAM received {code}
> which comes back as an HTTP 500 error.
> That test can be tightened up by producing 20 FlowFiles instead of 100. This
> will reduce the number of requests by a factor of five, giving us much more
> breathing room.
>  
> After digging in, the reduction from 100 FlowFiles to 20 did not provide the
> resilience I was looking for. The issue appears to stem from changes made in
> the latest version of Jetty, which appears to have explicitly and
> intentionally changed how RST_STREAM resets are handled. Reverting the recent
> Jetty version change confirmed that the system tests pass, and restoring the
> latest version confirmed that the failures return. It is important to keep
> current with Jetty, however, and these issues do not appear to affect
> production instances. They affect system tests because system tests
> constantly restart containers while also firing off huge numbers of HTTP
> requests in very quick succession.
> To this end, the approach that I will take is to make the HTTP version used
> for intra-cluster communications configurable. We will default to HTTP_2,
> remaining backward compatible, but system tests can use HTTP 1.1 in order to
> avoid these failures. Running all system tests over HTTP 1.1 is not intended
> as a permanent solution, but it is preferable to the constant system test
> failures that we see currently.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
