[jira] [Commented] (CAMEL-19491) Failing healthcheck on aws2-sqs causes readiness check to be stuck

Simon Rasmussen (Jira) Mon, 17 Jul 2023 01:19:28 -0700


    [ 
https://issues.apache.org/jira/browse/CAMEL-19491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17743669#comment-17743669
 ]


Simon Rasmussen commented on CAMEL-19491:
-----------------------------------------

We will try to upgrade to 3.20.6, but judging from the release notes at 
https://camel.apache.org/releases/release-3.20.6/, there are no changes which 
could have fixed this...

It is not easy to reproduce this, as it requires a call to `listQueues` to 
actually fail. Maybe we can instruct `AmazonSQSClientMock` to fail in this way, 
or test this by temporarily revoking the `listQueues` permission for the 
user calling it.
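
A minimal sketch of the kind of test double this would need: a client whose 
`listQueues` call throws a fixed number of times and then succeeds, so that a 
check can be observed going DOWN and (ideally) back UP. The names `QueueLister`, 
`FlakyQueueLister` and `check` are hypothetical and exist only for this 
illustration; they are not Camel or AWS SDK APIs, and a real test would have to 
wire an equivalent failing client into `Sqs2ConsumerHealthCheck`.

```java
import java.util.List;

public class ListQueuesFailureSketch {

    /** Hypothetical stand-in for the single SQS call the health check makes. */
    interface QueueLister {
        List<String> listQueues();
    }

    /** Test double: throws on the first N calls, succeeds on every call after. */
    static final class FlakyQueueLister implements QueueLister {
        private int remainingFailures;

        FlakyQueueLister(int failures) {
            this.remainingFailures = failures;
        }

        @Override
        public List<String> listQueues() {
            if (remainingFailures > 0) {
                remainingFailures--;
                throw new IllegalStateException("simulated listQueues failure");
            }
            return List.of("queue_name");
        }
    }

    /**
     * Hypothetical check (an assumption, not Camel's implementation):
     * it re-evaluates on every invocation and keeps no state between calls.
     */
    static String check(QueueLister lister) {
        try {
            lister.listQueues();
            return "UP";
        } catch (RuntimeException e) {
            return "DOWN";
        }
    }

    public static void main(String[] args) {
        QueueLister lister = new FlakyQueueLister(1);
        System.out.println(check(lister)); // first call fails: DOWN
        System.out.println(check(lister)); // second call succeeds: UP
    }
}
```

A check that behaves like this recovers as soon as the call succeeds again. What 
we see in production is the opposite: the status stays DOWN after a single 
failure.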

We will do a bit more digging on this issue and report back.

[~rhuanrocha], I don't think your analysis of this issue is correct. 
reConnectToQueue() is used only in the case of a QueueDoesNotExistException, 
which occurs when a queue has not been created yet. An attempt is then made to 
create it (if there are sufficient permissions to do so). This is definitely 
not what we are being hit by. Our application runs for days, and then a single 
request to listQueues fails in the health check. This causes the health check 
to never be called again, and Camel is reported as DOWN. I'm guessing that 
something in Camel's own health check infrastructure is incorrect, which 
causes it not to retry the health check. Meanwhile, polling happily continues...

> Failing healthcheck on aws2-sqs causes readiness check to be stuck
> ------------------------------------------------------------------
>
>                 Key: CAMEL-19491
>                 URL: https://issues.apache.org/jira/browse/CAMEL-19491
>             Project: Camel
>          Issue Type: Bug
>          Components: camel-aws2
>    Affects Versions: 3.20.5
>            Reporter: Simon Rasmussen
>            Priority: Major
>              Labels: easy, help-wanted
>
> If a Sqs2ConsumerHealthCheck returns DOWN, then it will not recover back to 
> UP, despite the consumer polling messages and processing them successfully.
> Detected on SQS, but likely all aws2 components are affected by this.
> We have experienced this a few times in production now (on various Camel 
> 3.20.x versions), including 3.20.5.
> Actual output from our readiness check:
> {noformat}
> {"status":"DOWN","components":{"camelHealth":{"status":"DOWN","details":{"name":"camel-health-check","consumer:queue_name":"DOWN"}},"db":{"status":"UP","details":{"database":"MariaDB","validationQuery":"isValid()"}},"diskSpace":{"status":"UP","details":{"total":64411906048,"free":13194371072,"threshold":10485760,"exists":true}},"ping":{"status":"UP"},"readinessState":{"status":"UP"}}}{noformat}
> Notice how the health check prefix is absent: aws2-sqs-consumer-
> I noticed that the tests of this functionality are manually plumbing the 
> setup.
> I also see that Sqs2ConsumerHealthCheck extends AbstractHealthCheck, but 
> shouldn't this be ConsumerHealthCheck instead?
> My availability does not allow for attempting to fix this myself, thus I've 
> just created this ticket for now, maybe someone else is up for grabbing it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
