panhongan opened a new issue, #15468:
URL: https://github.com/apache/druid/issues/15468
### Affected Version
0.19.0 -> 0.23.0
### Description
When a Kafka ingestion task reaches the `maxRowsPerSegment` limit, it sends a
`CheckpointAction` to the supervisor.
The supervisor then performs the checkpoint by calling
`taskClient.pauseAsync()` and `taskClient.setEndOffsetsAsync()`.
During the `pause` stage, however, the supervisor receives this exception:
```
2023-11-30T03:40:40,645 WARN [IndexTaskClient-datasource1-1]
org.apache.druid.indexing.common.IndexTaskClient - Exception while sending
request
org.apache.druid.java.util.common.IAE: Received 400 Bad Request with body:
Can't pause, task is not in a pausable state (state: [PAUSED])
```
The supervisor then kills the ingestion task.
We have seen this issue in production on every version from 0.19.0 to 0.23.0.
It happens every day, but not for every datasource and not at a fixed time.
Can you help take a look at why this happens?
The exception comes from `SeekableStreamIndexTaskRunner::pause()`:
```
public Response pause() throws InterruptedException
{
  if (!(status == Status.PAUSED || status == Status.READING)) {
    return Response.status(Response.Status.BAD_REQUEST)
                   .entity(StringUtils.format("Can't pause, task is not in a pausable state (state: [%s])", status))
                   .build();
  }
  // ... remaining code omitted
}
```
According to the log message, the `status` value is `PAUSED`, so it is unclear
how the `if` branch could have been taken. This looks very odd.
I suspect that when the `if` condition was evaluated, the value was
`NOT_STARTED` or `STARTING`. From analyzing the code, it was most likely
`NOT_STARTED`.
Tracing the `pause()` call shows that it runs on thread `qtp1608217492-238`,
while the task runner thread is `task-runner-0-priority-0`, and this was the
first time that `qtp` thread invoked `pause()`. It looks like the `qtp` thread
did not see the latest value when it evaluated the `if` condition, but by the
time it built the response message it read the latest value, `PAUSED`.
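The following is a minimal sketch of how a check of this shape can report a state that looks pausable, assuming only that the field is read more than once while another thread updates it. It is not Druid code and not a confirmed diagnosis: `PauseCheckSketch`, `ScriptedStatus`, and both `pause...` methods are hypothetical names, and the concurrent update is simulated deterministically by a scripted sequence of values instead of a real second thread.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Druid.
final class PauseCheckSketch {
    enum Status { NOT_STARTED, STARTING, READING, PAUSED, PUBLISHING }

    // Stands in for a field that another thread updates between reads:
    // each read() returns the next scripted value, then repeats the last one.
    static final class ScriptedStatus {
        private final Iterator<Status> values;
        private Status last = Status.NOT_STARTED;

        ScriptedStatus(List<Status> script) {
            this.values = script.iterator();
        }

        Status read() {
            if (values.hasNext()) {
                last = values.next();
            }
            return last;
        }
    }

    // Same shape as the quoted check: the condition and the message
    // each read the field separately, so they can see different values.
    static String pauseWithRepeatedReads(ScriptedStatus status) {
        if (!(status.read() == Status.PAUSED || status.read() == Status.READING)) {
            return String.format(
                "Can't pause, task is not in a pausable state (state: [%s])", status.read());
        }
        return "OK";
    }

    // Reads the field once, so the check and the message always agree.
    static String pauseWithSnapshot(ScriptedStatus status) {
        Status snapshot = status.read();
        if (!(snapshot == Status.PAUSED || snapshot == Status.READING)) {
            return String.format(
                "Can't pause, task is not in a pausable state (state: [%s])", snapshot);
        }
        return "OK";
    }

    public static void main(String[] args) {
        List<Status> script = List.of(Status.STARTING, Status.STARTING, Status.PAUSED);
        // Reports state: [PAUSED] although the check saw STARTING.
        System.out.println(pauseWithRepeatedReads(new ScriptedStatus(script)));
        // Reports state: [STARTING], the value the check actually used.
        System.out.println(pauseWithSnapshot(new ScriptedStatus(script)));
    }
}
```

With separate reads, a transition from `READING` to `PAUSED` between the two comparisons would also make the condition fail even though both values are pausable. Copying the field into a local variable before the check removes that window and makes the logged state match the state that was tested.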
As for `volatile`: `Status` is an object type rather than a primitive type,
so, as I understand it, `volatile` only ensures that the `status` reference is
read from main memory every time; it does not ensure that the object itself is
up to date.
This is my analysis; please double-check it.
Bug fix: https://github.com/apache/druid/pull/15473