## Phenomena
1. Expired indices were never removed, unless the OAP cluster was rebooted or scaled down to a single instance. The following log appeared in all OAP pod logs: `The selected first getAddress is
xxx.xxx.xx.xx:port. The remove stage is skipped.`
2. The OAP pod can't finish rebooting, and endlessly rolls the log `table:
xxx does not exist. OAP is running in 'no-init' mode, waiting... retry
3s later.`

![image](https://user-images.githubusercontent.com/5441976/190991213-720f0026-a7f7-4050-b62b-103812757a8d.png)

## Root cause
This bug has existed for years, dating back to Oct. 2018. The TTL timer
expected `queryRemoteNodes` to always return a consistently ordered OAP
instance list, so that exactly one OAP node would be selected to take
responsibility for removing expired indices and creating the latest
(today's) indices when rolling.

But, as has now been confirmed, the k8s coordinator does not return the
instance list in a stable order, so it is possible that no OAP node is
ever selected, and the TTL timer never actually runs.
In this case, most indices are still created normally, because new
telemetry data triggers index creation automatically. But
1. Expired indices are never removed.
2. Features that are not in use may have no new telemetry data, so no
index for the latest date is created. On reboot, OAP verifies and
expects the latest data index to exist, which leads to Phenomenon <2>.
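The fix direction can be sketched as follows: instead of trusting the coordinator's ordering, sort the instance list deterministically before picking the "first" node, so every OAP instance agrees on the same leader. This is a minimal illustrative sketch; the `RemoteNode` record and `selectLeader` method are hypothetical names, not SkyWalking's real API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TtlLeaderSelection {
    // Hypothetical stand-in for an OAP cluster member entry.
    record RemoteNode(String host, int port, boolean self) {}

    // Sort the list by (host, port) before taking the first element, so the
    // result is independent of the order the coordinator returned nodes in.
    static RemoteNode selectLeader(List<RemoteNode> nodes) {
        List<RemoteNode> sorted = new ArrayList<>(nodes);
        sorted.sort(Comparator.comparing(RemoteNode::host)
                              .thenComparing(RemoteNode::port));
        return sorted.get(0);
    }

    public static void main(String[] args) {
        // Two OAP instances may see the same membership in different orders,
        // e.g. when the k8s coordinator lists endpoints non-deterministically.
        List<RemoteNode> seenByA = List.of(
            new RemoteNode("10.0.0.2", 11800, false),
            new RemoteNode("10.0.0.1", 11800, true));
        List<RemoteNode> seenByB = List.of(
            new RemoteNode("10.0.0.1", 11800, false),
            new RemoteNode("10.0.0.2", 11800, true));
        // Both instances agree on the same leader after sorting.
        System.out.println(selectLeader(seenByA).host()); // 10.0.0.1
        System.out.println(selectLeader(seenByB).host()); // 10.0.0.1
    }
}
```

With an unstable ordering, each instance comparing itself against the raw first entry can conclude "someone else is the leader", leaving no node to run the remove stage; sorting removes that ambiguity.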

## Fix
The pull request to fix this is https://github.com/apache/skywalking/pull/9632.

Sheng Wu 吴晟
Twitter, wusheng1108
