keith-turner opened a new pull request, #4048:
URL: https://github.com/apache/accumulo/pull/4048
There was a race condition between the two threads in the tablet group watcher
that caused a tablet to be marked as assigned to a dead tablet server when that
was not true.
This commit fixes that race condition, enables a test that was failing
because of it, and adds the thread id to test logging. The thread id was
useful in confirming this bug so it will probably be useful in the future.
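As a rough illustration of why the thread id helps, here is a minimal sketch that prints log lines in the same shape as the test output (timestamp, thread id, logger, level, message). The class `ThreadIdLogLine` and its `format` helper are hypothetical and are not part of this PR. In a real Log4j 2 setup the thread id would normally come from the layout pattern (for example the `%tid` converter) rather than hand-written code; whether the PR uses that converter is not stated here.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class ThreadIdLogLine {
  // Timestamp layout matching the log excerpts: 2023-12-09T16:04:54,439
  private static final DateTimeFormatter TS =
      DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss,SSS");

  // Hypothetical helper: "<timestamp> <thread id> [<logger>] <level>: <message>"
  static String format(LocalDateTime when, long threadId, String logger, String level,
      String msg) {
    return String.format("%s %d [%s] %s: %s", TS.format(when), threadId, logger, level, msg);
  }

  public static void main(String[] args) throws InterruptedException {
    // Two threads emit the same message; only the thread id tells them apart.
    Runnable task = () -> System.out.println(format(LocalDateTime.now(),
        Thread.currentThread().getId(), "manager.Manager", "TRACE", "processing extent 1<<"));
    Thread a = new Thread(task);
    Thread b = new Thread(task);
    a.start();
    b.start();
    a.join();
    b.join();
  }
}
```

With the id in each line, interleaved output from two threads that handle the same tablet can be attributed to the thread that produced it.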
Below are some log messages from tracking this bug down. The number after
the timestamp is the thread id. Thread 47 assigns tablet 1<<, and then
thread 46 assumes it is assigned to a dead tablet server. However, it was not a
dead tserver; it had just started because the test restarted tservers.
```
2023-12-09T16:04:54,439 47 [manager.Manager] TRACE: [Normal Tablets]
Shutting down all Tservers: false, dependentCount: null Extent: 1<<, state:
UNASSIGNED, goal: HOSTED actions:[NEEDS_LOCATION_UPDATE]
2023-12-09T16:04:54,443 47 [tablet.location] DEBUG: Assigned 1<< to
localhost:37775[10006c85a80001a]
2023-12-09T16:04:54,452 47 [manager.Manager] TRACE: [Normal Tablets]
Shutting down all Tservers: false, dependentCount: null Extent: 1<<, state:
ASSIGNED, goal: HOSTED actions:[NEEDS_LOCATION_UPDATE]
2023-12-09T16:04:55,514 46 [manager.Manager] TRACE: [Normal Tablets]
Shutting down all Tservers: false, dependentCount: null Extent: 1<<, state:
ASSIGNED_TO_DEAD_SERVER, goal: HOSTED actions:[NEEDS_LOCATION_UPDATE]
2023-12-09T16:04:55,515 46 [manager.Manager] DEBUG: 1 assigned to dead
servers:
[TabletMetadata[tableId=1,prevEndRow=<null>,sawPrevEndRow=true,oldPrevEndRow=<null>,sawOldPrevEndRow=false,endRow=<null>,location=Location
[server=localhost:37775[10006c85a80001a],
type=CURRENT],files={},scans=[],loadedFiles={},fetchedCols=[LOCATION, PREV_ROW,
FILES, LAST, DIR, LOGS, SUSPEND, ECOMP, HOSTING_GOAL, HOSTING_REQUESTED, OPID,
SELECTED],extent=1<<,last=Location [server=localhost:37775[10006c85a80001a],
type=LAST],suspend=<null>,dirName=default_tablet,time=<null>,cloned=<null>,flush=OptionalLong.empty,logs=[],splitRatio=<null>,extCompactions={},goal=ONDEMAND,onDemandHostingRequested=true,operationId=<null>,selectedFiles=<null>,futureAndCurrentLocationSet=false]]...
2023-12-09T16:04:55,536 46 [tablet.location] DEBUG: Suspended 1<< to
localhost:37775 at 54231 ms with 1 walogs
2023-12-09T16:04:55,652 46 [manager.Manager] TRACE: [Normal Tablets]
Shutting down all Tservers: false, dependentCount: null Extent: 1<<, state:
SUSPENDED, goal: HOSTED actions:[NEEDS_LOCATION_UPDATE]
```
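The description does not spell out the exact mechanism of the race. One plausible shape, offered purely as an assumption, is that one thread classifies a tablet against a view of live tservers that is older than the assignment made by the other thread. The sketch below is hypothetical (`StaleSnapshotSketch`, `classify`, and the simplified `State` enum are illustrative names, not Accumulo code) and only shows how a stale view could turn a freshly assigned tablet into `ASSIGNED_TO_DEAD_SERVER`.

```java
import java.util.Set;

public class StaleSnapshotSketch {
  // Simplified subset of the states seen in the log excerpts.
  enum State { UNASSIGNED, ASSIGNED, ASSIGNED_TO_DEAD_SERVER }

  // Hypothetical classifier: a located tablet is considered healthy only if
  // its server appears in the set of live servers the caller passes in.
  static State classify(String location, Set<String> liveServers) {
    if (location == null) {
      return State.UNASSIGNED;
    }
    return liveServers.contains(location) ? State.ASSIGNED : State.ASSIGNED_TO_DEAD_SERVER;
  }

  public static void main(String[] args) {
    String server = "localhost:37775[10006c85a80001a]";

    // Assumption: this view was captured before the tserver restarted.
    Set<String> staleView = Set.of();
    // A view captured after the restart includes the new server.
    Set<String> freshView = Set.of(server);

    // Same tablet location, different answers depending on the view used.
    System.out.println("stale view: " + classify(server, staleView));
    System.out.println("fresh view: " + classify(server, freshView));
  }
}
```

Under this assumption, the tablet's location is valid in both cases; only the age of the live-server view differs, which matches the symptom of a just-started tserver being treated as dead.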