keith-turner opened a new pull request, #4048:
URL: https://github.com/apache/accumulo/pull/4048

   There was a race condition between the two threads in the tablet group 
watcher that caused a tablet to be marked as assigned to a dead tablet server 
when that was not true.
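
The description does not spell out the mechanism of the race, so the following is only a hypothetical illustration, not the Accumulo code. It assumes one thread classifies a tablet against a snapshot of live tablet servers taken before a server restarted, while the other thread has already assigned the tablet to the restarted server. The names `StaleSnapshotSketch`, `classify`, and the snapshot sets are invented for this sketch; only the state names and the server string come from the log output in this description.

```java
import java.util.Set;

public class StaleSnapshotSketch {
    // State names mirror those seen in the log output; the enum itself is hypothetical.
    enum TabletState { UNASSIGNED, ASSIGNED, ASSIGNED_TO_DEAD_SERVER }

    // Hypothetical classifier: a tablet whose location is not in the given
    // set of live servers is treated as assigned to a dead server.
    static TabletState classify(String location, Set<String> liveServers) {
        if (location == null) {
            return TabletState.UNASSIGNED;
        }
        return liveServers.contains(location)
            ? TabletState.ASSIGNED
            : TabletState.ASSIGNED_TO_DEAD_SERVER;
    }

    public static void main(String[] args) {
        String restarted = "localhost:37775[10006c85a80001a]";

        // Assumed: one thread holds a snapshot taken before the server came back.
        Set<String> staleSnapshot = Set.of();
        // Assumed: the other thread sees the restarted server and assigns to it.
        Set<String> freshSnapshot = Set.of(restarted);

        System.out.println(classify(restarted, freshSnapshot));
        System.out.println(classify(restarted, staleSnapshot));
    }
}
```

Under these assumptions the same location yields two different states depending on which view of the live servers is consulted, which is the kind of disagreement the log messages show between threads 47 and 46.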
   
   This commit fixes that race condition, enables a test that was failing 
because of it, and adds the thread id to test logging.  The thread id was 
useful in confirming this bug so it will probably be useful in the future.
   
   Below are some log messages from tracking this bug down.  The number after 
the timestamp is the thread id.  Thread 47 assigns tablet 1<< and then thread 
46 assumes it is assigned to a dead tablet server.  However, it was not a dead 
tserver; it had just started because the test restarted the tservers.
   
   ```
   2023-12-09T16:04:54,439 47 [manager.Manager] TRACE: [Normal Tablets] 
Shutting down all Tservers: false, dependentCount: null Extent: 1<<, state: 
UNASSIGNED, goal: HOSTED actions:[NEEDS_LOCATION_UPDATE]
   2023-12-09T16:04:54,443 47 [tablet.location] DEBUG: Assigned 1<< to 
localhost:37775[10006c85a80001a]
   2023-12-09T16:04:54,452 47 [manager.Manager] TRACE: [Normal Tablets] 
Shutting down all Tservers: false, dependentCount: null Extent: 1<<, state: 
ASSIGNED, goal: HOSTED actions:[NEEDS_LOCATION_UPDATE]
   2023-12-09T16:04:55,514 46 [manager.Manager] TRACE: [Normal Tablets] 
Shutting down all Tservers: false, dependentCount: null Extent: 1<<, state: 
ASSIGNED_TO_DEAD_SERVER, goal: HOSTED actions:[NEEDS_LOCATION_UPDATE]
   2023-12-09T16:04:55,515 46 [manager.Manager] DEBUG: 1 assigned to dead 
servers: 
[TabletMetadata[tableId=1,prevEndRow=<null>,sawPrevEndRow=true,oldPrevEndRow=<null>,sawOldPrevEndRow=false,endRow=<null>,location=Location
 [server=localhost:37775[10006c85a80001a], 
type=CURRENT],files={},scans=[],loadedFiles={},fetchedCols=[LOCATION, PREV_ROW, 
FILES, LAST, DIR, LOGS, SUSPEND, ECOMP, HOSTING_GOAL, HOSTING_REQUESTED, OPID, 
SELECTED],extent=1<<,last=Location [server=localhost:37775[10006c85a80001a], 
type=LAST],suspend=<null>,dirName=default_tablet,time=<null>,cloned=<null>,flush=OptionalLong.empty,logs=[],splitRatio=<null>,extCompactions={},goal=ONDEMAND,onDemandHostingRequested=true,operationId=<null>,selectedFiles=<null>,futureAndCurrentLocationSet=false]]...
   2023-12-09T16:04:55,536 46 [tablet.location] DEBUG: Suspended 1<< to 
localhost:37775 at 54231 ms with 1 walogs
   2023-12-09T16:04:55,652 46 [manager.Manager] TRACE: [Normal Tablets] 
Shutting down all Tservers: false, dependentCount: null Extent: 1<<, state: 
SUSPENDED, goal: HOSTED actions:[NEEDS_LOCATION_UPDATE]
   ```
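
When reading logs like these, it can help to pull out the thread id and tablet state per line so that transitions for one extent can be lined up across threads. The following is a minimal sketch that assumes the line layout shown above (timestamp, thread id, logger in brackets, level, message) and assumes each log entry has been rejoined onto a single line. The class `TabletLogLine` and its regular expression are hypothetical helpers, not part of Accumulo.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TabletLogLine {
    // Assumed layout: "<timestamp> <threadId> [<logger>] <LEVEL>: ... Extent: <extent>, state: <STATE>, ..."
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) (\\d+) \\[[^\\]]+\\] \\w+: .*Extent: ([^,]+), state: (\\w+),.*$");

    final String timestamp;
    final long threadId;
    final String extent;
    final String state;

    private TabletLogLine(String timestamp, long threadId, String extent, String state) {
        this.timestamp = timestamp;
        this.threadId = threadId;
        this.extent = extent;
        this.state = state;
    }

    // Returns empty for lines that do not carry an "Extent: ..., state: ..." message.
    static Optional<TabletLogLine> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new TabletLogLine(
            m.group(1), Long.parseLong(m.group(2)), m.group(3), m.group(4)));
    }

    public static void main(String[] args) {
        String line = "2023-12-09T16:04:55,514 46 [manager.Manager] TRACE: [Normal Tablets] "
            + "Shutting down all Tservers: false, dependentCount: null Extent: 1<<, "
            + "state: ASSIGNED_TO_DEAD_SERVER, goal: HOSTED actions:[NEEDS_LOCATION_UPDATE]";
        parse(line).ifPresent(t ->
            System.out.println(t.timestamp + " thread=" + t.threadId
                + " extent=" + t.extent + " state=" + t.state));
    }
}
```

Lines without an extent and state, such as the `tablet.location` messages, are skipped by returning an empty `Optional`.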

