dtspence commented on issue #3368:
URL: https://github.com/apache/accumulo/issues/3368#issuecomment-1568854630

   When attempting to stop a single t-server (i.e. `bin/accumulo-cluster 
stop-here`) on a test system (Cluster-1), the command-line stop hung and the 
t-server continued to receive assignments. The monitor showed the t-server's 
assignment number cycling up and down. From the logs, the manager appeared to 
assign tablets back to the t-server being stopped. We attempted to reproduce 
the issue on a second test cluster (Cluster-2), and there the t-server shut 
down as expected. However, we were not sure whether the assignment on 
Cluster-2 that occurs after the t-server shutdown is expected or unexpected 
(expanded below).
   
   In what follows, Cluster-1 refers to the cluster whose t-servers hang on 
shutdown, and Cluster-2 refers to the other cluster.
   
   So far we have noticed two configuration differences between the systems:
   - Cluster-1 (the system experiencing the t-server shutdown hangs): Accumulo 
metadata and root tablets are spread across the system. 
table.suspend.duration=0s.
   - Cluster-2: Accumulo metadata and root tablets are pinned to specific 
hosts. table.suspend.duration=300s.
   
   The following was observed on Cluster-1 (t-server hanging during shutdown):
   ```
   - Seeding FATE[...] Shutdown tserver <shutdown-host:port> [...]
   - Tablet Server shutdown requested for <shutdown-host:port> [...]
   - tablet <tid;begin;end> was unloaded from <shutdown-host:port> [...]
   - tablet ... was unloaded on <shutdown-host:port> [...]
   - tablet ... was unloaded on <shutdown-host:port> [...]
   - tablet ... was unloaded on <shutdown-host:port> [...]
   - ...
   - Sending 1 tablets to balancer for table accumulo.metadata for assignment 
within t-servers [..., <shutdown-host:port>, ...]
   - Assigning 1 tablets
   - Assigned !0,~del... to <shutdown-host:port> [....]
   - tablet ... was unloaded on <shutdown-host:port> [...]
   - tablet ... was unloaded on <shutdown-host:port> [...]
   - tablet ... was unloaded on <shutdown-host:port> [...]
   - tablet <!0,~del> was loaded on <shutdown-host:port> [...]
   - tablet ... was unloaded on <shutdown-host:port> [...]
   - Sending 14 tablets to balancer for table <application-table> for 
assignment within t-servers [..., <shutdown-host:port>, ...]
   - ...
   - Assigning XXXX tablets
   - ...
   ```
   
   The Cluster-2 logs reflect the following:
   ```
   - tablet ... was unloaded on <shutdown-host:port> [...]
   - ...
   - tablet ... was unloaded on <shutdown-host:port> [...]
   - tablet server hosts no tablets <shutdown-host:port> [...]
   - unable to get tablet server status <shutdown-host:port> [...]
   - IOException: No connection to <shutdown-host:port> [...]
   - not balancing because the balance information is out of date
   - not balancing just yet, as collection of live tservers is in flux
   - Sending X tablets to balancer for table <user-table> for assignment within 
t-servers  [..., <shutdown-host:port>, ...]
   - Assigned <user-tablet-1> to  <shutdown-host:port>
   - ...
   - Assigned <user-tablet-n> to  <shutdown-host:port> [...]
   - Could not connect to server <shutdown-host:port> [...]
   - ....
   - Could not connect to server <shutdown-host:port> [...]
   - [Normal Tablets]: 10237 tablets are SUSPENDED
   - Detected change in current tserver set, re-running state machine.
   - not balancing because there are unhosted tablets: 10237
   - 9597 assigned to dead servers [<user-tablet-1@(Location 
[server=<shutdown-host:port>, 
type=FUTURE],null,Location[server=<shutdown-host:port>,type=LAST]),.....
   - Suspended <user-tablet-1> to <shutdown-host:port> at <> with 1 walogs
   - Sending X tablets to balancer for <user-table> for assignment within 
tservers [...]        
   - balancer assigned <user-tablet-> to <host-a:port> which is not the 
suggested location of <shutdown-host:port>
   ```
   
   The Cluster-2 logs above show the t-server shutting down after the tablet 
unload. However, we were not sure whether the assignment after the unload is 
expected or unexpected (and whether it is similar to what happens on 
Cluster-1). The tablets are assigned to the shut-down t-server (and the list of 
candidate t-servers still contains it). The manager then observes that the 
server is unreachable and, after tablet suspension, the candidate t-server list 
in the next assignment no longer includes the shut-down t-server.
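
   To make the pattern in question easier to spot in a large manager log, here 
is a minimal sketch that flags any assignment to a t-server made after its 
shutdown was requested. The regular expressions are assumptions modeled on the 
paraphrased log lines above, not the exact Accumulo log format, and the 
function name and sample hosts are hypothetical.

```python
import re

# Hypothetical patterns modeled on the paraphrased log lines in this comment;
# the real Accumulo manager log format may differ and would need adjusting.
SHUTDOWN_RE = re.compile(r"Tablet Server shutdown requested for (\S+)")
ASSIGNED_RE = re.compile(r"Assigned (\S+) to\s+(\S+)")


def assignments_after_shutdown(lines):
    """Return (server, tablet) pairs assigned after that server's shutdown request."""
    stopping = set()  # servers for which a shutdown request has been seen
    hits = []
    for line in lines:
        m = SHUTDOWN_RE.search(line)
        if m:
            stopping.add(m.group(1))
            continue
        m = ASSIGNED_RE.search(line)
        if m and m.group(2) in stopping:
            hits.append((m.group(2), m.group(1)))
    return hits


# Made-up sample lines in the same shape as the Cluster-1 excerpt.
sample = [
    "Tablet Server shutdown requested for host1:9997 [...]",
    "tablet 2;a;b was unloaded on host1:9997 [...]",
    "Assigned !0,~del to host1:9997 [...]",
    "Assigned 2;c;d to host2:9997 [...]",
]
print(assignments_after_shutdown(sample))
```

   On the sample lines this reports only the `!0,~del` assignment to 
`host1:9997`, which corresponds to the metadata tablet being handed back to 
the stopping t-server in the Cluster-1 excerpt.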

