dtspence commented on issue #3368: URL: https://github.com/apache/accumulo/issues/3368#issuecomment-1568854630
When attempting to stop a single t-server (i.e. `bin/accumulo-cluster stop-here`) on a test system (Cluster-1), the command-line stop hung and the t-server continued to receive assignments. The monitor showed the t-server's assignment count cycling up and down. From the logs, the manager appeared to assign tablets back to the t-server being stopped. We attempted to reproduce the issue on a second test cluster (Cluster-2), and there the t-server shut down as expected. However, we were not sure whether the assignment on Cluster-2 that occurs after the t-server shutdown is expected or unexpected (expanded below).

In what follows, Cluster-1 refers to the cluster whose t-servers show the shutdown issue, and Cluster-2 refers to the other cluster.

So far we have noticed two configuration differences between the systems:
- Cluster-1 = Accumulo metadata and root tablets are spread across the system (this is the cluster experiencing the t-server shutdown hangs). The tablet.suspend.duration=0s.
- Cluster-2 = Accumulo metadata and root tablets are pinned to specific hosts. The tablet.suspend.duration=300s.

The following was observed on Cluster-1 (t-server hanging during shutdown):
```
- Seeding FATE[...] Shutdown tserver <shutdown-host:port> [...]
- Tablet Server shutdown requested for <shutdown-host:port> [...]
- tablet <tid;begin;end> was unloaded from <shutdown-host:port> [...]
- tablet ... was unloaded on <shutdown-host:port> [...]
- tablet ... was unloaded on <shutdown-host:port> [...]
- tablet ... was unloaded on <shutdown-host:port> [...]
- ...
- Sending 1 tablets to balancer for table accumulo.metadata for assignment within t-servers [..., <shutdown-host:port>, ...]
- Assigning 1 tablets
- Assigned !0,~del... to <shutdown-host:port> [....]
- tablet ... was unloaded on <shutdown-host:port> [...]
- tablet ... was unloaded on <shutdown-host:port> [...]
- tablet ... was unloaded on <shutdown-host:port> [...]
- tablet <!0,~del> was loaded on <shutdown-host:port> [...]
- tablet ... was unloaded on <shutdown-host:port> [...]
- Sending 14 tablets to balancer for table <application-table> for assignment within t-servers [..., <shutdown-host:port>, ...]
- ...
- Assigning XXXX tablets
- ...
```

The Cluster-2 logs reflect the following:
```
- tablet ... was unloaded on <shutdown-host:port> [...]
- ...
- tablet ... was unloaded on <shutdown-host:port> [...]
- tablet server hosts no tablets <shutdown-host:port> [...]
- unable to get tablet server status <shutdown-host:port> [...]
- IOException: No connection to <shutdown-host:port> [...]
- not balancing because the balance information is out of date
- not balancing just yet, as collection of live tservers is in flux
- Sending X tablets to balancer for table <user-table> for assignment within t-servers [..., <shutdown-host:port>, ...]
- Assigned <user-tablet-1> to <shutdown-host:port>
- ...
- Assigned <user-tablet-n> to <shutdown-host:port> [...]
- Could not connect to server <shutdown-host:port> [...]
- ....
- Could not connect to server <shutdown-host:port> [...]
- [Normal Tablets]: 10237 tablets are SUSPENDED
- Detected change in current tserver set, re-running state machine.
- not balancing because there are unhosted tablets: 10237
- 9597 assigned to dead servers [<user-tablet-1@(Location [server=<shutdown-host:port>, type=FUTURE],null,Location[server=<shutdown-host:port>,type=LAST]),.....
- Suspended <user-tablet-1> to <shutdown-host:port> at <> with 1 walogs
- Sending X tablets to balancer for <user-table> for assignment within tservers [...]
- balancer assigned <user-tablet-> to <host-a:port> which is not the suggested location of <shutdown-host:port>
```

The Cluster-2 logs above show the t-server shutting down after the tablet unload. However, we were not sure whether the assignment after the unload is expected or unexpected (and whether it is the same behavior seen on Cluster-1). The assigned tablets are sent to the shutdown t-server, and the list of candidate t-servers still contains the shutdown t-server.
The manager then observes that the server is unreachable and, after tablet suspension, the candidate t-server list for the next assignment no longer includes the shutdown t-server.
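
To check other manager logs for the same pattern, a small script along these lines might help. It is only a sketch: the regular expressions are modeled on the abridged log excerpts above (real log lines carry timestamps, logger names, and session details, so the patterns may need adjusting), and the host names in the sample are hypothetical.

```python
import re

# Patterns modeled on the abridged excerpts in this comment; the exact wording
# and field layout of real manager log lines are assumptions.
SHUTDOWN_RE = re.compile(r"Tablet Server shutdown requested for (\S+)")
ASSIGNED_RE = re.compile(r"Assigned (\S+) to (\S+)")


def assignments_after_shutdown(lines):
    """Return (tablet, server) pairs assigned to a server after its shutdown was requested."""
    stopping = set()  # servers for which a shutdown request has been seen
    hits = []
    for line in lines:
        m = SHUTDOWN_RE.search(line)
        if m:
            stopping.add(m.group(1))
            continue
        m = ASSIGNED_RE.search(line)
        if m and m.group(2) in stopping:
            hits.append((m.group(1), m.group(2)))
    return hits


# Hypothetical sample; host1/host2 and the extents are placeholders.
sample = [
    "Tablet Server shutdown requested for host1:9997",
    "tablet 2;a;b was unloaded from host1:9997",
    "Assigned !0,~del to host1:9997",
    "Assigned 2;c;d to host2:9997",
]
print(assignments_after_shutdown(sample))
```

Any pair reported by the function is an assignment that the manager made to a t-server it had already been asked to stop, which is the behavior in question on both clusters.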
