[ 
https://issues.apache.org/jira/browse/KUDU-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1839:
-------------------------------------
    Priority: Major  (was: Critical)

> DNS failure during tablet creation leads to undeletable tablet
> --------------------------------------------------------------
>
>                 Key: KUDU-1839
>                 URL: https://issues.apache.org/jira/browse/KUDU-1839
>             Project: Kudu
>          Issue Type: Bug
>          Components: master, tablet
>    Affects Versions: 1.2.0
>            Reporter: Adar Dembo
>
> During a YCSB workload, two tservers died due to DNS resolution timeouts. For 
> example: 
> {noformat}
> F0117 09:21:14.952937  8392 raft_consensus.cc:1985] Check failed: _s.ok() Bad 
> status: Network error: Could not obtain a remote proxy to the peer.: Unable 
> to resolve address 've0130.halxg.cloudera.com': Name or service not known
> {noformat}
> It's not clear why this happened; perhaps table creation places an inordinate 
> strain on DNS due to concurrent resolution load from all the bootstrapping 
> peers.
> In any case, when these tservers were restarted, two tablets failed to 
> bootstrap, both for the same reason. I'll focus on just one tablet from here 
> on out to simplify troubleshooting:
> {noformat}
> E0117 15:35:45.567312 85124 ts_tablet_manager.cc:749] T 
> 8c167c441a7d44b8add737d13797e694 P 7425c65d80f54f2da0a85494a5eb3e68: Tablet 
> failed to bootstrap: Not found: Unable to load Consensus metadata: 
> /data/2/kudu/consensus-meta/8c167c441a7d44b8add737d13797e694: No such file or 
> directory (error 2)
> {noformat}
> Eventually, the master decided to delete this tablet:
> {noformat}
> I0117 15:42:32.119601 85166 tablet_service.cc:672] Processing DeleteTablet 
> for tablet 8c167c441a7d44b8add737d13797e694 with delete_type 
> TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new 
> config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153
> I0117 15:42:32.139128 85166 tablet_service.cc:672] Processing DeleteTablet 
> for tablet 8c167c441a7d44b8add737d13797e694 with delete_type 
> TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new 
> config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153
> I0117 15:42:32.181843 85166 tablet_service.cc:672] Processing DeleteTablet 
> for tablet 8c167c441a7d44b8add737d13797e694 with delete_type 
> TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new 
> config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153
> I0117 15:42:32.276289 85166 tablet_service.cc:672] Processing DeleteTablet 
> for tablet 8c167c441a7d44b8add737d13797e694 with delete_type 
> TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new 
> config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153
> {noformat}
> As the repeated DeleteTablet requests show, each one failed. Annoyingly, the 
> tserver didn't log why, but the master did:
> {noformat}
> I0117 15:42:32.117022 33903 catalog_manager.cc:2758] Sending 
> DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 
> 8c167c441a7d44b8add737d13797e694 on 7425c65d80f54f2da0a85494a5eb3e68 
> (ve0122.halxg.cloudera.com:7050) (TS 7425c65d80f54f2da0a85494a5eb3e68 not 
> found in new config with opid_index 29)
> W0117 15:42:32.117463 33890 catalog_manager.cc:2725] TS 
> 7425c65d80f54f2da0a85494a5eb3e68 (ve0122.halxg.cloudera.com:7050): delete 
> failed for tablet 8c167c441a7d44b8add737d13797e694 with error code 
> TABLET_NOT_RUNNING: Illegal state: Consensus not available. Tablet shutting 
> down
> I0117 15:42:32.117491 33890 catalog_manager.cc:2522] Scheduling retry of 
> 8c167c441a7d44b8add737d13797e694 Delete Tablet RPC for 
> TS=7425c65d80f54f2da0a85494a5eb3e68 with a delay of 19ms (attempt = 1)...
> {noformat}
> This isn't a fatal error as far as the master is concerned, so it retries the 
> deletion forever.
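> To make the retry behavior concrete, here's a minimal standalone sketch of a 
> capped exponential backoff loop; the attempt cap, helper name, and callback 
> are assumptions for illustration, not Kudu's actual catalog_manager code:
> {noformat}
> #include <chrono>
> #include <cstdio>
> #include <functional>
> #include <thread>
> 
> // Hypothetical retry loop: back off exponentially and give up after a
> // bounded number of attempts instead of rescheduling forever.
> bool RetryDeleteTablet(const std::function<bool()>& send_delete_rpc) {
>   const int kMaxAttempts = 10;                 // assumed cap, not Kudu's
>   auto delay = std::chrono::milliseconds(19);  // initial delay seen above
>   for (int attempt = 1; attempt <= kMaxAttempts; ++attempt) {
>     if (send_delete_rpc()) {
>       return true;                             // tserver accepted the delete
>     }
>     std::printf("delete failed, retrying in %lld ms (attempt = %d)\n",
>                 static_cast<long long>(delay.count()), attempt);
>     std::this_thread::sleep_for(delay);
>     delay *= 2;                                // exponential backoff
>   }
>   return false;                                // surface a permanent failure
> }
> 
> int main() {
>   // Simulate a tserver that always answers TABLET_NOT_RUNNING.
>   bool ok = RetryDeleteTablet([] { return false; });
>   std::printf("delete %s\n", ok ? "succeeded" : "permanently failed");
>   return 0;
> }
> {noformat}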
> Meanwhile, the broken replica of this tablet still appears to be part of the 
> replication group. At least, that's true as far as both the master web UI and 
> the tserver web UI are concerned. The leader tserver is logging this error 
> repeatedly:
> {noformat}
> W0117 16:38:04.797828 81809 consensus_peers.cc:329] T 
> 8c167c441a7d44b8add737d13797e694 P 335d132897de4bdb9b87443f2c487a42 -> Peer 
> 7425c65d80f54f2da0a85494a5eb3e68 (ve0122.halxg.cloudera.com:7050): Couldn't 
> send request to peer 7425c65d80f54f2da0a85494a5eb3e68 for tablet 
> 8c167c441a7d44b8add737d13797e694. Error code: TABLET_NOT_RUNNING (12). 
> Status: Illegal state: Tablet not RUNNING: FAILED: Not found: Unable to load 
> Consensus metadata: 
> /data/2/kudu/consensus-meta/8c167c441a7d44b8add737d13797e694: No such file or 
> directory (error 2). Retrying in the next heartbeat period. Already tried 
> 6666 times.
> {noformat}
> It's not clear to me exactly what state the replication group is in. The 
> master did issue an AddServer request:
> {noformat}
> I0117 15:42:32.117065 33903 catalog_manager.cc:3069] Started AddServer task 
> for tablet 8c167c441a7d44b8add737d13797e694
> {noformat}
> But the tablet's leader still thinks the broken replica is part of the 
> replication group. So is this a tablet with two healthy replicas and one 
> broken one that can't recover? Maybe.
> So several things are broken here:
> # Table creation probably triggered a DNS resolution storm.
> # A DNS resolution failure is not retried, and here it led to tserver death 
> (a retry sketch follows this list).
> # On bootstrap, this replica was detected as having a tablet-meta file but no 
> consensus-meta, and was set aside as corrupt (good). But the lack of a 
> consensus-meta means there's no consensus state and so the tserver cannot 
> perform an "atomic delete" as requested by the master. Must we manually 
> delete this replica? Or should the master be able to force the issue?
> # The tserver did not log the tablet deletion failure.
> # The master retried the deletion in perpetuity.
> # Re-replication of this tablet by the leader appears to be broken.
> I think at least some of these issues are tracked in other JIRAs.
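> As noted in item 2, here's a minimal sketch of what retrying transient DNS 
> failures could look like, using plain getaddrinfo(); the wrapper name and 
> retry policy are hypothetical, not what Kudu's resolver actually does:
> {noformat}
> #include <netdb.h>
> #include <sys/socket.h>
> #include <chrono>
> #include <cstdio>
> #include <cstring>
> #include <string>
> #include <thread>
> 
> // Hypothetical wrapper: treat EAI_AGAIN as transient and retry with a short
> // backoff instead of CHECK-failing on the first resolution failure.
> bool ResolveWithRetry(const std::string& host, int max_attempts = 3) {
>   for (int attempt = 1; attempt <= max_attempts; ++attempt) {
>     addrinfo hints;
>     std::memset(&hints, 0, sizeof(hints));
>     hints.ai_family = AF_UNSPEC;
>     hints.ai_socktype = SOCK_STREAM;
>     addrinfo* result = nullptr;
>     int rc = getaddrinfo(host.c_str(), nullptr, &hints, &result);
>     if (rc == 0) {
>       freeaddrinfo(result);
>       return true;                       // resolved successfully
>     }
>     std::fprintf(stderr, "resolving '%s' failed: %s (attempt %d)\n",
>                  host.c_str(), gai_strerror(rc), attempt);
>     if (rc != EAI_AGAIN) {
>       break;                             // treat as permanent, don't retry
>     }
>     std::this_thread::sleep_for(std::chrono::milliseconds(100 * attempt));
>   }
>   return false;
> }
> {noformat}
> Note that the crash above reported "Name or service not known" rather than a 
> timeout, so treating only EAI_AGAIN as transient is an assumption; the point 
> is simply that a resolution failure shouldn't be an immediate CHECK failure.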



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
