[
https://issues.apache.org/jira/browse/KUDU-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jean-Daniel Cryans updated KUDU-1839:
-------------------------------------
Priority: Major (was: Critical)
> DNS failure during tablet creation leads to undeletable tablet
> --------------------------------------------------------------
>
> Key: KUDU-1839
> URL: https://issues.apache.org/jira/browse/KUDU-1839
> Project: Kudu
> Issue Type: Bug
> Components: master, tablet
> Affects Versions: 1.2.0
> Reporter: Adar Dembo
>
> During a YCSB workload, two tservers died due to DNS resolution timeouts. For
> example:
> {noformat}
> F0117 09:21:14.952937 8392 raft_consensus.cc:1985] Check failed: _s.ok() Bad
> status: Network error: Could not obtain a remote proxy to the peer.: Unable
> to resolve address 've0130.halxg.cloudera.com': Name or service not known
> {noformat}
> It's not clear why this happened; perhaps table creation places an inordinate
> strain on DNS due to concurrent resolution load from all the bootstrapping
> peers.
> In any case, when these tservers were restarted, two tablets failed to
> bootstrap, both for the same reason. I'll focus on just one tablet from here
> on out to simplify troubleshooting:
> {noformat}
> E0117 15:35:45.567312 85124 ts_tablet_manager.cc:749] T
> 8c167c441a7d44b8add737d13797e694 P 7425c65d80f54f2da0a85494a5eb3e68: Tablet
> failed to bootstrap: Not found: Unable to load Consensus metadata:
> /data/2/kudu/consensus-meta/8c167c441a7d44b8add737d13797e694: No such file or
> directory (error 2)
> {noformat}
> Eventually, the master decided to delete this tablet:
> {noformat}
> I0117 15:42:32.119601 85166 tablet_service.cc:672] Processing DeleteTablet
> for tablet 8c167c441a7d44b8add737d13797e694 with delete_type
> TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new
> config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153
> I0117 15:42:32.139128 85166 tablet_service.cc:672] Processing DeleteTablet
> for tablet 8c167c441a7d44b8add737d13797e694 with delete_type
> TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new
> config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153
> I0117 15:42:32.181843 85166 tablet_service.cc:672] Processing DeleteTablet
> for tablet 8c167c441a7d44b8add737d13797e694 with delete_type
> TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new
> config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153
> I0117 15:42:32.276289 85166 tablet_service.cc:672] Processing DeleteTablet
> for tablet 8c167c441a7d44b8add737d13797e694 with delete_type
> TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new
> config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153
> {noformat}
> As the repeated deletion requests show, each one failed. It's annoying
> that the tserver didn't log why, but the master did:
> {noformat}
> I0117 15:42:32.117022 33903 catalog_manager.cc:2758] Sending
> DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet
> 8c167c441a7d44b8add737d13797e694 on 7425c65d80f54f2da0a85494a5eb3e68
> (ve0122.halxg.cloudera.com:7050) (TS 7425c65d80f54f2da0a85494a5eb3e68 not
> found in new config with opid_index 29)
> W0117 15:42:32.117463 33890 catalog_manager.cc:2725] TS
> 7425c65d80f54f2da0a85494a5eb3e68 (ve0122.halxg.cloudera.com:7050): delete
> failed for tablet 8c167c441a7d44b8add737d13797e694 with error code
> TABLET_NOT_RUNNING: Illegal state: Consensus not available. Tablet shutting
> down
> I0117 15:42:32.117491 33890 catalog_manager.cc:2522] Scheduling retry of
> 8c167c441a7d44b8add737d13797e694 Delete Tablet RPC for
> TS=7425c65d80f54f2da0a85494a5eb3e68 with a delay of 19ms (attempt = 1)...
> {noformat}
> This isn't a fatal error as far as the master is concerned, so it retries the
> deletion forever.
> Meanwhile, the broken replica of this tablet still appears to be part of the
> replication group. At least, that's true as far as both the master web UI and
> the tserver web UI are concerned. The leader tserver is logging this error
> repeatedly:
> {noformat}
> W0117 16:38:04.797828 81809 consensus_peers.cc:329] T
> 8c167c441a7d44b8add737d13797e694 P 335d132897de4bdb9b87443f2c487a42 -> Peer
> 7425c65d80f54f2da0a85494a5eb3e68 (ve0122.halxg.cloudera.com:7050): Couldn't
> send request to peer 7425c65d80f54f2da0a85494a5eb3e68 for tablet
> 8c167c441a7d44b8add737d13797e694. Error code: TABLET_NOT_RUNNING (12).
> Status: Illegal state: Tablet not RUNNING: FAILED: Not found: Unable to load
> Consensus metadata:
> /data/2/kudu/consensus-meta/8c167c441a7d44b8add737d13797e694: No such file or
> directory (error 2). Retrying in the next heartbeat period. Already tried
> 6666 times.
> {noformat}
> It's not clear to me exactly what state the replication group is in. The
> master did issue an AddServer request:
> {noformat}
> I0117 15:42:32.117065 33903 catalog_manager.cc:3069] Started AddServer task
> for tablet 8c167c441a7d44b8add737d13797e694
> {noformat}
> But the leader of the tablet still thinks the broken replica is in the
> replication group. So is this a tablet with two healthy replicas and one
> broken one that can't recover? Maybe.
> Several things are broken here:
> # Table creation probably created a DNS resolution storm.
> # A failed DNS resolution is not retried, and here it led to the death of
> two tservers.
> # On bootstrap, this replica was detected as having a tablet-meta file but no
> consensus-meta, and was set aside as corrupt (good). But the lack of a
> consensus-meta means there's no consensus state and so the tserver cannot
> perform an "atomic delete" as requested by the master. Must we manually
> delete this replica? Or should the master be able to force the issue?
> # The tserver did not log the tablet deletion failure.
> # The master retried the deletion in perpetuity.
> # Re-replication of this tablet by the leader appears to be broken.
> I think at least some of these issues are tracked in other JIRAs.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)