[ https://issues.apache.org/jira/browse/KUDU-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15576176#comment-15576176 ]
Dinesh Bhat commented on KUDU-1618:
-----------------------------------

I was trying to repro an issue where I was not able to do a remote tablet copy onto a local_replica if the tablet was DELETE_TOMBSTONED (but had its metadata file present). While reproducing the issue, I also saw one replica state that was confusing. Here are the steps I executed:

1. Bring up a cluster with 1 master and 3 tablet servers hosting 3 tablets; each tablet had 3 replicas.
2. A standby tablet server was added later.
3. KILL one tserver; after 5 mins, all the replicas on that tserver fail over to the new standby.
4. Use 'local_replica copy_from_remote' to copy one tablet replica before bringing the killed tserver back up; the command fails:
{noformat}
I1013 16:43:41.523896 30948 tablet_copy_service.cc:124] Beginning new tablet copy session on tablet 048c7d202da3469eb1b1973df9510007 from peer bb2517bc5f2b4980bb07c06019b5a8e9 at {real_user=dinesh, eff_user=} at 127.61.33.8:40240: session id = bb2517bc5f2b4980bb07c06019b5a8e9-048c7d202da3469eb1b1973df9510007
I1013 16:43:41.524291 30948 tablet_copy_session.cc:142] T 048c7d202da3469eb1b1973df9510007 P 19acc272821d425582d3dfb9ed2ab7cd: Tablet Copy: opened 0 blocks and 1 log segments
Already present: Tablet already exists: 048c7d202da3469eb1b1973df9510007
{noformat}
5. Remove the metadata file and WAL log for that tablet; the copy_from_remote succeeds at this point (expected).
6. Bring up the killed tserver; now all replicas on it are tombstoned except the one tablet for which we did a copy_from_remote in step 5.
The master, which had been incessantly trying to tombstone the evicted replicas on the tserver that was down earlier, logs something interesting:
{noformat}
[dinesh@ve0518 debug]$ I1013 16:55:54.551717 26141 catalog_manager.cc:2591] Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 048c7d202da3469eb1b1973df9510007 on bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867) (TS bb2517bc5f2b4980bb07c06019b5a8e9 not found in new config with opid_index 4)
W1013 16:55:54.552803 26141 catalog_manager.cc:2552] TS bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867): delete failed for tablet 048c7d202da3469eb1b1973df9510007 due to a CAS failure. No further retry: Illegal state: Request specified cas_config_opid_index_less_or_equal of -1 but the committed config has opid_index of 5
I1013 16:55:54.884133 26141 catalog_manager.cc:2591] Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet e9481b695d34483488af07dfb94a8557 on bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867) (TS bb2517bc5f2b4980bb07c06019b5a8e9 not found in new config with opid_index 3)
I1013 16:55:54.885964 26141 catalog_manager.cc:2567] TS bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867): tablet e9481b695d34483488af07dfb94a8557 (table test-table [id=ca8f507e47684ddfa147e2cd232ed773]) successfully deleted
I1013 16:55:54.915202 26141 catalog_manager.cc:2591] Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet e3ff6a1529cf46c5b9787fe322a749e6 on bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867) (TS bb2517bc5f2b4980bb07c06019b5a8e9 not found in new config with opid_index 3)
I1013 16:55:54.916774 26141 catalog_manager.cc:2567] TS bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867): tablet e3ff6a1529cf46c5b9787fe322a749e6 (table test-table [id=ca8f507e47684ddfa147e2cd232ed773]) successfully deleted
{noformat}
7. The restarted tserver now continuously spews log messages like this:
{noformat}
[dinesh@ve0518 debug]$ W1013 16:55:36.608486 6519 raft_consensus.cc:461] T 048c7d202da3469eb1b1973df9510007 P bb2517bc5f2b4980bb07c06019b5a8e9 [term 5 NON_PARTICIPANT]: Failed to trigger leader election: Illegal state: Not starting election: Node is currently a non-participant in the raft config: opid_index: 5 OBSOLETE_local: false peers { permanent_uuid: "9acfc108d9b446c1be783b6d6e7b49ef" member_type: VOTER last_known_addr { host: "127.95.58.0" port: 33932 } } peers { permanent_uuid: "b11d2af1457b4542808407b4d4d1bd29" member_type: VOTER last_known_addr { host: "127.95.58.2" port: 40670 } } peers { permanent_uuid: "19acc272821d425582d3dfb9ed2ab7cd" member_type: VOTER last_known_addr { host: "127.61.33.8" port: 63532 } }
{noformat}

> Add local_replica tool to delete a replica
> ------------------------------------------
>
>                 Key: KUDU-1618
>                 URL: https://issues.apache.org/jira/browse/KUDU-1618
>             Project: Kudu
>          Issue Type: Improvement
>          Components: ops-tooling
>    Affects Versions: 1.0.0
>            Reporter: Todd Lipcon
>            Assignee: Dinesh Bhat
>
> Occasionally we've hit cases where a tablet is corrupt in such a way that the
> tserver fails to start or crashes soon after starting. Typically we'd prefer
> the tablet just get marked FAILED but in the worst case it causes the whole
> tserver to fail.
> For these cases we should add a 'local_replica' subtool to fully remove a
> local tablet. Related, it might be useful to have a 'local_replica archive'
> which would create a tarball from the data in this tablet for later
> examination by developers.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)