[jira] [Commented] (KUDU-1618) Add local_replica tool to delete a replica

Dinesh Bhat (JIRA) Fri, 14 Oct 2016 11:59:34 -0700

    [ 
https://issues.apache.org/jira/browse/KUDU-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15576176#comment-15576176
 ]


Dinesh Bhat commented on KUDU-1618:
-----------------------------------

I was trying to reproduce an issue where I could not do a remote tablet copy 
onto a local replica when the tablet was tombstoned (TABLET_DATA_TOMBSTONED) 
but its metadata file was still present. While reproducing the issue, I also 
saw one replica state that was confusing. Here are the steps I executed:
1. Bring up a cluster with 1 master and 3 tablet servers hosting 3 tablets, 
each tablet with 3 replicas.
2. Add a standby tablet server afterwards.
3. Kill one tserver; after 5 minutes, all replicas on that tserver fail over 
to the new standby.
4. Use 'local_replica copy_from_remote' to copy one tablet replica onto the 
killed tserver before bringing it back up. The command fails:
{noformat}
I1013 16:43:41.523896 30948 tablet_copy_service.cc:124] Beginning new tablet 
copy session on tablet 048c7d202da3469eb1b1973df9510007 from peer 
bb2517bc5f2b4980bb07c06019b5a8e9 at {real_user=dinesh, eff_user=} at 
127.61.33.8:40240: session id = 
bb2517bc5f2b4980bb07c06019b5a8e9-048c7d202da3469eb1b1973df9510007
I1013 16:43:41.524291 30948 tablet_copy_session.cc:142] T 
048c7d202da3469eb1b1973df9510007 P 19acc272821d425582d3dfb9ed2ab7cd: Tablet 
Copy: opened 0 blocks and 1 log segments
Already present: Tablet already exists: 048c7d202da3469eb1b1973df9510007
{noformat}
5. Remove the metadata file and the WAL for that tablet; copy_from_remote 
then succeeds (expected).
6. Bring up the killed tserver. All replicas on it are now tombstoned except 
the one tablet we copied with copy_from_remote in step 5. The master, which 
had been incessantly trying to tombstone the evicted replicas on the tserver 
that was down, logs something interesting:
{noformat}
[dinesh@ve0518 debug]$ I1013 16:55:54.551717 26141 catalog_manager.cc:2591] 
Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 
048c7d202da3469eb1b1973df9510007 on bb2517bc5f2b4980bb07c06019b5a8e9 
(127.95.58.1:40867) (TS bb2517bc5f2b4980bb07c06019b5a8e9 not found in new 
config with opid_index 4)
W1013 16:55:54.552803 26141 catalog_manager.cc:2552] TS 
bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867): delete failed for tablet 
048c7d202da3469eb1b1973df9510007 due to a CAS failure. No further retry: 
Illegal state: Request specified cas_config_opid_index_less_or_equal of -1 but 
the committed config has opid_index of 5
I1013 16:55:54.884133 26141 catalog_manager.cc:2591] Sending 
DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 
e9481b695d34483488af07dfb94a8557 on bb2517bc5f2b4980bb07c06019b5a8e9 
(127.95.58.1:40867) (TS bb2517bc5f2b4980bb07c06019b5a8e9 not found in new 
config with opid_index 3)
I1013 16:55:54.885964 26141 catalog_manager.cc:2567] TS 
bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867): tablet 
e9481b695d34483488af07dfb94a8557 (table test-table 
[id=ca8f507e47684ddfa147e2cd232ed773]) successfully deleted
I1013 16:55:54.915202 26141 catalog_manager.cc:2591] Sending 
DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 
e3ff6a1529cf46c5b9787fe322a749e6 on bb2517bc5f2b4980bb07c06019b5a8e9 
(127.95.58.1:40867) (TS bb2517bc5f2b4980bb07c06019b5a8e9 not found in new 
config with opid_index 3)
I1013 16:55:54.916774 26141 catalog_manager.cc:2567] TS 
bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867): tablet 
e3ff6a1529cf46c5b9787fe322a749e6 (table test-table 
[id=ca8f507e47684ddfa147e2cd232ed773]) successfully deleted
{noformat}
7. The restarted tserver now continuously spews log messages like this:
{noformat}
[dinesh@ve0518 debug]$ W1013 16:55:36.608486  6519 raft_consensus.cc:461] T 
048c7d202da3469eb1b1973df9510007 P bb2517bc5f2b4980bb07c06019b5a8e9 [term 5 
NON_PARTICIPANT]: Failed to trigger leader election: Illegal state: Not 
starting election: Node is currently a non-participant in the raft config: 
opid_index: 5 OBSOLETE_local: false peers { permanent_uuid: 
"9acfc108d9b446c1be783b6d6e7b49ef" member_type: VOTER last_known_addr { host: 
"127.95.58.0" port: 33932 } } peers { permanent_uuid: 
"b11d2af1457b4542808407b4d4d1bd29" member_type: VOTER last_known_addr { host: 
"127.95.58.2" port: 40670 } } peers { permanent_uuid: 
"19acc272821d425582d3dfb9ed2ab7cd" member_type: VOTER last_known_addr { host: 
"127.61.33.8" port: 63532 } }
{noformat}
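
To make the two confusing messages easier to reason about, here is a rough 
sketch in Python of the checks they appear to describe. This is only an 
illustration built from the log text above, not Kudu's actual C++ code: the 
class and function names (CommittedConfig, is_participant, check_delete_cas) 
are hypothetical, and the comparison rule is an assumption inferred from the 
option name cas_config_opid_index_less_or_equal.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CommittedConfig:
    """Hypothetical stand-in for a replica's committed Raft config."""
    opid_index: int
    voter_uuids: Tuple[str, ...]


def is_participant(config: CommittedConfig, uuid: str) -> bool:
    """A node can start an election only if it is listed in the config."""
    return uuid in config.voter_uuids


def check_delete_cas(config: CommittedConfig,
                     cas_opid_index: Optional[int]) -> Tuple[bool, str]:
    """Assumed rule: the delete is allowed only if the committed config's
    opid_index is less than or equal to the index given in the request."""
    if cas_opid_index is None:
        return True, "OK"  # no CAS condition was requested
    if config.opid_index > cas_opid_index:
        return False, (
            "Illegal state: Request specified "
            f"cas_config_opid_index_less_or_equal of {cas_opid_index} but "
            f"the committed config has opid_index of {config.opid_index}")
    return True, "OK"


# Values taken from the logs above for tablet 048c7d202da3469eb1b1973df9510007.
LOCAL_UUID = "bb2517bc5f2b4980bb07c06019b5a8e9"
committed = CommittedConfig(
    opid_index=5,
    voter_uuids=(
        "9acfc108d9b446c1be783b6d6e7b49ef",
        "b11d2af1457b4542808407b4d4d1bd29",
        "19acc272821d425582d3dfb9ed2ab7cd",
    ),
)

# The copied replica carries a config that does not list the local tserver,
# so it reports itself as a non-participant and cannot start an election.
print(is_participant(committed, LOCAL_UUID))

# The master's DeleteTablet request carried -1, which is below the committed
# index of 5, so the compare-and-swap check rejects it.
print(check_delete_cas(committed, -1))
```

Under these assumptions the copied replica is stuck: it cannot vote because 
its config (opid_index 5) does not include bb2517bc5f2b4980bb07c06019b5a8e9, 
and the master cannot tombstone it because the CAS index in the request is 
lower than the committed one.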

> Add local_replica tool to delete a replica
> ------------------------------------------
>
>                 Key: KUDU-1618
>                 URL: https://issues.apache.org/jira/browse/KUDU-1618
>             Project: Kudu
>          Issue Type: Improvement
>          Components: ops-tooling
>    Affects Versions: 1.0.0
>            Reporter: Todd Lipcon
>            Assignee: Dinesh Bhat
>
> Occasionally we've hit cases where a tablet is corrupt in such a way that the 
> tserver fails to start or crashes soon after starting. Typically we'd prefer 
> the tablet just get marked FAILED but in the worst case it causes the whole 
> tserver to fail.
> For these cases we should add a 'local_replica' subtool to fully remove a 
> local tablet. Related, it might be useful to have a 'local_replica archive' 
> which would create a tarball from the data in this tablet for later 
> examination by developers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
