Adar Dembo created KUDU-2918:
--------------------------------

             Summary: Rebalancer can fail when a service queue is full
                 Key: KUDU-2918
                 URL: https://issues.apache.org/jira/browse/KUDU-2918
             Project: Kudu
          Issue Type: Bug
          Components: CLI, ksck
    Affects Versions: 1.11.0
            Reporter: Adar Dembo


The various low-level RPCs issued by ksck aren't retried if the corresponding 
service queues are full. These include GetConsensusState, GetStatus, and 
ListTablets.

Without retries, ksck (and the rebalancer) can fail midway:
{noformat}
I0812 11:21:10.669682 42799 rebalancer.cc:831] tablet 
d729fb149e804696a0862adacb725d66: a0dca75bbbfb4de69616694834adf930 -> 
24d0eb73b3c64a0f901ae092186b3439 move is abandoned: Remote error: Service 
unavailable: GetConsensusState request on kudu.consensus.ConsensusService from 
10.17.182.15:50754 dropped due to backpressure. The service queue is full; it 
has 50 items.
I0812 11:21:10.871894 42799 rebalancer.cc:239] re-synchronizing cluster state
Illegal state: tablet server 0d88ff7360b74d1e81cd2ccd41fab8a5 
(foo.bar.com:7050): unacceptable health status UNAVAILABLE
{noformat}

The helper classes in rpc/rpc.h may be useful here.
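
For illustration only, the retry behavior being asked for could look roughly like the sketch below. It does not use the actual classes in rpc/rpc.h; `RpcCode`, `RpcResult`, and `RetryWhileBusy` are hypothetical names invented for this example, and the attempt limit and backoff values are assumptions rather than Kudu defaults. The idea is to retry only the "service unavailable" (queue full) case, with capped exponential backoff, and to return any other result immediately.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <string>
#include <thread>

// Hypothetical stand-in for an RPC outcome; this is NOT kudu::Status.
enum class RpcCode { kOk, kServiceUnavailable, kOtherError };

struct RpcResult {
  RpcCode code;
  std::string message;
};

// Calls `rpc` until it returns something other than kServiceUnavailable or
// until `max_attempts` calls have been made. Sleeps between attempts, doubling
// the delay each time up to `max_backoff`. Returns the last result seen.
RpcResult RetryWhileBusy(const std::function<RpcResult()>& rpc,
                         int max_attempts,
                         std::chrono::milliseconds initial_backoff,
                         std::chrono::milliseconds max_backoff) {
  RpcResult result{RpcCode::kOtherError, "no attempts made"};
  std::chrono::milliseconds backoff = initial_backoff;
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    result = rpc();
    // Only a full service queue is treated as transient; everything else
    // (success or a different error) is returned to the caller as-is.
    if (result.code != RpcCode::kServiceUnavailable) {
      return result;
    }
    if (attempt < max_attempts) {
      std::this_thread::sleep_for(backoff);
      backoff = std::min(backoff * 2, max_backoff);
    }
  }
  return result;  // Still kServiceUnavailable after the final attempt.
}
```

A caller would wrap each low-level call (for example, the one that fetches consensus state) in a lambda and pass it to `RetryWhileBusy`, so that a briefly full queue no longer aborts the whole ksck or rebalancer run. A production version would probably also honor an overall deadline and add jitter to the backoff.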



