Todd Lipcon commented on KUDU-2342:

Reconstructing the timeline a bit:

- 07:20:54.751998: peer e3fdd8 fell behind the leader's log retention and 
"can never be caught up"
- 07:20:54.766460: peer f7376c added as a NON_VOTER
- 07:20:55.268965: tablet copy starts to f7376c
- 07:21:34.559736: tablet copy ends
- 07:21:34.779841: logs held by the tablet copy session are GCed
- 07:21:34.790443: the new NON_VOTER peer is already unable to be caught up 
because the logs just got GCed (*hmm, interesting*)
- 07:21:34.790797: nevertheless, the leader issues a config change to promote 
f7376c to VOTER

Now we have 2/4 VOTER replicas which can never be caught up -- the original bad 
one, and the one we just promoted. Hence we can't make progress.
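
To spell out the quorum arithmetic: a minimal sketch, assuming the usual Raft 
rule that an operation needs acknowledgement from a strict majority of voters. 
The function names are made up for illustration and are not Kudu code.

```python
def majority_size(num_voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return num_voters // 2 + 1


def can_make_progress(num_voters: int, num_unrecoverable: int) -> bool:
    """True if the voters that can still be caught up form a majority."""
    healthy = num_voters - num_unrecoverable
    return healthy >= majority_size(num_voters)


# The situation after the promotion: 4 voters, 2 of which can never catch up.
stuck_config = can_make_progress(num_voters=4, num_unrecoverable=2)

# The situation before it: 3 voters, 1 of which can never catch up.
original_config = can_make_progress(num_voters=3, num_unrecoverable=1)
```

With 4 voters the majority is 3, so 2 healthy voters are not enough. The 
original 3-voter config with one bad peer could still commit.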

It seems there are two serious issues at play here:
- Why did we not retain the logs between the tablet copy session finishing and 
the peer catching up? Perhaps because the non-voter isn't included in the log 
retention calculations and was more than 80 segments behind?
- Why did we promote a non-voter that wasn't relatively up to date or in a 
"good" state?
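
The two questions above can be sketched as two checks. This is a toy model 
under stated assumptions, not Kudu's implementation: the Peer class, the 
function names, and the ops-per-segment conversion are all hypothetical, and 
the 80-segment cap is only the guess from the first bullet.

```python
from dataclasses import dataclass
from typing import List

# Assumption: cap taken from the "80 segments" guess above.
MAX_SEGMENTS_TO_RETAIN = 80


@dataclass
class Peer:
    uuid: str
    is_voter: bool
    last_received_index: int  # last log index this peer has replicated


def earliest_index_to_retain(peers: List[Peer], leader_last_index: int,
                             ops_per_segment: int,
                             include_non_voters: bool) -> int:
    """First log index the leader keeps; everything before it may be GCed."""
    wanted = [p.last_received_index + 1 for p in peers
              if p.is_voter or include_non_voters]
    # Never retain more than the cap, even for a peer that is far behind.
    hard_floor = max(0, leader_last_index
                     - MAX_SEGMENTS_TO_RETAIN * ops_per_segment)
    return max(min(wanted, default=leader_last_index), hard_floor)


def safe_to_promote(peer: Peer, earliest_retained: int) -> bool:
    """Promote only if the next op the peer needs is still in the log."""
    return peer.last_received_index + 1 >= earliest_retained


peers = [Peer("leader", True, 1000), Peer("f7376c", False, 900)]

# Non-voter ignored: the ops it still needs get GCed.
gc_point = earliest_index_to_retain(peers, 1000, 10, include_non_voters=False)
promote_without_retention = safe_to_promote(peers[1], gc_point)

# Non-voter counted: its ops are retained and promotion is safe.
gc_point_incl = earliest_index_to_retain(peers, 1000, 10,
                                         include_non_voters=True)
promote_with_retention = safe_to_promote(peers[1], gc_point_incl)
```

In this model either check alone would have avoided the stuck config: 
counting the non-voter keeps its logs around, and the promotion guard refuses 
to promote a peer whose next op has already been GCed.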

> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with 
> "Failed to write batch "
> -----------------------------------------------------------------------------------------------------
>                 Key: KUDU-2342
>                 URL: https://issues.apache.org/jira/browse/KUDU-2342
>             Project: Kudu
>          Issue Type: Bug
>          Components: tablet
>    Affects Versions: 1.7.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Alexey Serbin
>            Priority: Critical
>              Labels: scalability
>         Attachments: Impala query profile.txt, tablet-info.html
> While loading TPCH 30TB on a 129-node cluster via Impala, a write operation 
> failed with:
>     Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)
