[ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16397505#comment-16397505
 ] 

Todd Lipcon commented on KUDU-2342:
-----------------------------------

It appears what happened is that the leader actually got 80 segments ahead of 
the two followers, and since our default log_max_segments_to_retain=80, it GCed 
the logs anyway. Then it couldn't replicate to either follower and the tablet 
got stuck. I checked the earliest WAL on that server (wal-000001141) and its 
earliest op is 1.1154.
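
To illustrate the retention behavior described above, here is a minimal
sketch. It is a simplified model, not Kudu's actual GC code: the function
name, its parameters, and the segment numbers in the usage note are
hypothetical, and only the default of 80 for log_max_segments_to_retain comes
from this comment.

```python
from typing import List


def segments_to_gc(segment_seqnos: List[int],
                   min_seqno_needed_by_peers: int,
                   max_segments_to_retain: int = 80) -> List[int]:
    """Return the segment sequence numbers a simplified policy would delete.

    Hypothetical model: a segment is deleted if no peer needs it any more,
    OR if it falls outside the newest `max_segments_to_retain` segments,
    even when a lagging peer still needs it.
    """
    ordered = sorted(segment_seqnos)

    # Segments older than anything a peer still needs are always eligible.
    not_needed = [s for s in ordered if s < min_seqno_needed_by_peers]

    # Hard cap: everything except the newest N segments is dropped,
    # regardless of what lagging peers still need.
    cut = max(len(ordered) - max_segments_to_retain, 0)
    over_cap = ordered[:cut]

    return sorted(set(not_needed) | set(over_cap))
```

With made-up segments 1 through 100 and a follower that still needs segment
5, this model deletes segments 1 through 20. The follower can then no longer
be caught up from the leader's WAL, which is the kind of stuck state
described above.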

What's a bit odd here is that the leader's watermarks show 1232 as both the 
committed index and the majority-replicated index, but it wants to send ops 1143 and 
1055 to the two peers. Also interesting is that it appears this tablet is 
currently in a configuration with four VOTER replicas.
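
One way to see why these numbers look odd is to compute the
majority-replicated index by hand. The sketch below assumes the textbook Raft
rule (an index counts as majority-replicated once a strict majority of voters
have it) and is not taken from Kudu's consensus code. It also assumes that
"wants to send op N" means the peer has acknowledged everything through N-1,
and that the fourth voter is fully caught up; both are guesses made for
illustration.

```python
from typing import List


def majority_replicated_index(match_indexes: List[int]) -> int:
    """Highest op index held by a strict majority of voters.

    `match_indexes` has one entry per voter (leader included): the highest
    op index that voter is known to have replicated.
    """
    majority = len(match_indexes) // 2 + 1
    # The majority-th largest value is present on at least `majority` voters.
    return sorted(match_indexes, reverse=True)[majority - 1]


# Hypothetical four-voter view: leader and one peer at 1232, the two
# lagging peers at 1142 and 1054 (one below the ops the leader wants to send).
four_voters = [1232, 1232, 1142, 1054]
print(majority_replicated_index(four_voters))  # 1142
```

Under these assumptions the majority-replicated index would be 1142 rather
than 1232, since three of the four voters would have to hold op 1232 for it
to count.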

> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with 
> "Failed to write batch "
> -----------------------------------------------------------------------------------------------------
>
>                 Key: KUDU-2342
>                 URL: https://issues.apache.org/jira/browse/KUDU-2342
>             Project: Kudu
>          Issue Type: Bug
>          Components: tablet
>    Affects Versions: 1.7.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Alexey Serbin
>            Priority: Critical
>              Labels: scalability
>         Attachments: Impala query profile.txt, tablet-info.html
>
>
> While loading TPCH 30TB on 129 node cluster via Impala, write operation 
> failed with :
>     Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
