[ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16397479#comment-16397479
 ] 

Todd Lipcon commented on KUDU-2342:
-----------------------------------

The server vc1515 has the following spewing in its logs:

{code}
I0313 11:56:27.615651 43703 consensus_peers.cc:230] T b8431200388d486995a4426c88bc06a2 P a260dca5a9c846e99cb621881a7b86b8 -> Peer f7376c96c6b64e7fa6a7bfc84fd0cd64 (vc1534.halxg.cloudera.com:7050): Could not obtain request from queue for peer: f7376c96c6b64e7fa6a7bfc84fd0cd64. Status: Not found: Failed to read ops 1143..1221: Segment 1130 which contained index 1143 has been GCed
I0313 11:56:27.973654 43703 consensus_peers.cc:230] T b8431200388d486995a4426c88bc06a2 P a260dca5a9c846e99cb621881a7b86b8 -> Peer e3fdd8da21a643aba21b7acdd6b17499 (va1038.halxg.cloudera.com:7050): Could not obtain request from queue for peer: e3fdd8da21a643aba21b7acdd6b17499. Status: Not found: Failed to read ops 1055..1221: Segment 1043 which contained index 1055 has been GCed
{code}

In other words, it appears to have evicted the log segments necessary to catch 
up both of its followers. Thus it's unable to replicate and commit any writes, 
so the write here timed out. Instead of letting it time out, we should of 
course respond more rapidly saying that the tablet is unavailable, but that's a 
separate issue.
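
To make the failure concrete, here is a rough Python sketch (not Kudu code) 
that pulls the stuck peers out of the two log lines above and checks whether 
the leader can still reach a majority. The regex, the helper names and the 
assumption of a 3-replica config (the leader plus the two followers in the 
log) are mine, for illustration only.

```python
import re

# The two messages quoted above, each joined back onto a single line.
LOG_LINES = [
    ("I0313 11:56:27.615651 43703 consensus_peers.cc:230] T "
     "b8431200388d486995a4426c88bc06a2 P a260dca5a9c846e99cb621881a7b86b8 "
     "-> Peer f7376c96c6b64e7fa6a7bfc84fd0cd64 "
     "(vc1534.halxg.cloudera.com:7050): Could not obtain request from queue "
     "for peer: f7376c96c6b64e7fa6a7bfc84fd0cd64. Status: Not found: Failed "
     "to read ops 1143..1221: Segment 1130 which contained index 1143 has "
     "been GCed"),
    ("I0313 11:56:27.973654 43703 consensus_peers.cc:230] T "
     "b8431200388d486995a4426c88bc06a2 P a260dca5a9c846e99cb621881a7b86b8 "
     "-> Peer e3fdd8da21a643aba21b7acdd6b17499 "
     "(va1038.halxg.cloudera.com:7050): Could not obtain request from queue "
     "for peer: e3fdd8da21a643aba21b7acdd6b17499. Status: Not found: Failed "
     "to read ops 1055..1221: Segment 1043 which contained index 1055 has "
     "been GCed"),
]

# Hypothetical pattern written against the message text shown above.
PATTERN = re.compile(
    r"-> Peer (?P<peer>[0-9a-f]+) \((?P<host>[^)]+)\): .*"
    r"Failed to read ops (?P<first>\d+)\.\.(?P<last>\d+): "
    r"Segment (?P<segment>\d+) which contained index \d+ has been GCed"
)


def stuck_followers(lines):
    """Map peer UUID -> details for every peer blocked on a GCed segment."""
    stuck = {}
    for line in lines:
        match = PATTERN.search(line)
        if match:
            stuck[match.group("peer")] = {
                "host": match.group("host"),
                "first_op": int(match.group("first")),
                "last_op": int(match.group("last")),
                "segment": int(match.group("segment")),
            }
    return stuck


def can_commit(num_replicas, num_stuck):
    """True if the replicas that can still ack new ops form a majority."""
    able_to_ack = num_replicas - num_stuck  # the leader counts as one
    return able_to_ack >= num_replicas // 2 + 1


stuck = stuck_followers(LOG_LINES)
for peer, info in sorted(stuck.items()):
    print(peer, info)
print("can commit:", can_commit(3, len(stuck)))
```

With both followers stuck, only the leader itself can ack, which is 1 of 3, 
so nothing new can commit.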

I guess in this case we can't recover because it won't evict a follower either, 
since it knows that it wouldn't be able to commit the config change. So, how 
did it get into the state where it had GCed logs behind the majority_replicated 
watermark? [~aserbin] said he can take a look.
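
As a back-of-the-envelope check on the watermark, the op ranges in the log 
lines can be turned into per-replica indexes. This is a toy calculation, not 
how Kudu computes it: I'm assuming the leader's last op is 1221 (the upper 
end of both ranges) and that each follower holds everything up to one op 
before the first one the leader tried to send it. The function name is made 
up for the example.

```python
def majority_replicated_index(last_indexes):
    """Highest op index that a strict majority of replicas already hold."""
    ordered = sorted(last_indexes, reverse=True)
    majority = len(ordered) // 2 + 1
    return ordered[majority - 1]


# Assumptions read off the log lines (leader's view of its peers):
leader_last = 1221              # upper end of "ops 1143..1221"
follower_last = [1143 - 1,      # vc1534 needs ops starting at 1143
                 1055 - 1]      # va1038 needs ops starting at 1055

watermark = majority_replicated_index([leader_last] + follower_last)
first_missing_op = 1143         # lives in GCed segment 1130

print("majority_replicated (toy estimate):", watermark)
print("GCed an op past the watermark:", first_missing_op > watermark)
```

Under those assumptions the estimate comes out at 1142, so op 1143 was not 
yet on a majority of replicas, but the segment holding it is already gone.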

> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with 
> "Failed to write batch "
> -----------------------------------------------------------------------------------------------------
>
>                 Key: KUDU-2342
>                 URL: https://issues.apache.org/jira/browse/KUDU-2342
>             Project: Kudu
>          Issue Type: Bug
>          Components: tablet
>    Affects Versions: 1.7.0
>            Reporter: Mostafa Mokhtar
>            Priority: Major
>              Labels: scalability
>         Attachments: Impala query profile.txt
>
>
> While loading TPCH 30TB on a 129-node cluster via Impala, a write operation 
> failed with:
>     Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
