Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
......................................................................


Patch Set 24:

> Given that we're still chasing strange test failures on this, and
 > it's on a pretty important and tricky code path, maybe we should
 > chat about whether it's really necessary for 1.3.0? i.e are the
 > downside risks of not having it included in the release worse than
 > the downside risks of potential bugs? I haven't followed it closely
 > but as the 1.3 RM I'm feeling nervous about complex patches coming
 > in very close to the first rc being cut (hoping to do that
 > tomorrow)

Yes, that's a very good point.  However, I think I understand what the
issue is: upon a master leadership change, the new leader sometimes does
not see the last successful write from the former leader.  That bug can
affect table/tablet metadata as well.  I.e., a newly created tablet could
be overlooked at a leadership change, and it would be seen only on the next
call of ElectedAsLeaderCb.

E.g., take a look at the log from
https://kudu-test-results.s3.amazonaws.com/aserbin.1489032335.25361.f2e8aa26b74185a6bd16d5d554488e6f1af190f5.13.0-artifacts.zip

This is what I think happened there:

1. The master at 127.0.0.1:11032 generated and successfully wrote a TSK with
id 0 (I0309 04:05:44.679442   558 catalog_manager.cc:3432] Generated new TSK 0).
Later on, a re-election happened and it was elected leader again.  It
generated a TSK with id 1, but because of the latency injected prior to
writing the key into the table, it failed to write it into the system table
due to the leadership term change.

2. Some other leader started its leadership duties but failed to catch up as a
leader.

3. Our former (first) master server became the leader again, and it generated
and successfully wrote a TSK with id 2 (I0309 04:05:46.812899   668
catalog_manager.cc:3432] Generated new TSK 2)

4. Right after that, leadership changed and another master server ran its
ElectedAsLeaderCb, and it did not see the latest TSK record in the system
table.  Seeing just the record with TSK id 0, it generated and successfully
wrote its new TSK with id 1 (I0309 04:05:47.270205   378
catalog_manager.cc:3432] Generated new TSK 1)

5. Now, the client has connected to the current master leader, which has just
generated a TSK and made it current (the TSK rotation period is 2 seconds).
It got an authn token signed by the TSK with id 1.

6. The client tries to execute a write operation against the tablet server,
which has received TSKs with ids 0 and 2.  The tablet server cannot see the
TSK with id 1 because the new master does not send it in its response, since
the tablet server reports 2 as its latest TSK id.

7. The tablet server responded with 'Runtime error: ERROR_UNAVAILABLE: Not 
authorized: authentication token signed with unknown key' while the client 
tried to negotiate the connection.
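
To make steps 3-7 easier to follow, here is a minimal sketch of the sequence
in Python.  It is not Kudu code: the function names are hypothetical, and it
assumes a new leader picks "highest visible id + 1" and that a master returns
only keys with an id greater than the one the tablet server reports.

```python
def next_tsk_seq(visible_seqs):
    """Id a new leader picks (assumed rule): one past the highest it can see."""
    return max(visible_seqs, default=-1) + 1


def keys_newer_than(master_seqs, latest_known_seq):
    """Keys a master returns to a tablet server that reports latest_known_seq
    (assumed rule: only ids strictly greater than the reported one)."""
    return sorted(s for s in master_seqs if s > latest_known_seq)


def simulate():
    # Step 3: the tablet server has already received TSKs 0 and 2.
    tserver_seqs = {0, 2}

    # Step 4: the new leader misses the last write and sees only TSK 0,
    # so it generates a TSK whose id collides with the "gap" at 1.
    stale_view = {0}
    new_seq = next_tsk_seq(stale_view)
    new_leader_seqs = stale_view | {new_seq}

    # Step 5: the client's authn token is signed by the new leader's TSK.
    token_signing_seq = new_seq

    # Step 6: the tablet server reports 2 as its latest id and gets nothing back.
    delta = keys_newer_than(new_leader_seqs, max(tserver_seqs))
    tserver_seqs |= set(delta)

    # Step 7: the token verifies only if the tablet server knows the signing key.
    token_accepted = token_signing_seq in tserver_seqs
    return new_seq, delta, token_accepted


if __name__ == "__main__":
    print(simulate())
```

Under these assumptions the new leader picks id 1, the tablet server receives
no new keys, and the token is rejected, which matches the 'authentication
token signed with unknown key' error above.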

-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 24
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: No
