Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
......................................................................

Patch Set 24:

> Given that we're still chasing strange test failures on this, and
> it's on a pretty important and tricky code path, maybe we should
> chat about whether it's really necessary for 1.3.0? i.e are the
> downside risks of not having it included in the release worse than
> the downside risks of potential bugs? I haven't followed it closely
> but as the 1.3 RM I'm feeling nervous about complex patches coming
> in very close to the first rc being cut (hoping to do that
> tomorrow)

Yes, that's a very good point. However, I think I understand what the issue is: upon a master leadership change, the new leader sometimes does not see the last successful write from the former leader. That bug can affect table/tablet metadata as well. I.e., a newly created tablet could be overlooked at a leadership change, and it would be seen only on the next call of ElectedAsLeaderCb.

For example, take a look at the log from https://kudu-test-results.s3.amazonaws.com/aserbin.1489032335.25361.f2e8aa26b74185a6bd16d5d554488e6f1af190f5.13.0-artifacts.zip

This is what I think happened there:

1. The master at 127.0.0.1:11032 generated and successfully wrote the TSK with id 0 (I0309 04:05:44.679442 558 catalog_manager.cc:3432] Generated new TSK 0). Later on, a re-election happened and it was elected leader again. It generated a TSK with id 1, but because of the injected latency prior to writing the key into the system table, the write failed due to a leadership term change.

2. Some other master started its leadership duties but failed to catch up as a leader.

3. Our former (first) master became the leader again, and it generated and successfully wrote the TSK with id 2 (I0309 04:05:46.812899 668 catalog_manager.cc:3432] Generated new TSK 2).

4. Right after that, leadership changed again: the other master ran its ElectedAsLeaderCb and did not see the latest TSK record in the system table. Seeing just the record with TSK id 0, it generated and successfully wrote its own new TSK with id 1 (I0309 04:05:47.270205 378 catalog_manager.cc:3432] Generated new TSK 1).

5. The client then connected to the current master leader, which had just generated that TSK and made it current (the TSK rotation period is 2 seconds). The client got an authn token signed by the TSK with id 1.

6. The client tried to execute a write operation against a tablet server which had received the TSKs with ids 0 and 2. The tablet server cannot see the TSK with id 1 because the new master does not send it in its response, since the tablet server reports 2 as its latest TSK id.

7. The tablet server responded with 'Runtime error: ERROR_UNAVAILABLE: Not authorized: authentication token signed with unknown key' while the client tried to negotiate the connection.

--
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 24
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: No
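
The key exchange in steps 4 through 7 above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Kudu code: the classes Master and TabletServer and the methods keys_newer_than, sign_token, heartbeat and verify are hypothetical names invented for illustration. The one behavior taken from the description is that the tablet server reports its latest known TSK id and the master returns only keys with a greater id.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    """Hypothetical master: holds the TSK ids visible in its system table."""
    tsk_ids: set = field(default_factory=set)

    def keys_newer_than(self, latest_known_id):
        # Assumed behavior (per step 6): return only keys whose id is
        # greater than the id reported by the tablet server.
        return sorted(k for k in self.tsk_ids if k > latest_known_id)

    def sign_token(self):
        # Simplification: the current signing key is the highest id known.
        return max(self.tsk_ids)


@dataclass
class TabletServer:
    """Hypothetical tablet server: holds the TSK ids it has received."""
    known_tsk_ids: set = field(default_factory=set)

    def heartbeat(self, master):
        # Report the latest known id and import whatever the master returns.
        latest = max(self.known_tsk_ids, default=-1)
        self.known_tsk_ids.update(master.keys_newer_than(latest))

    def verify(self, token_key_id):
        # A token is accepted only if its signing key is known.
        return token_key_id in self.known_tsk_ids


# Steps 1-3: the tablet server has received TSKs 0 and 2 from the first master.
tserver = TabletServer(known_tsk_ids={0, 2})

# Step 4: the new leader saw only TSK 0 and generated its own TSK 1.
new_leader = Master(tsk_ids={0, 1})

# Step 5: the client's authn token is signed with TSK 1.
token_key_id = new_leader.sign_token()

# Step 6: the tablet server reports 2 as its latest id, so TSK 1 is never sent.
tserver.heartbeat(new_leader)

# Step 7: verification fails because key 1 is unknown to the tablet server.
print(tserver.verify(token_key_id))  # prints False
```

In this model the tablet server never learns about TSK 1, because it already holds a key with a higher id; this is why the token is rejected even though it was signed by the current leader.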
