Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
......................................................................

Patch Set 24:

> > Given that we're still chasing strange test failures on this, and
> > it's on a pretty important and tricky code path, maybe we should
> > chat about whether it's really necessary for 1.3.0? i.e., are the
> > downside risks of not having it included in the release worse than
> > the downside risks of potential bugs? I haven't followed it closely
> > but as the 1.3 RM I'm feeling nervous about complex patches coming
> > in very close to the first rc being cut (hoping to do that
> > tomorrow)
>
> Yes, that's a very good point. However, I think I understand what
> the issue is. Upon a master leadership change, the new leader
> sometimes does not see the last successful write from the former
> leader. That bug can affect table/tablet metadata as well. I.e., a
> newly created tablet could be overlooked at leadership change, and
> it will be seen only on the next call of ElectedAsLeaderCb.
>
> E.g., it's possible to take a look at the log from
> https://kudu-test-results.s3.amazonaws.com/aserbin.1489032335.25361.f2e8aa26b74185a6bd16d5d554488e6f1af190f5.13.0-artifacts.zip
>
> This is what I think happened there:
>
> 1. The master at 127.0.0.1:11032 generated and successfully wrote
> TSK with id 0 (I0309 04:05:44.679442 558 catalog_manager.cc:3432]
> Generated new TSK 0). Later on, a re-election happened and it was
> elected as leader again. It generated TSK with id 1, but since
> there is injected latency prior to writing the key into the table,
> it failed to write it into the system table due to a leadership
> term change.
>
> 2. Some other leader started its leadership duties but failed to
> catch up as a leader.
>
> 3. Our former (first) master server became the leader again and it
> generated and successfully wrote TSK with id 2 (I0309
> 04:05:46.812899 668 catalog_manager.cc:3432] Generated new TSK 2)
>
> 4. Right after that, leadership changed and another master server
> ran its ElectedAsLeaderCb, and it did not see the latest TSK record
> in the system table. Seeing just the record with TSK id 0, it
> generated and successfully wrote its new TSK with id 1 (I0309
> 04:05:47.270205 378 catalog_manager.cc:3432] Generated new TSK 1)
>
> 5. Now, the client has connected to the current master leader, which
> has just generated a TSK and made it current (TSK rotation period is
> 2 seconds). It got an authn token signed by TSK with id 1.
>
> 6. The client tries to execute a write operation against the tablet
> server which has received TSKs with id 0 and 2. The tablet server
> cannot see the TSK with id 1 because the new master does not send
> it in response, since the tablet server sends 2 as the latest TSK
> id.
>
> 7. The tablet server responded with 'Runtime error:
> ERROR_UNAVAILABLE: Not authorized: authentication token signed with
> unknown key' while the client tried to negotiate the connection.

Instead of the link to the artifacts it's possible to use
http://dist-test.cloudera.org//job?job_id=aserbin.1489032335.25361
and retrieve the artifacts of the very first failure in the list.

--
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 24
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: No
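
For reference, the key mismatch described in steps 4-7 of the quoted analysis can be modelled in a few lines. This is only a rough sketch: the function names and the set-based bookkeeping are hypothetical and are not taken from the Kudu code base. It assumes only what the analysis states, namely that the master returns keys newer than the latest TSK id reported by the tablet server.

```python
# Illustrative model only: all names here are hypothetical, not Kudu's API.

def keys_to_send(master_keys, latest_seen_by_tserver):
    """TSK ids the master would return: only those newer than the
    highest id the tablet server reports already having."""
    return sorted(k for k in master_keys if k > latest_seen_by_tserver)


def token_accepted(tserver_keys, token_key_id):
    """A tablet server accepts a token only if it knows the signing key."""
    return token_key_id in tserver_keys


# The former leader wrote TSK 0 and TSK 2; the tablet server has both.
tserver_keys = {0, 2}

# The new leader saw only TSK 0 in the system table and generated TSK 1.
new_leader_keys = {0, 1}
token_key_id = 1  # the client's authn token is signed with TSK 1

# The tablet server reports 2 as its latest TSK id, so nothing is returned.
delivered = keys_to_send(new_leader_keys, max(tserver_keys))
tserver_keys.update(delivered)

print("keys delivered to tablet server:", delivered)
print("token accepted:", token_accepted(tserver_keys, token_key_id))
```

With these inputs no keys are delivered and the token is rejected, which matches the 'authentication token signed with unknown key' error in step 7.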
