Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
......................................................................


Patch Set 24:

> > Given that we're still chasing strange test failures on this, and
 > > it's on a pretty important and tricky code path, maybe we should
 > > chat about whether it's really necessary for 1.3.0? i.e. are the
 > > downside risks of not having it included in the release worse than
 > > the downside risks of potential bugs? I haven't followed it closely
 > > but as the 1.3 RM I'm feeling nervous about complex patches coming
 > > in very close to the first rc being cut (hoping to do that
 > > tomorrow)
 > 
 > Yes, that's a very good point.  However, I think I understand what
 > the issue is: upon a master leadership change, the new leader
 > sometimes does not see the last successful write from the former
 > leader.  That bug can affect table/tablet metadata as well.  I.e.,
 > a newly created tablet could be overlooked at a leadership change
 > and be seen only on the next call of ElectedAsLeaderCb.
 > 
 > E.g., take a look at the log from
 > https://kudu-test-results.s3.amazonaws.com/aserbin.1489032335.25361.f2e8aa26b74185a6bd16d5d554488e6f1af190f5.13.0-artifacts.zip
 > 
 > This is what I think happened there:
 > 
 > 1. The master at 127.0.0.1:11032 generated and successfully wrote
 > TSK with id 0 (I0309 04:05:44.679442   558 catalog_manager.cc:3432]
 > Generated new TSK 0).  Later on, a re-election happened and it was
 > elected leader again.  It generated TSK with id 1, but because of
 > the latency injected before the key is written into the system
 > table, the write failed due to a leadership term change.
 > 
 > 2. Some other leader started its leadership duties but failed to
 > catch up as a leader.
 > 
 > 3. Our former (first) master server became the leader again and it
 > generated and successfully wrote TSK with id 2 (I0309
 > 04:05:46.812899   668 catalog_manager.cc:3432] Generated new TSK 2).
 > 
 > 4. Right after that, leadership changed and another master server
 > ran its ElectedAsLeaderCb, but it did not see the latest TSK record
 > in the system table.  Seeing just the record with TSK id 0, it
 > generated and successfully wrote its new TSK with id 1 (I0309
 > 04:05:47.270205   378 catalog_manager.cc:3432] Generated new TSK 1).
 > 
 > 5. Now, the client has connected to the current master leader, which
 > has just generated a TSK and made it current (the TSK rotation
 > period is 2 seconds).  It got an authn token signed by TSK with id 1.
 > 
 > 6. The client tries to execute a write operation against the tablet
 > server, which has received TSKs with ids 0 and 2.  The tablet server
 > cannot see the TSK with id 1 because the new master does not send
 > it in its response, since the tablet server reports 2 as the latest
 > TSK id.
 > 
 > 7. The tablet server responded with 'Runtime error:
 > ERROR_UNAVAILABLE: Not authorized: authentication token signed with
 > unknown key' while the client tried to negotiate the connection.

Instead of the direct link to the artifacts, it's also possible to open
http://dist-test.cloudera.org//job?job_id=aserbin.1489032335.25361 and retrieve
the artifacts of the very first failure in the list.
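
For anyone who wants to reason about the sequence in steps 1-7 without
digging through the logs, here is a minimal Python sketch of it.  It is
only a toy model under simplifying assumptions: the function and variable
names (next_tsk_id, keys_to_send, stale_view, and so on) are hypothetical
and are not Kudu APIs, and it assumes that a new leader picks the next
key id from the largest id it can see and that the master returns only
keys newer than the latest id the tablet server reports.

```python
# Toy model of the scenario; all names here are hypothetical, not Kudu code.

def next_tsk_id(visible_ids):
    """Assumed rule: a new leader uses (largest visible TSK id) + 1."""
    return max(visible_ids, default=-1) + 1


def keys_to_send(master_keys, tserver_latest_id):
    """Assumed rule: the master sends only keys newer than what the
    tablet server reports as its latest TSK id."""
    return sorted(k for k in master_keys if k > tserver_latest_id)


# Steps 1 and 3: the first master wrote TSK 0 and TSK 2, and the tablet
# server has received both.
tserver_keys = {0, 2}

# Step 4: the new leader does not see the last write, only TSK 0.
stale_view = {0}
new_id = next_tsk_id(stale_view)
master_keys = stale_view | {new_id}

# Step 6: the tablet server reports 2 as its latest id, so the master
# sends nothing back and TSK 1 never reaches the tablet server.
received = keys_to_send(master_keys, max(tserver_keys))
tserver_keys |= set(received)

# Steps 5 and 7: the client's authn token is signed by the new key,
# which the tablet server cannot verify.
token_key_id = new_id
can_verify = token_key_id in tserver_keys

print("new leader generated TSK", new_id)
print("keys sent to tablet server:", received)
print("tablet server can verify token:", can_verify)
```

In this model the token ends up signed by a key id that is lower than the
latest one the tablet server already holds, which is why the key is never
delivered.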

-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 24
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: No
