[ksh93-integration-discuss] [tools-discuss] ksh93 sleep

Roland Mainz Thu, 19 Feb 2009 17:53:12 +0100

James Carlson wrote:
> Milan Jurik writes:
> > On Wed, 18 Feb 2009 at 18:56, James Carlson wrote:
> > > Nobody knows what the problem is (it hasn't been root-caused yet), or
> > > if any of the later fixes address it, but the workaround is to remove
> > > the ksh93 /usr/bin/sleep from the system and replace with the old
> > > binary.
> >
> > Which is not easy to analyze based on that particular test suite. But
> > with sleep involved and that test suite playing with NIS+ server and
> > nscd it is probably CR 6807422, which is based on sleep "~ regression",
> > CR 6807179. I will try to find some time to look at it later.
> 
> Yes, that seems like a fair bet.  In any event, meem's original point
> was not that someone needs to fix every one of these cases
> individually (obviously, they do need fixes), but instead the point is
> that as changes go, this *particular* one (changing /usr/bin/sleep)
> had a substantial blast radius.


That wasn't intended, but nothing in the 12-month testing period for
ksh93-integration update1 showed any of these problems - see below...

> Given the cost of the effects and the
> still-not-certain-we've-fixed-it-entirely state, I think he has a good
> argument for a reversion until the project team can show that the
> change actually is safe.

April, I and others ran the ksh93 test suite and the VSC test suite,
built OS/Net, SFWNV, KDE, FOX, X.org X11R6.8 + lots of Sun-internal
stuff with /sbin/sh==ksh93, and did manual testing. We had binary test
tarballs available for more than a year, pestered lots of people
within Sun to test them and fixed all bug reports we got (most of the
time between the original ksh93-integration putback and update1 was
spent on testing+bugfix cycles, over and over again). So far all
requirements (and lots of stuff beyond that) for the putback+RTI were
met (and we double-checked that).

If you now request to remove the "one change" then it would be nice to
answer the question: how should we test this if we can't get bug reports
anymore ? And how should I run all the test suites which are
locked behind Sun's firewall ? Technically your request to "show that
the change is actually safe" cannot be met 100% since no one (not
even Sun employees) has full access to all possible test suites and
usage scenarios used within Sun. There are always some bugs or nits
after an integration and we're trying to act _responsibly_ and clean
them up as fast as possible. The last cycle time from bug report here to
bugfix was less than 18 hours, and we would've been _much_ faster if this
email thread were a bit smaller: until now 91 emails in my inbox,
which acts as some kind of DDoS against me (that's why I skipped the
test suite module from the upcoming bugfix putback and deferred it to
the next one).

> As for the "moratorium" idea, the general problem is that the system
> is supposed to maintain FCS quality all the time.

... and we try to meet&&honor this rule at all costs (guess why we needed
14 months between the two ksh93 putbacks ?). After all the
testing, ksh93-integration update1 got the "green light" for integration
into OS/Net. You really can't claim that we violated that rule.

> There's no such
> concept in ON as debugging something into existence.  Instead, the
> rule is supposed to be simple: it's either ready, or it's not, and if
> it's not, it gets yanked until it is.  The reasoning behind that is
> here:
> 
>   http://opensolaris.org/os/community/on/dev_solaris/qual_death_spiral/
> 
> Thus, I think meem's questions are on target and entirely fair,
> despite the grousing to the contrary.  The question he's been raising
> is whether it's time to undo that _one_ change (not the entire wad)
> until the change can be shown to be safe.

And how should we proceed then with ksh93-integration update2 ? We can't
putback without the wrapper and the wrapper must be "mature" by that
time.

> As for the answer to the question, I think that this latest problem
> ought to be the last straw: if the fix works and can be quickly tested
> in all the scenarios that have broken, then ok.  If it fails anywhere
> for any reason, then yank it until there is a well-tested fix.
> Enough's enough.  ksh93 isn't the only project in the gate.

Right. But without being able to test anything we can't make progress.
As I said, it is _impossible_ to catch up in one single step with a
codebase which has received bugfixes for 20 years.


Finally: I really would like to work on the code and not on a giant
stream of emails. What I would really prefer right now is that people here
come up with bug reports (including steps to help us reproduce the
problems, like Casper does) and not bomb the list with further
discussions about the wrapper architecture behind /usr/bin/sleep.

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) roland.mainz at nrubsig.org
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 3992797
 (;O/ \/ \O;)
