James Carlson wrote:
> Milan Jurik writes:
> > On Wed, 18. 02. 2009 at 18:56, James Carlson wrote:
> > > Nobody knows what the problem is (it hasn't been root-caused yet), or
> > > if any of the later fixes address it, but the workaround is to remove
> > > the ksh93 /usr/bin/sleep from the system and replace with the old
> > > binary.
> >
> > Which is not easy to analyze based on that particular test suite. But
> > with sleep involved and that test suite playing with NIS+ server and
> > nscd it is probably CR 6807422, which is based on sleep "regression",
> > CR 6807179. I will try to find some time to look at it later.
>
> Yes, that seems like a fair bet. In any event, meem's original point
> was not that someone needs to fix every one of these cases
> individually (obviously, they do need fixes), but instead the point is
> that as changes go, this *particular* one (changing /usr/bin/sleep)
> had a substantial blast radius.

Which wasn't intended, but nothing in the 12-month testing period for
ksh93-integration update1 showed any of these problems - see below...

> Given the cost of the effects and the
> still-not-certain-we've-fixed-it-entirely state, I think he has a good
> argument for a reversion until the project team can show that the
> change actually is safe.

April, others and I ran the ksh93 test suite and the VSC test suite,
built OS/Net, SFWNV, KDE, FOX, X.org X11R6.8 and lots of Sun-internal
stuff with /sbin/sh==ksh93, and did manual testing. We had binary test
tarballs available for more than a year, pestered lots of people within
Sun to test them, and fixed all bug reports we got (most of the time
between the original ksh93-integration putback and update1 was spent on
testing+bugfix cycles, over and over again).
So far all requirements (and lots of stuff beyond that) for the
putback+RTI were met (and we double-checked that).

If you now request to remove the "one change", then it would be nice to
answer the question: how should we test this if we can't get bug
reports anymore? And how should I run all the test suites which are
locked behind Sun's firewall? Technically your request to "show that
the change is actually safe" cannot be met 100.0%, since no one (not
even Sun employees) has full access to all possible test suites and
usage scenarios used within Sun.
There are always some bugs or nits after an integration, and we're
trying to act _responsibly_ and clean them up as fast as possible. The
last cycle time from bug report here to bugfix was less than 18 hours,
and we would've been _much_ faster if this email thread were a bit
smaller. Until now I have 91 emails in my InBox, which acts as some
kind of DDoS against me; that's why I skipped the test suite module
from the upcoming bugfix putback and deferred it to the next one.

> As for the "moratorium" idea, the general problem is that the system
> is supposed to maintain FCS quality all the time.

...
and we try to meet&&honor this rule at all cost (guess why we needed
14 months between the two ksh93 putbacks?). After all that testing,
ksh93-integration update1 got the "green light" for integration into
OS/Net. You really can't claim that we violated that rule.

> There's no such
> concept in ON as debugging something into existence. Instead, the
> rule is supposed to be simple: it's either ready, or it's not, and if
> it's not, it gets yanked until it is. The reasoning behind that is
> here:
>
> http://opensolaris.org/os/community/on/dev_solaris/qual_death_spiral/
>
> Thus, I think meem's questions are on target and entirely fair,
> despite the grousing to the contrary. The question he's been raising
> is whether it's time to undo that _one_ change (not the entire wad)
> until the change can be shown to be safe.

And how should we proceed then with ksh93-integration update2? We can't
putback without the wrapper, and the wrapper must be "mature" by that
time.

> As for the answer to the question, I think that this latest problem
> ought to be the last straw: if the fix works and can be quickly tested
> in all the scenarios that have broken, then ok. If it fails anywhere
> for any reason, then yank it until there is a well-tested fix.
> Enough's enough. ksh93 isn't the only project in the gate.

Right. But without being able to test anything we can't make progress.
As I said, it is _impossible_ to catch up in one single step with a
codebase which has received bugfixes for 20 years.

Finally: I really would like to work on the code and not on a giant
stream of emails. What I really would prefer right now is that people
here come up with bug reports (including steps to help us reproduce
the problems, like Casper does) and not bomb the list with further
discussions about the wrapper architecture behind /usr/bin/sleep.

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) roland.mainz at nrubsig.org
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 3992797
 (;O/ \/ \O;)