On Sat, Aug 09, 2008 at 07:33:37AM -0700, Richard Elling wrote:
> isaac wrote:
> > Mike,
> >
> > This is great!
> >
> > Also looking for data encompassing newer types of device failure
> > conditions & possibly new customer experiences.
> >
>
> We do collect such information from Explorer, but I'm not aware of
> anyone performing analysis on the data. One issue that makes such
> analysis more difficult is that enhancements and changes to the
> diagnosis engines are a continuous process, so if you look in the field
> data for a specific diagnosis, then it is difficult to determine the actual
> installed base. For the MPR work, which rolled out many years ago,
> we have a good feeling for the installed base, and it made a rather
> noticeable change in the service rates. For less frequent diagnoses,
> it will be more difficult to see the change in service rates. For
> example, I looked at CPU retirements about a year ago and for the
> entire set of Niagara1 systems for which we have data (thousands) we
> saw a grand total of 2 offlined (psrinfo) CPUs. Incidentally, the
> root cause for both was the L2 cache, not the cores themselves.
> This represents a very low incidence rate, which is a good thing.
> -- richard

This is an excellent point, and one which you also want to drive home
to the customer. FMA is really about three things:

1. The system should do everything possible to keep itself running.
   Sun has invented new technologies for that, unique in the industry,
   like Memory Page Retire, and has a large family of other things like
   cpu and core offline, i/o retire, and so forth.

2. When something is broken, you get a clear service interaction:
   i.e. what to replace. Not a spew of error messages. This means
   fastest diagnosis, and a high likelihood of the correct repair.

3. Sun has a data-driven feedback loop to monitor the quality of its
   products in the field. So even where we're doing well, as in
   Richard's Niagara1 example above, we have the infrastructure in
   place to get the data we need should anything ever go wrong. As an
   example, suppose a component supplier delivered a bad lot of
   parts -- with the FMA data feed in place we could quickly identify
   the common thread amongst a set of problems in the field and then
   help affected customers.

So even when things seem happy, #3 is critical to ensuring the best
quality experience for our customers: it's like the disaster insurance
policy you hope you never need.

-Mike

-- 
Mike Shapiro, Sun Microsystems Fishworks. blogs.sun.com/mws/
_______________________________________________
fm-discuss mailing list
fm-discuss@opensolaris.org