I'd love to see something like this built into BOINC apps, and activated by default - one problem is that the app developers who need it most are possibly the ones least likely to enable a non-default option in the API library code.
BUT, I'm worried about basing 'something useful' on fraction_done updates. Some projects issue applications - the AQUA 'ROQS' application is a current case in point - where fraction_done makes huge quantum jumps at infrequent intervals (I'm talking several hours apart). But they are usually running - unless waiting for memory - and they should just be allowed to continue. I suspect they would fail Joe's test.

ROQS apps do checkpoint regularly, at the defined intervals, while apparently making no progress. Would a truly stalled app still checkpoint? If not, could adding a second test for a recent checkpoint help to decide the matter? Only if BOTH 'no progress' AND 'no checkpoint' would we decide to give the app a restart kick.

----- Original Message -----
From: "Josef W. Segur" <[email protected]>
To: "David Anderson" <[email protected]>; <[email protected]>
Sent: Thursday, April 14, 2011 5:54 AM
Subject: [boinc_dev] check_progress option

> Users find it discouraging to check BOINC and find that an application hasn't
> made any progress in hours, and though the eventual cutoff based on
> rsc_fpops_bound is needed it is hardly the best we can do. IMO what I'm
> suggesting will be an improvement.
>
> The proposed change provides an option for the timer thread to check whether
> a science application seems to still be doing something useful. It's based on
> the assumption that correct operation will update the fraction_done
> frequently, and if that doesn't happen within a reasonable time the
> application should be shut down. That's done like the no heartbeat case,
> since at least some cases can be cured by a restart. Even if it's not a
> direct help, having BOINC trying to correct the situation ought to be less
> discouraging to users.
>
> I've based the "reasonable time" on the rsc_fpops_est/host_info.p_fpops
> runtime approximation divided by 100. Although that's not in any sense
> accurate it does provide for old slow systems. If the values to calculate
> that time are not available the period is defaulted to 1800 seconds, and on
> the short end there's a minimum of 120 seconds. The actual count used is
> based on the running_interrupt_count value of course, to exclude time when
> the application is suspended.
>
> I've defaulted the option off so application builds using trunk code won't
> have the feature unless a project decides to use it. The changes needed are
> in the attached diffs. I've done some testing with builds of the S@H v7 Beta
> application including those changes plus code to simulate an unintended
> looping condition. That is, the change builds and runs as I intended.
> --
> Joe
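
For what it's worth, here is a rough sketch of how the two-test idea could sit on top of Joe's "reasonable time" calculation. It is not taken from Joe's diffs or from the BOINC API library: the function names stall_timeout_secs and should_restart are hypothetical, and I'm assuming both "seconds since" values are counted only while the app is actually running, as Joe describes for the running_interrupt_count based count.

```cpp
#include <algorithm>

// Hypothetical helpers for illustration only; these names do not exist in BOINC.

// "Reasonable time" in seconds, as described in Joe's message:
// (rsc_fpops_est / p_fpops) / 100, with a 120 s minimum, and an
// 1800 s default when the inputs needed for the estimate are missing.
double stall_timeout_secs(double rsc_fpops_est, double p_fpops) {
    const double kDefaultSecs = 1800.0;
    const double kMinSecs = 120.0;
    if (rsc_fpops_est <= 0.0 || p_fpops <= 0.0) {
        return kDefaultSecs;  // values not available
    }
    const double est_runtime_secs = rsc_fpops_est / p_fpops;
    return std::max(kMinSecs, est_runtime_secs / 100.0);
}

// Suggested second test: restart only if BOTH the fraction_done update
// AND the last checkpoint are older than the timeout. Both arguments are
// assumed to be running time, excluding periods when the app is suspended.
bool should_restart(double secs_since_progress,
                    double secs_since_checkpoint,
                    double timeout_secs) {
    return secs_since_progress > timeout_secs &&
           secs_since_checkpoint > timeout_secs;
}
```

With this shape, a ROQS-like task that has not moved fraction_done for hours but checkpointed a minute ago gives should_restart(...) == false and is left alone, while a task that is stale on both counts gets the restart kick. Whether a truly stalled app really stops checkpointing is still the open question above.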
