We have had occasional, spasmodic reports at SETI of large numbers of tasks being inexplicably marked as 'abandoned'.
The first batch occurred between February and May 2013, and is documented in http://setiathome.berkeley.edu/forum_thread.php?id=70946

We only received reports from a small number of users, but several users were affected repeatedly. The best guess we could come up with at the time was that communication delays were interfering with the scheduler RPC sequences, but I never got enough firm evidence to file a bug report. I did, however, track down the actual recording of a task as abandoned to mark_results_over(host) in handle_request.cpp. That function is called in only two places (a simplified sketch of the flow appears at the end of this message):

    387     // If host CPID is present,
    388     // scan backwards through this user's hosts,
    389     // looking for one with the same host CPID.
    390     // If we find one, it means the user detached and reattached.
    391     // Use the existing host record,
    392     // and mark in-progress results as over.

and

    410     // One final attempt to locate an existing host record:
    411     // scan backwards through this user's hosts,
    412     // looking for one with the same host name,
    413     // IP address, processor and amount of RAM.
    414     // If found, use the existing host record,
    415     // and mark in-progress results as over.

The trouble with this - especially back in the spring - was that the hosts *hadn't* detached: the tasks were still present on the machines and being computed as normal. That left volunteers very annoyed: tasks they had spent many hours computing were rejected by the server as having no scientific value, and were given no credit reward either.

Over the last 48 hours, we have had a similar report of abandoned tasks for http://setiathome.berkeley.edu/results.php?hostid=7018660&state=6

The cause is somewhat different - the information posted in http://setiathome.berkeley.edu/forum_thread.php?id=72756 implies an unexpected host reboot, perhaps while critical files were being updated. But an analysis of the server log for the RPC transaction by host 7018660 at 12 Sep 2013, 3:20:31 UTC might throw some light on the earlier problems too.

From the evidence supplied to me by users in March and April (including by one well-respected alpha tester), it appears that the server message log entries

    396     log_messages.printf(MSG_CRITICAL,
    397         "[HOST#%d] [USER#%d] User has another host with same CPID.\n",

and/or

    422     log_messages.printf(MSG_NORMAL,
    423         "[HOST#%d] [USER#%d] Found similar existing host for this user - assigned.\n",

are not always appropriate and might be misleading. The first one, certainly - which appeared to disproportionately affect users in third-world countries with unreliable internet connections - was perceived as a slap in the face for users who had gone to considerable trouble and expense to contribute to scientific research through BOINC.

Oh, and it would be nice if critical BOINC data files like account_[project].xml weren't so vulnerable to badly-timed reboots or crashes.
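
For anyone who hasn't read handle_request.cpp, here is a minimal self-contained sketch of the lookup-and-reassign flow the quoted comments describe. It is my own paraphrase, not the BOINC source: apart from mark_results_over() and the matching criteria quoted above, every struct, field and helper name is invented for illustration.

    // Simplified illustration of the host-matching logic described above.
    // NOT the actual BOINC scheduler code: the data structures and helper
    // names are invented; only the overall flow follows the quoted comments
    // from handle_request.cpp.
    #include <string>
    #include <vector>

    struct Result {
        int  id = 0;
        bool in_progress = false;
        bool over = false;            // "abandoned": no validation, no credit
    };

    struct Host {
        int         id = 0;
        int         user_id = 0;
        std::string cpid;             // host cross-project ID reported by the client
        std::string domain_name;
        std::string ip_addr;
        std::string p_model;          // processor
        double      m_nbytes = 0;     // amount of RAM
        std::vector<Result> results;  // work currently assigned to this host
    };

    // Analogue of mark_results_over(host): every task still in progress on
    // the matched host record is declared over, i.e. abandoned server-side.
    void mark_results_over(Host& host) {
        for (Result& r : host.results) {
            if (r.in_progress) {
                r.in_progress = false;
                r.over = true;
            }
        }
    }

    // Scan backwards through the user's hosts looking for one that "looks
    // like" the requesting host, using the two criteria quoted above.
    Host* find_existing_host(std::vector<Host>& user_hosts, const Host& req) {
        // First attempt: same host CPID => assume the user detached and reattached.
        if (!req.cpid.empty()) {
            for (auto it = user_hosts.rbegin(); it != user_hosts.rend(); ++it) {
                if (it->cpid == req.cpid) return &*it;
            }
        }
        // Final attempt: same host name, IP address, processor and amount of RAM.
        for (auto it = user_hosts.rbegin(); it != user_hosts.rend(); ++it) {
            if (it->domain_name == req.domain_name &&
                it->ip_addr     == req.ip_addr     &&
                it->p_model     == req.p_model     &&
                it->m_nbytes    == req.m_nbytes) {
                return &*it;
            }
        }
        return nullptr;
    }

    // Sketch of the scheduler's decision when a request doesn't carry a
    // recognised host ID: reuse the matched record and mark its in-progress
    // results as over.
    void handle_unrecognised_host(std::vector<Host>& user_hosts, const Host& req) {
        if (Host* existing = find_existing_host(user_hosts, req)) {
            mark_results_over(*existing);   // the step volunteers see as "abandoned"
        }
        // otherwise: create a brand-new host record (not shown)
    }

The point of the sketch is that the match is purely heuristic. If a scheduler request arrives without the host ID the server expects (for example after an interrupted RPC sequence or a damaged account/client-state file), the "similar host" it finds can be the very machine that sent the request, and that machine's in-progress work is then abandoned even though nothing was ever detached.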
