We have had occasional, spasmodic reports at SETI of large numbers of tasks being inexplicably marked as 'abandoned'.
The first batch occurred between February and May 2013, and is documented in http://setiathome.berkeley.edu/forum_thread.php?id=70946

We only received reports from a small number of users, but several users were affected repeatedly. The best guess we could come up with at the time was that communication delays were interfering with the scheduler RPC sequences, but I never got enough firm evidence to file a bug report. I did, however, track down the actual recording of a task as abandoned to mark_results_over(host) in handle_request.cpp. That function is called in only two places (a simplified sketch of the flow appears at the end of this message):

    387     // If host CPID is present,
    388     // scan backwards through this user's hosts,
    389     // looking for one with the same host CPID.
    390     // If we find one, it means the user detached and reattached.
    391     // Use the existing host record,
    392     // and mark in-progress results as over.

and

    410     // One final attempt to locate an existing host record:
    411     // scan backwards through this user's hosts,
    412     // looking for one with the same host name,
    413     // IP address, processor and amount of RAM.
    414     // If found, use the existing host record,
    415     // and mark in-progress results as over.

The trouble with this - especially back in the spring - was that the hosts *hadn't* detached: the tasks were still present on the machines and being computed as normal. That left volunteers very annoyed: tasks they had spent many hours computing were rejected by the server as having no scientific value, and were given no credit reward either.

Over the last 48 hours, we have had a similar report of abandoned tasks for http://setiathome.berkeley.edu/results.php?hostid=7018660&state=6

The cause is somewhat different - the information posted in http://setiathome.berkeley.edu/forum_thread.php?id=72756 implies an unexpected host reboot, perhaps while critical files were being updated. But an analysis of the server log for the RPC transaction by host 7018660 at 12 Sep 2013, 3:20:31 UTC might throw some light on the earlier problems too.

From the evidence supplied to me by users in March and April (including by one well-respected alpha tester), it appears that the server message log entries

    396     log_messages.printf(MSG_CRITICAL,
    397         "[HOST#%d] [USER#%d] User has another host with same CPID.\n",

and/or

    422     log_messages.printf(MSG_NORMAL,
    423         "[HOST#%d] [USER#%d] Found similar existing host for this user - assigned.\n",

are not always appropriate and might be misleading. The first one, certainly - which appeared to disproportionately affect users in third-world countries with unreliable internet connections - was perceived as a slap in the face for users who had gone to considerable trouble and expense to contribute to scientific research through BOINC.

Oh, and it would be nice if critical BOINC data files like account_[project].xml weren't so vulnerable to badly-timed reboots or crashes.
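
For anyone who hasn't read handle_request.cpp, here is a minimal self-contained sketch of the lookup-and-reassign flow the quoted comments describe. It is my own paraphrase, not the BOINC source: apart from mark_results_over() and the matching criteria quoted above, every struct, field and helper name is invented for illustration.

    // Simplified illustration of the host-matching logic described above.
    // NOT the actual BOINC scheduler code: the data structures and helper
    // names are invented; only the overall flow follows the quoted comments
    // from handle_request.cpp.
    #include <string>
    #include <vector>

    struct Result {
        int  id = 0;
        bool in_progress = false;
        bool over = false;            // "abandoned": no validation, no credit
    };

    struct Host {
        int         id = 0;
        int         user_id = 0;
        std::string cpid;             // host cross-project ID reported by the client
        std::string domain_name;
        std::string ip_addr;
        std::string p_model;          // processor
        double      m_nbytes = 0;     // amount of RAM
        std::vector<Result> results;  // work currently assigned to this host
    };

    // Analogue of mark_results_over(host): every task still in progress on
    // the matched host record is declared over, i.e. abandoned server-side.
    void mark_results_over(Host& host) {
        for (Result& r : host.results) {
            if (r.in_progress) {
                r.in_progress = false;
                r.over = true;
            }
        }
    }

    // Scan backwards through the user's hosts looking for one that "looks
    // like" the requesting host, using the two criteria quoted above.
    Host* find_existing_host(std::vector<Host>& user_hosts, const Host& req) {
        // First attempt: same host CPID => assume the user detached and reattached.
        if (!req.cpid.empty()) {
            for (auto it = user_hosts.rbegin(); it != user_hosts.rend(); ++it) {
                if (it->cpid == req.cpid) return &*it;
            }
        }
        // Final attempt: same host name, IP address, processor and amount of RAM.
        for (auto it = user_hosts.rbegin(); it != user_hosts.rend(); ++it) {
            if (it->domain_name == req.domain_name &&
                it->ip_addr     == req.ip_addr     &&
                it->p_model     == req.p_model     &&
                it->m_nbytes    == req.m_nbytes) {
                return &*it;
            }
        }
        return nullptr;
    }

    // Sketch of the scheduler's decision when a request doesn't carry a
    // recognised host ID: reuse the matched record and mark its in-progress
    // results as over.
    void handle_unrecognised_host(std::vector<Host>& user_hosts, const Host& req) {
        if (Host* existing = find_existing_host(user_hosts, req)) {
            mark_results_over(*existing);   // the step volunteers see as "abandoned"
        }
        // otherwise: create a brand-new host record (not shown)
    }

The point of the sketch is that the match is purely heuristic. If a scheduler request arrives without the host ID the server expects (for example after an interrupted RPC sequence or a damaged account/client-state file), the "similar host" it finds can be the very machine that sent the request, and that machine's in-progress work is then abandoned even though nothing was ever detached.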
