Indeed, that would be a good client feature, but it should be applied
ONLY when the failed request asked for work. I believe most failed
requests that produce lost tasks actually get an HTTP error 500 rather
than a timeout, so perhaps that case should also be considered.

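Roughly, the client-side rule I mean could look like the C++ sketch below. All names in it (FailureKind, ProjectSched, on_request_failed, take_lost_task_flag, example_sequence) are hypothetical and are not the real BOINC client code. It only shows the flag being armed when a failed request asked for work and failed by timeout or HTTP 500, and then being sent on the next request to that project only:

```cpp
#include <cassert>

// Hypothetical sketch only; these names are illustrative, not the
// actual BOINC client's data structures or functions.
enum class FailureKind { Timeout, Http500, Other };

// Per-project scheduler state kept by the client.
struct ProjectSched {
    // Armed after a failed work request; sent on the next request only.
    bool request_lost_task_check = false;
};

// Called when a scheduler request to this project fails.
// Only a request that asked for work can have produced lost tasks,
// and either a timeout or an HTTP 500 may leave tasks assigned on the
// server without the reply ever reaching the host.
inline void on_request_failed(ProjectSched& p, bool asked_for_work,
                              FailureKind why) {
    if (asked_for_work &&
        (why == FailureKind::Timeout || why == FailureKind::Http500)) {
        p.request_lost_task_check = true;
    }
}

// Called while building the next request to the SAME project: returns
// whether to ask the server for a lost-task search, and clears the flag
// so that only this one connection carries it.
inline bool take_lost_task_flag(ProjectSched& p) {
    const bool flag = p.request_lost_task_check;
    p.request_lost_task_check = false;
    return flag;
}

// Example: a timed-out work request arms the flag for exactly one retry.
inline void example_sequence() {
    ProjectSched p;
    on_request_failed(p, /*asked_for_work=*/true, FailureKind::Timeout);
    assert(take_lost_task_flag(p));   // next request carries the flag
    assert(!take_lost_task_flag(p));  // the one after does not
}
```

Clearing the flag as it is read is what keeps the server's search limited to that one connection, as jm7 suggested.
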
Given the delays involved in getting client changes adopted, which may be
even longer for 6.12.x than for earlier series because so many users
dislike the Manager changes, I hope the simpler and quicker server-side
change, detecting re-reported results in sched_result.cpp, will also be
implemented. If both are implemented, that detection will usually just
duplicate the client's flag, but reported results are handled before the
work request, so combining the two should pose no difficulty.
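Similarly, here is a hypothetical C++ sketch of the server-side detection. It is not actual sched_result.cpp code: the struct, function and flag names are my own, and the database lookup is replaced by an in-memory set of result names already reported as success. A re-reported result arms the flag while reported results are handled, and the later work-request step consults it, OR'd with a client flag if one exists:

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical sketch only; not actual sched_result.cpp code. The
// database is stood in for by a set of result names already reported
// as success.
struct SchedulerRequest {
    std::vector<std::string> reported_results;  // completed results the host reports
};

struct RequestState {
    bool results_handled = false;     // reported results processed yet?
    bool resend_lost_needed = false;  // duplicate report seen in this request
};

// Runs first, since reported results are handled before the work request.
// A result already reported as success being reported again is prima
// facie evidence that the host never received the previous reply.
inline void handle_reported_results(
        const SchedulerRequest& req,
        std::unordered_set<std::string>& already_reported,
        RequestState& st) {
    for (const std::string& name : req.reported_results) {
        // insert() returns {iterator, false} if the name was already present.
        if (!already_reported.insert(name).second) {
            st.resend_lost_needed = true;
        }
    }
    st.results_handled = true;
}

// Runs while handling the work request: do the DB-expensive lost-task
// search only if this request re-reported a result or the client asked.
inline bool should_check_lost_tasks(const RequestState& st, bool client_flag) {
    assert(st.results_handled &&
           "reported results must be handled before the work request");
    return st.resend_lost_needed || client_flag;
}
```

In the host 2818173 example quoted below, the retry re-reported the two completed tasks, so a check like this would have fired on exactly the request that could have resent the three lost Astropulse tasks.
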
--
Joe
On 11 Aug 2010 at 10:35, [email protected] wrote:
> I noticed this line:
>
> 8/9/10 22:06:47 s...@home Scheduler request failed: Timeout was reached
>
> If that were used to set a flag, sent to the server on the next connection
> (and ONLY the next connection to that project), requesting a search for
> missing results, the server could do that search in a much more focused
> manner than the original method of doing it on every single connection.
> This might reduce the DB cost enough to make it workable on s...@h.
>
> jm7
>
> From: "Josef W. Segur" <jse...@westelcom.com>
> Sent by: <boinc_dev-bounce[email protected]>
> Date: 08/11/2010 08:47 AM
> To: David Anderson <[email protected]>, [email protected]
> Subject: [boinc_dev] Lost tasks
>
> On January 20, 2008 I started a "resend_lost_results improvement?" thread
> here, which eventually arrived at what looked like some practical methods
> to avoid the problem. Unfortunately, none have been implemented, and the
> s...@h database may now hold on the order of 1 million result records
> marked "in progress" that are not actually in progress.
>
> In addition to the ideas discussed then, I've observed on one of my
> systems a sequence of events that the logic in sched_result.cpp could
> easily use to flag a need to compare what is actually on the host with
> the database. Here are some messages from host 2818173
> <URL: http://setiathome.berkeley.edu/show_host_detail.php?hostid=2818173 >
> and trimmed selections from its web task list:
>
> -----------------------------------------------------------
> 8/9/10 22:01:08 s...@home work fetch resumed by user
> 8/9/10 22:01:39 Resuming network activity
> 8/9/10 22:01:39 s...@home Sending scheduler request: To fetch work.
> 8/9/10 22:01:39 s...@home Reporting 2 completed tasks, requesting new tasks
> 8/9/10 22:06:47 s...@home Scheduler request failed: Timeout was reached
> 8/9/10 22:07:47 s...@home Sending scheduler request: To fetch work.
> 8/9/10 22:07:47 s...@home Reporting 2 completed tasks, requesting new tasks
> 8/9/10 22:09:25 s...@home Scheduler request completed: got 3 new tasks
> 8/9/10 22:09:28 s...@home Started download of ap_06my10af_B4_P0_00047_20100809_26735.wu
> 8/9/10 22:09:28 s...@home Started download of ap_06my10af_B4_P0_00039_20100809_26735.wu
> 8/9/10 22:09:30 s...@home Started download of ap_12ja10aa_B4_P1_00299_20100720_12222.wu
>
> Tasks for computer 2818173
>
> Task        Work unit  Sent                     Status       Application
> 1680067243  635381158  10 Aug 2010 2:09:50 UTC  In progress  Astropulse v505
> 1680067092  642626371  10 Aug 2010 2:09:49 UTC  In progress  Astropulse v505
> 1680066966  642626352  10 Aug 2010 2:09:50 UTC  In progress  Astropulse v505
> 1680062556  642625860  10 Aug 2010 2:03:36 UTC  In progress  Astropulse v505
> 1680062552  642515167  10 Aug 2010 2:03:36 UTC  In progress  Astropulse v505
> 1680062479  642625851  10 Aug 2010 2:03:36 UTC  In progress  Astropulse v505
>
> Task        Work unit  Reported                 Status       Application
> 1677856015  641623392  10 Aug 2010 2:03:36 UTC  Completed    s...@home Enhanced
> 1677855960  641623387  10 Aug 2010 2:03:36 UTC  Completed    s...@home Enhanced
> -----------------------------------------------------------
>
> So the first Scheduler request successfully reported the two completions,
> and three new Astropulse tasks were "Sent" at 2:03:36 UTC, but that reply
> never reached my system.
>
> The point is that the next Scheduler request REreported tasks, and that's
> prima facie evidence that my system had not received a previous reply. Of
> course not all requests are accompanied by reported completions, nor do
> all replies "Send" work, but it seems a shame not to act on that evidence.
>
> Of course, I also preferred the earlier sched_result.cpp logic, which sent
> messages back to the client saying "Completed result %s refused: result
> already reported as success". Changeset 21671 removed that information,
> and now even logging it on the server requires debug_handle_results to be
> set. It's very much like sweeping dirt under the rug.
> --
> Joe
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.