I noticed this line:

8/9/10 22:06:47 s...@home   Scheduler request failed: Timeout was reached

If that event set a flag that is sent to the server on the next connection
(and ONLY the next connection to that project), asking for a search for
missing results, the server could do that search in a much more focused
manner than the original method of running it on every single connection.
This might reduce the DB costs enough to make it work on s...@h.
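
A rough sketch of the client side of that idea, in Python rather than the
client's own language, and only under assumptions: ProjectState, Client and
the check_lost_results field are hypothetical names made up for illustration,
not existing BOINC structures or request fields. The flag is set when a
scheduler request fails, goes out with the next request to that same project,
and is cleared as soon as it has been sent once.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectState:
    """Hypothetical per-project client state (not a real BOINC structure)."""
    url: str
    check_lost_results: bool = False  # set after a failed scheduler request


@dataclass
class Client:
    """Hypothetical client holding one ProjectState per attached project."""
    projects: dict = field(default_factory=dict)

    def on_scheduler_failure(self, url: str) -> None:
        # A timeout means a reply may have been lost, so request a check
        # on the next connection to this project only.
        self.projects[url].check_lost_results = True

    def build_request(self, url: str) -> dict:
        project = self.projects[url]
        request = {
            "project": url,
            "check_lost_results": project.check_lost_results,
        }
        # Send the flag once. If this request also fails,
        # on_scheduler_failure() will set it again.
        project.check_lost_results = False
        return request
```

Clearing the flag when the request is built, instead of when a reply arrives,
keeps the search limited to one connection per failure, which is the point of
the proposal.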

jm7


From:    "Josef W. Segur" <jse...@westelcom.com>
Sent by: <boinc_dev-bounce[email protected]u>
To:      David Anderson <[email protected]>, [email protected]
Date:    08/11/2010 08:47 AM
Subject: [boinc_dev] Lost tasks

January 20, 2008 I started a "resend_lost_results improvement?" thread
here, which eventually came down to what looked like some practical
methods to avoid the problem. Unfortunately none have been implemented
and the s...@h database may now have on the order of 1 million result
records indicating "in progress" which are not actually so.

In addition to ideas discussed then, I've observed on one of my systems
a sequence which the logic in sched_result.cpp could easily use to flag
a need to check what is actually on the host against the database. Here
are some messages from host 2818173
<URL: http://setiathome.berkeley.edu/show_host_detail.php?hostid=2818173 >
and trimmed selections from its web task list:

-----------------------------------------------------------
8/9/10 22:01:08 s...@home   work fetch resumed by user
8/9/10 22:01:39     Resuming network activity
8/9/10 22:01:39 s...@home   Sending scheduler request: To fetch work.
8/9/10 22:01:39 s...@home   Reporting 2 completed tasks, requesting new tasks
8/9/10 22:06:47 s...@home   Scheduler request failed: Timeout was reached
8/9/10 22:07:47 s...@home   Sending scheduler request: To fetch work.
8/9/10 22:07:47 s...@home   Reporting 2 completed tasks, requesting new tasks
8/9/10 22:09:25 s...@home   Scheduler request completed: got 3 new tasks
8/9/10 22:09:28 s...@home   Started download of ap_06my10af_B4_P0_00047_20100809_26735.wu
8/9/10 22:09:28 s...@home   Started download of ap_06my10af_B4_P0_00039_20100809_26735.wu
8/9/10 22:09:30 s...@home   Started download of ap_12ja10aa_B4_P1_00299_20100720_12222.wu



tasks for computer 2818173

Task        Work unit   Sent                     Status       Application
1680067243  635381158   10 Aug 2010 2:09:50 UTC  In progress  Astropulse v505
1680067092  642626371   10 Aug 2010 2:09:49 UTC  In progress  Astropulse v505
1680066966  642626352   10 Aug 2010 2:09:50 UTC  In progress  Astropulse v505
1680062556  642625860   10 Aug 2010 2:03:36 UTC  In progress  Astropulse v505
1680062552  642515167   10 Aug 2010 2:03:36 UTC  In progress  Astropulse v505
1680062479  642625851   10 Aug 2010 2:03:36 UTC  In progress  Astropulse v505


Task        Work unit   Reported                 Status       Application
1677856015  641623392   10 Aug 2010 2:03:36 UTC  Completed    s...@home Enhanced
1677855960  641623387   10 Aug 2010 2:03:36 UTC  Completed    s...@home Enhanced
-----------------------------------------------------------

So the first Scheduler request succeeded on the server side: the two
completions were reported and three new Astropulse tasks were "Sent" at
2:03:36 UTC, but that reply didn't get to my system.

The point is that the next Scheduler request REreported tasks, and that's
prima facie evidence that my system had not received a previous reply. Of
course not all requests are accompanied by reported completions, nor do
all replies "Send" work, but it seems a shame not to act on that evidence.
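
To make that concrete, here is a minimal Python sketch of the check. It is
not the actual sched_result.cpp logic. The function name, the two state
strings and the dictionary standing in for the result table are all
hypothetical, and it assumes the request carries a list of the results the
host currently holds. A re-reported result triggers the comparison; without
one, no search is done.

```python
def handle_request(db_results, reported, on_host):
    """Process one scheduler request for a single host (illustrative only).

    db_results: dict of result name -> "in_progress" or "over" (hypothetical
                stand-in for the server's result table for this host)
    reported:   names the client reports as completed
    on_host:    names the client says it currently holds
    Returns (accepted, lost).
    """
    accepted = []
    rereported = False
    for name in reported:
        state = db_results.get(name)
        if state == "over":
            # Already recorded as reported: the client evidently never
            # saw the reply to an earlier request.
            rereported = True
        elif state == "in_progress":
            db_results[name] = "over"
            accepted.append(name)

    lost = []
    if rereported:
        held = set(on_host)
        # Anything the database still shows as in progress that the host
        # does not hold was sent in a reply that got lost.
        lost = sorted(
            name for name, state in db_results.items()
            if state == "in_progress" and name not in held
        )
    return accepted, lost
```

Fed the task IDs from the list above (the two completed tasks already "over",
the three 2:03:36 tasks "in_progress", nothing held on the host), the second
request would return no newly accepted results and all three Astropulse tasks
as lost.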

I of course also preferred the sched_result.cpp logic which used to send
messages back to the client saying "Completed result %s refused: result
already reported as success". Changeset 21671 removed that information
and even logging it on the server now needs debug_handle_results set. It's
very much like sweeping dirt under the rug.
--
                                                Joe

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.


