The next part of the story...

This morning as I look, I have two Seti Beta downloads in Retry x:xx:xxx and 
three downloads pending. Uploads are working fine for both projects.

Quick excerpt from the log.

7/23/2009 9:35:56 AM    s...@home Beta Test     Reporting 8 completed 
tasks, requesting new tasks for GPU
7/23/2009 9:35:58 AM            [http_xfer_debug] HTTP: wrote 2742 bytes
7/23/2009 9:35:58 AM            [http_xfer_debug] HTTP: wrote 2872 bytes
7/23/2009 9:35:58 AM            [http_xfer_debug] HTTP: wrote 1436 bytes
7/23/2009 9:35:58 AM            [http_xfer_debug] HTTP: wrote 1436 bytes
7/23/2009 9:35:58 AM            [http_xfer_debug] HTTP: wrote 1436 bytes
7/23/2009 9:35:58 AM            [http_xfer_debug] HTTP: wrote 2412 bytes
7/23/2009 9:36:01 AM    s...@home Beta Test     Scheduler request 
completed: got 1 new tasks
7/23/2009 9:48:04 AM    s...@home Beta Test     [error] File 
09mr09aa.11273.7434.3.13.192 has wrong size: expected 375335, got 0
7/23/2009 9:48:04 AM    s...@home Beta Test     Started download of 
09mr09aa.11273.7434.3.13.192
7/23/2009 9:48:04 AM    s...@home Beta Test     [file_xfer_debug] URL: 
http://boinc2.ssl.berkeley.edu/beta/download/32c/09mr09aa.11273.7434.3.13.192
7/23/2009 9:48:04 AM    s...@home Beta Test     [error] File 
09mr09aa.11273.157778.3.13.66 has wrong size: expected 375335, got 0
7/23/2009 9:48:04 AM    s...@home Beta Test     Started download of 
09mr09aa.11273.157778.3.13.66
7/23/2009 9:48:04 AM    s...@home Beta Test     [file_xfer_debug] URL: 
http://boinc2.ssl.berkeley.edu/beta/download/32a/09mr09aa.11273.157778.3.13.66
7/23/2009 9:48:04 AM            [http_xfer_debug] HTTP: wrote 337 bytes
7/23/2009 9:48:04 AM            [http_xfer_debug] HTTP: wrote 336 bytes
7/23/2009 9:48:05 AM    s...@home Beta Test     [file_xfer_debug] 
FILE_XFER_SET::poll(): http op done; retval -184
7/23/2009 9:48:05 AM    s...@home Beta Test     [file_xfer_debug] 
FILE_XFER_SET::poll(): http op done; retval -184
7/23/2009 9:48:05 AM    s...@home Beta Test     [file_xfer_debug] file 
transfer status -184
7/23/2009 9:48:05 AM    s...@home Beta Test     Temporarily failed download 
of 09mr09aa.11273.7434.3.13.192: HTTP error
7/23/2009 9:48:05 AM    s...@home Beta Test     [file_xfer_debug] 
project-wide xfer delay for 667.000336 sec
7/23/2009 9:48:05 AM    s...@home Beta Test     Backing off 2 hr 57 min 43 
sec on download of 09mr09aa.11273.7434.3.13.192
7/23/2009 9:48:05 AM    s...@home Beta Test     [file_xfer_debug] file 
transfer status -184
7/23/2009 9:48:05 AM    s...@home Beta Test     Temporarily failed download 
of 09mr09aa.11273.157778.3.13.66: HTTP error
7/23/2009 9:48:05 AM    s...@home Beta Test     [file_xfer_debug] 
project-wide xfer delay for 3824.977569 sec
7/23/2009 9:48:05 AM    s...@home Beta Test     Backing off 3 hr 19 min 25 
sec on download of 09mr09aa.11273.157778.3.13.66
7/23/2009 9:52:23 AM    s...@home Beta Test     Computation for task 
09mr09aa.11273.157778.3.13.89_1 finished
7/23/2009 9:52:23 AM    s...@home Beta Test     Starting 
09mr09aa.11273.157778.3.13.77_0
7/23/2009 9:52:23 AM    s...@home Beta Test     Starting task 
09mr09aa.11273.157778.3.13.77_0 using setiathome_enhanced version 608
7/23/2009 9:52:25 AM    s...@home Beta Test     Started upload of 
09mr09aa.11273.157778.3.13.89_1_0
7/23/2009 9:52:25 AM    s...@home Beta Test     [file_xfer_debug] URL: 
http://setiboincdata.ssl.berkeley.edu/beta_cgi/file_upload_handler
7/23/2009 9:52:26 AM            [http_xfer_debug] HTTP: wrote 93 bytes
7/23/2009 9:52:27 AM    s...@home Beta Test     [file_xfer_debug] 
FILE_XFER_SET::poll(): http op done; retval 0
7/23/2009 9:52:27 AM    s...@home Beta Test     [file_xfer_debug] parsing 
upload response: 
<data_server_reply>    <status>0</status> 
<file_size>0</file_size></data_server_reply>
7/23/2009 9:52:27 AM    s...@home Beta Test     [file_xfer_debug] parsing 
status: 0
7/23/2009 9:52:28 AM            [http_xfer_debug] HTTP: wrote 64 bytes
7/23/2009 9:52:29 AM    s...@home Beta Test     [file_xfer_debug] 
FILE_XFER_SET::poll(): http op done; retval 0
7/23/2009 9:52:29 AM    s...@home Beta Test     [file_xfer_debug] parsing 
upload response: <data_server_reply>    <status>0</status></data_server_reply>
7/23/2009 9:52:29 AM    s...@home Beta Test     [file_xfer_debug] parsing 
status: 0
7/23/2009 9:52:29 AM    s...@home Beta Test     [file_xfer_debug] file 
transfer status 0
7/23/2009 9:52:29 AM    s...@home Beta Test     Finished upload of 
09mr09aa.11273.157778.3.13.89_1_0
7/23/2009 9:52:29 AM    s...@home Beta Test     [file_xfer_debug] 
Throughput 20444 bytes/sec
7/23/2009 10:09:21 AM   nque...@home Project    Computation for task 
Nq26_06_20_23_15_09_0 finished
7/23/2009 10:09:23 AM   nque...@home Project    Started upload of 
Nq26_06_20_23_15_09_0_0
7/23/2009 10:09:23 AM   nque...@home Project    [file_xfer_debug] URL: 
http://nqueens.ing.udec.cl/NQueens_cgi/file_upload_handler
7/23/2009 10:09:25 AM           [http_xfer_debug] HTTP: wrote 64 bytes
7/23/2009 10:09:25 AM   nque...@home Project    [file_xfer_debug] 
FILE_XFER_SET::poll(): http op done; retval 0
7/23/2009 10:09:25 AM   nque...@home Project    [file_xfer_debug] parsing 
upload response: <data_server_reply>    <status>0</status></data_server_reply>
7/23/2009 10:09:25 AM   nque...@home Project    [file_xfer_debug] parsing 
status: 0
7/23/2009 10:09:25 AM   nque...@home Project    [file_xfer_debug] file 
transfer status 0
7/23/2009 10:09:25 AM   nque...@home Project    Finished upload of 
Nq26_06_20_23_15_09_0_0
7/23/2009 10:09:25 AM   nque...@home Project    [file_xfer_debug] 
Throughput 97 bytes/sec
7/23/2009 10:15:54 AM   s...@home Beta Test     Computation for task 
09mr09aa.11273.157778.3.13.77_0 finished
7/23/2009 10:15:54 AM   s...@home Beta Test     Starting 
09mr09aa.11273.157778.3.13.87_0
7/23/2009 10:15:54 AM   s...@home Beta Test     Starting task 
09mr09aa.11273.157778.3.13.87_0 using setiathome_enhanced version 608
7/23/2009 10:15:56 AM   s...@home Beta Test     Started upload of 
09mr09aa.11273.157778.3.13.77_0_0
7/23/2009 10:15:56 AM   s...@home Beta Test     [file_xfer_debug] URL: 
http://setiboincdata.ssl.berkeley.edu/beta_cgi/file_upload_handler
7/23/2009 10:15:57 AM           [http_xfer_debug] HTTP: wrote 93 bytes
7/23/2009 10:15:57 AM   s...@home Beta Test     [file_xfer_debug] 
FILE_XFER_SET::poll(): http op done; retval 0
7/23/2009 10:15:57 AM   s...@home Beta Test     [file_xfer_debug] parsing 
upload response: 
<data_server_reply>    <status>0</status> 
<file_size>0</file_size></data_server_reply>
7/23/2009 10:15:57 AM   s...@home Beta Test     [file_xfer_debug] parsing 
status: 0
7/23/2009 10:15:58 AM           [http_xfer_debug] HTTP: wrote 64 bytes
7/23/2009 10:15:58 AM   s...@home Beta Test     [file_xfer_debug] 
FILE_XFER_SET::poll(): http op done; retval 0
7/23/2009 10:15:58 AM   s...@home Beta Test     [file_xfer_debug] parsing 
upload response: <data_server_reply>    <status>0</status></data_server_reply>
7/23/2009 10:15:58 AM   s...@home Beta Test     [file_xfer_debug] parsing 
status: 0
7/23/2009 10:15:58 AM   s...@home Beta Test     [file_xfer_debug] file 
transfer status 0
7/23/2009 10:15:58 AM   s...@home Beta Test     Finished upload of 
09mr09aa.11273.157778.3.13.77_0_0
7/23/2009 10:15:58 AM   s...@home Beta Test     [file_xfer_debug] 
Throughput 78530 bytes/sec
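
The growing delays in the log above ("Backing off 2 hr 57 min ...", 
"project-wide xfer delay for 3824.977569 sec") look like randomized 
exponential backoff. The following is a minimal sketch of that general 
technique, not BOINC's actual code; the constants (floor, cap, jitter 
range) are assumptions for illustration only.

```python
import random

# Assumed constants for illustration -- BOINC's real values may differ.
MIN_DELAY = 60.0          # assumed floor: 1 minute
MAX_DELAY = 4 * 3600.0    # assumed cap: 4 hours

def next_backoff(n_failures: int) -> float:
    """Delay in seconds to wait after the n-th consecutive failure."""
    # Each consecutive failure roughly doubles the base delay, up to a cap.
    base = min(MAX_DELAY, MIN_DELAY * (2.0 ** n_failures))
    # Random jitter keeps thousands of clients from retrying in lockstep
    # against the same upload server.
    return random.uniform(base / 2.0, base)
```

Under a scheme like this, a handful of consecutive failures is enough to 
push a file's retry several hours out, which matches the multi-hour 
backoffs in the log.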




At 03:44 PM 7/22/2009 -0700, Lynn W. Taylor wrote:
>Sounds reasonable to me.  Not sure if that is what was intended.
>
>Al Reust wrote:
>>Seti Beta - NQueens (running backup project) 6.6.38 Aries
>>http://setiathome.berkeley.edu/beta/show_host_detail.php?hostid=22789
>>47 Cuda stuck in "project backoff"
>>[s...@home Beta Test] Backing off 2 hr 6 min 34 sec on upload of 
>>09mr09aa.11273.1299.3.13.117_1_0
>>I could select one result and click Retry Now, it would pop 2 into 
>>Uploading that errored out and extended the retry wait.
>>I clicked Retry Now again on the one below that, 2 popped into uploading 
>>and one got through which started the next one in the Queue.
>>I clicked Retry Now again on the one below that, 2 popped into uploading 
>>both timed out which extended the timer.
>>Okay what next???
>>Do Network communications
>>The ones that had not been extended immediately went into Upload Pending. 
>>It so happens it was coincident with Eric opening the pipe. The first two 
>>got through okay, one of the next pair failed and went into extended 
>>retry. The Next started uploading and got through.
>>It continued until about half got through and what was left were those 
>>waiting for the extended retry.
>>Okay pick the top one and click Retry Now. One got through and the other 
>>went into extended retry.
>>Then the next set of 2 both got through.
>>Do Network Communications
>>I presume that as the last 2 both were met with success, the uploads all 
>>went to Upload Pending and started cycling through the remaining results.
>>Of the 27 remaining, two-thirds were successful and the rest went into an 
>>extended retry.
>>As long as there was one success after a failure, it would proceed to the 
>>next. If there was no success (2 failures), it stopped. So a host with a 
>>very large number could be stuck until it gets a clear connection. Then it 
>>would still have those that had failed and extended the retry time.
>>
>>At 02:29 PM 7/22/2009 -0700, Lynn W. Taylor wrote:
>>>If there is a bug, it may be in how fast the project-wide delay grows.
>>>
>>>I've only had a couple of cycles and it's already 2.7 hours.
>>>
>>>If there are unwarranted bug reports, it is because uploads sit at
>>>"upload pending" and there is no indication in the GUI that there is a
>>>project-wide delay, or when it will finally be lifted.
>>>
>>>-- Lynn
>>>
>>>David Anderson wrote:
>>> > I just checked in the change you describe.
>>> > Actually it was added 4 years ago,
>>> > but was backed out because there were reports that it wasn't working 
>>> right.
>>> >
>>> > This time I added some <file_xfer_debug>-enabled messages,
>>> > so we should be able to track down any problems.
>>> >
>>> > -- David
>>> >
>>> > Lynn W. Taylor wrote:
>>> >> Hi All,
>>> >>
>>> >> I've been watching s...@home during their current challenges, and I
>>> >> think I see something that can be optimized a bit.
>>> >>
>>> >> The problem:
>>> >>
>>> >> Take a very fast machine, something with lots of RAM, a couple of i7
>>> >> processors and several high-end CUDA cards -- a machine that can chew
>>> >> through work units at an amazing rate.
>>> >>
>>> >> It has a big cache.
>>> >>
>>> >> As work is completed, each work unit goes into the transfer queue.
>>> >>
>>> >> BOINC sends each one, and if the upload server is unreachable, each work
>>> >> unit is retried based on the back-off algorithm.
>>> >>
>>> >> If an upload fails, that information does not affect the other running
>>> >> upload timers.
>>> >>
>>> >> In other words, this mega-fast machine could have a lot (hundreds) of
>>> >> pending uploads, and tries every one every few hours.
>>> >>
>>> >> I see two issues:
>>> >>
>>> >> 1) The most important work (the one with the earliest deadline) may be
>>> >> one of the ones that tries the least (longest interval).
>>> >>
>>> >> 2) Retrying hundreds of units adds load to the servers: 180,000-odd
>>> >> clients trying to reach one or two machines at SETI.
>>> >>
>>> >> Optimization:
>>> >>
>>> >> On a failed upload, BOINC could basically treat that as if every upload
>>> >> timed out.  That would reduce the number of attempted uploads from all
>>> >> clients, reducing the load on the servers.
>>> >>
>>> >> Of course, since the odds of a successful upload are just about zero for
>>> >> a work unit that isn't retried, by itself this is a bad idea.
>>> >>
>>> >> So, when any retry timer runs out, instead of retrying that WU, retry
>>> >> the one with the earliest deadline -- the one at the highest risk.
>>> >>
>>> >> As the load drops, work would continue to be uploaded in deadline order
>>> >> until everything is caught up.
>>> >>
>>> >> I know a project can have different upload servers for different
>>> >> applications, or for load balancing, or whatever, so this would only
>>> >> apply to work going to the same server.
>>> >>
>>> >> The same idea could apply to downloads as well.  Does the BOINC client
>>> >> get the deadline from the scheduler??
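
[The proposal quoted above can be sketched roughly as follows. This is a 
hypothetical illustration, not BOINC's data model: keep pending uploads 
grouped per upload server, run one retry timer per server, and when it 
fires, retry the file with the earliest deadline rather than the file whose 
own timer happened to expire. All names here (PendingUpload, ServerQueue) 
are invented for the sketch.

```python
import heapq

class PendingUpload:
    def __init__(self, name: str, deadline: float):
        self.name = name
        self.deadline = deadline

    def __lt__(self, other: "PendingUpload") -> bool:
        # Heap order: earliest deadline first (highest risk of missing it).
        return self.deadline < other.deadline

class ServerQueue:
    """All pending uploads bound for one upload server."""
    def __init__(self):
        self._heap = []
        self.next_retry = 0.0   # single server-wide timer, not one per file

    def add(self, upload: PendingUpload) -> None:
        heapq.heappush(self._heap, upload)

    def pick_retry(self, now: float):
        """On timer expiry, return the most deadline-critical upload."""
        if self._heap and now >= self.next_retry:
            return heapq.heappop(self._heap)
        return None

    def report_failure(self, now: float, delay: float,
                       upload: PendingUpload) -> None:
        # One failure delays the whole server's queue, which is what cuts
        # total retry traffic from hosts with hundreds of pending uploads.
        self.next_retry = now + delay
        self.add(upload)   # failed file goes back into deadline order
```

A host with hundreds of pending uploads would then probe the server once 
per backoff interval instead of once per file, always leading with the 
work nearest its deadline. -- ed.]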
>>> >>
>>> >> Now, if I can figure out how to get a BOINC development environment
>>> >> going, and unless it's just a stupid idea, I'll be glad to take a shot
>>> >> at the code.
>>> >>
>>> >> Comments?
>>> >>
>>> >> -- Lynn
>>> >> _______________________________________________
>>> >> boinc_dev mailing list
>>> >> [email protected]
>>> >> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>>> >> To unsubscribe, visit the above URL and
>>> >> (near bottom of page) enter your email address.
>>> >

