The next part of the story... This morning, as I look, I have two Seti Beta downloads in Retry x:xx:xxx and three downloads pending. Uploads are working fine for both projects.
Quick excerpt from the log:

7/23/2009 9:35:56 AM s...@home Beta Test Reporting 8 completed tasks, requesting new tasks for GPU
7/23/2009 9:35:58 AM [http_xfer_debug] HTTP: wrote 2742 bytes
7/23/2009 9:35:58 AM [http_xfer_debug] HTTP: wrote 2872 bytes
7/23/2009 9:35:58 AM [http_xfer_debug] HTTP: wrote 1436 bytes
7/23/2009 9:35:58 AM [http_xfer_debug] HTTP: wrote 1436 bytes
7/23/2009 9:35:58 AM [http_xfer_debug] HTTP: wrote 1436 bytes
7/23/2009 9:35:58 AM [http_xfer_debug] HTTP: wrote 2412 bytes
7/23/2009 9:36:01 AM s...@home Beta Test Scheduler request completed: got 1 new tasks
7/23/2009 9:48:04 AM s...@home Beta Test [error] File 09mr09aa.11273.7434.3.13.192 has wrong size: expected 375335, got 0
7/23/2009 9:48:04 AM s...@home Beta Test Started download of 09mr09aa.11273.7434.3.13.192
7/23/2009 9:48:04 AM s...@home Beta Test [file_xfer_debug] URL: http://boinc2.ssl.berkeley.edu/beta/download/32c/09mr09aa.11273.7434.3.13.192
7/23/2009 9:48:04 AM s...@home Beta Test [error] File 09mr09aa.11273.157778.3.13.66 has wrong size: expected 375335, got 0
7/23/2009 9:48:04 AM s...@home Beta Test Started download of 09mr09aa.11273.157778.3.13.66
7/23/2009 9:48:04 AM s...@home Beta Test [file_xfer_debug] URL: http://boinc2.ssl.berkeley.edu/beta/download/32a/09mr09aa.11273.157778.3.13.66
7/23/2009 9:48:04 AM [http_xfer_debug] HTTP: wrote 337 bytes
7/23/2009 9:48:04 AM [http_xfer_debug] HTTP: wrote 336 bytes
7/23/2009 9:48:05 AM s...@home Beta Test [file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval -184
7/23/2009 9:48:05 AM s...@home Beta Test [file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval -184
7/23/2009 9:48:05 AM s...@home Beta Test [file_xfer_debug] file transfer status -184
7/23/2009 9:48:05 AM s...@home Beta Test Temporarily failed download of 09mr09aa.11273.7434.3.13.192: HTTP error
7/23/2009 9:48:05 AM s...@home Beta Test [file_xfer_debug] project-wide xfer delay for 667.000336 sec
7/23/2009 9:48:05 AM s...@home Beta Test Backing off 2 hr 57 min 43 sec on download of 09mr09aa.11273.7434.3.13.192
7/23/2009 9:48:05 AM s...@home Beta Test [file_xfer_debug] file transfer status -184
7/23/2009 9:48:05 AM s...@home Beta Test Temporarily failed download of 09mr09aa.11273.157778.3.13.66: HTTP error
7/23/2009 9:48:05 AM s...@home Beta Test [file_xfer_debug] project-wide xfer delay for 3824.977569 sec
7/23/2009 9:48:05 AM s...@home Beta Test Backing off 3 hr 19 min 25 sec on download of 09mr09aa.11273.157778.3.13.66
7/23/2009 9:52:23 AM s...@home Beta Test Computation for task 09mr09aa.11273.157778.3.13.89_1 finished
7/23/2009 9:52:23 AM s...@home Beta Test Starting 09mr09aa.11273.157778.3.13.77_0
7/23/2009 9:52:23 AM s...@home Beta Test Starting task 09mr09aa.11273.157778.3.13.77_0 using setiathome_enhanced version 608
7/23/2009 9:52:25 AM s...@home Beta Test Started upload of 09mr09aa.11273.157778.3.13.89_1_0
7/23/2009 9:52:25 AM s...@home Beta Test [file_xfer_debug] URL: http://setiboincdata.ssl.berkeley.edu/beta_cgi/file_upload_handler
7/23/2009 9:52:26 AM [http_xfer_debug] HTTP: wrote 93 bytes
7/23/2009 9:52:27 AM s...@home Beta Test [file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval 0
7/23/2009 9:52:27 AM s...@home Beta Test [file_xfer_debug] parsing upload response: <data_server_reply> <status>0</status> <file_size>0</file_size></data_server_reply>
7/23/2009 9:52:27 AM s...@home Beta Test [file_xfer_debug] parsing status: 0
7/23/2009 9:52:28 AM [http_xfer_debug] HTTP: wrote 64 bytes
7/23/2009 9:52:29 AM s...@home Beta Test [file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval 0
7/23/2009 9:52:29 AM s...@home Beta Test [file_xfer_debug] parsing upload response: <data_server_reply> <status>0</status></data_server_reply>
7/23/2009 9:52:29 AM s...@home Beta Test [file_xfer_debug] parsing status: 0
7/23/2009 9:52:29 AM s...@home Beta Test [file_xfer_debug] file transfer status 0
7/23/2009 9:52:29 AM s...@home Beta Test Finished upload of 09mr09aa.11273.157778.3.13.89_1_0
7/23/2009 9:52:29 AM s...@home Beta Test [file_xfer_debug] Throughput 20444 bytes/sec
7/23/2009 10:09:21 AM nque...@home Project Computation for task Nq26_06_20_23_15_09_0 finished
7/23/2009 10:09:23 AM nque...@home Project Started upload of Nq26_06_20_23_15_09_0_0
7/23/2009 10:09:23 AM nque...@home Project [file_xfer_debug] URL: http://nqueens.ing.udec.cl/NQueens_cgi/file_upload_handler
7/23/2009 10:09:25 AM [http_xfer_debug] HTTP: wrote 64 bytes
7/23/2009 10:09:25 AM nque...@home Project [file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval 0
7/23/2009 10:09:25 AM nque...@home Project [file_xfer_debug] parsing upload response: <data_server_reply> <status>0</status></data_server_reply>
7/23/2009 10:09:25 AM nque...@home Project [file_xfer_debug] parsing status: 0
7/23/2009 10:09:25 AM nque...@home Project [file_xfer_debug] file transfer status 0
7/23/2009 10:09:25 AM nque...@home Project Finished upload of Nq26_06_20_23_15_09_0_0
7/23/2009 10:09:25 AM nque...@home Project [file_xfer_debug] Throughput 97 bytes/sec
7/23/2009 10:15:54 AM s...@home Beta Test Computation for task 09mr09aa.11273.157778.3.13.77_0 finished
7/23/2009 10:15:54 AM s...@home Beta Test Starting 09mr09aa.11273.157778.3.13.87_0
7/23/2009 10:15:54 AM s...@home Beta Test Starting task 09mr09aa.11273.157778.3.13.87_0 using setiathome_enhanced version 608
7/23/2009 10:15:56 AM s...@home Beta Test Started upload of 09mr09aa.11273.157778.3.13.77_0_0
7/23/2009 10:15:56 AM s...@home Beta Test [file_xfer_debug] URL: http://setiboincdata.ssl.berkeley.edu/beta_cgi/file_upload_handler
7/23/2009 10:15:57 AM [http_xfer_debug] HTTP: wrote 93 bytes
7/23/2009 10:15:57 AM s...@home Beta Test [file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval 0
7/23/2009 10:15:57 AM s...@home Beta Test [file_xfer_debug] parsing upload response: <data_server_reply> <status>0</status> <file_size>0</file_size></data_server_reply>
7/23/2009 10:15:57 AM s...@home Beta Test [file_xfer_debug] parsing status: 0
7/23/2009 10:15:58 AM [http_xfer_debug] HTTP: wrote 64 bytes
7/23/2009 10:15:58 AM s...@home Beta Test [file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval 0
7/23/2009 10:15:58 AM s...@home Beta Test [file_xfer_debug] parsing upload response: <data_server_reply> <status>0</status></data_server_reply>
7/23/2009 10:15:58 AM s...@home Beta Test [file_xfer_debug] parsing status: 0
7/23/2009 10:15:58 AM s...@home Beta Test [file_xfer_debug] file transfer status 0
7/23/2009 10:15:58 AM s...@home Beta Test Finished upload of 09mr09aa.11273.157778.3.13.77_0_0
7/23/2009 10:15:58 AM s...@home Beta Test [file_xfer_debug] Throughput 78530 bytes/sec
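Note how the second failure at 9:48:05 already pushed the project-wide transfer delay from roughly 11 minutes (667 sec) to roughly 64 minutes (3824 sec). For anyone curious how delays in that range accumulate, here is a rough standalone sketch of a randomized exponential backoff. This is not the actual BOINC client code, and the real formula may differ; it just assumes the delay doubles per consecutive failure, gets a random jitter so clients don't all retry at once, and is capped at 4 hours.

// Rough sketch only -- assumed doubling-with-jitter rule, not BOINC's real code.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>

double backoff_delay_sec(int consecutive_failures, std::mt19937& rng) {
    const double min_delay = 60.0;          // assumed 1-minute floor
    const double max_delay = 4.0 * 3600.0;  // assumed 4-hour ceiling
    // Exponential growth before jitter: floor * 2^failures.
    double base = min_delay * std::pow(2.0, consecutive_failures);
    // Jitter spreads clients out so they do not all hit the server at once.
    std::uniform_real_distribution<double> jitter(0.5, 1.5);
    return std::min(max_delay, base * jitter(rng));
}

int main() {
    std::mt19937 rng(std::random_device{}());
    for (int n = 1; n <= 8; n++) {
        std::printf("failure %d: back off %.0f sec\n",
                    n, backoff_delay_sec(n, rng));
    }
    return 0;
}

Under assumptions like these, a host that has already failed a handful of transfers ends up in multi-hour waits like the 2 hr 57 min and 3 hr 19 min backoffs shown above, even after the server comes back.
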
At 03:44 PM 7/22/2009 -0700, Lynn W. Taylor wrote:
>Sounds reasonable to me. Not sure if that is what was intended.
>
>Al Reust wrote:
>>Seti Beta - NQueens (running backup project) 6.6.38 Aries
>>http://setiathome.berkeley.edu/beta/show_host_detail.php?hostid=22789
>>47 CUDA tasks stuck in "project backoff":
>>[s...@home Beta Test] Backing off 2 hr 6 min 34 sec on upload of 09mr09aa.11273.1299.3.13.117_1_0
>>I could select one result and click Retry Now; it would pop 2 into Uploading, which errored out and extended the retry wait.
>>I clicked Retry Now again on the one below that; 2 popped into Uploading and one got through, which started the next one in the queue.
>>I clicked Retry Now again on the one below that; 2 popped into Uploading and both timed out, which extended the timer.
>>Okay, what next?
>>Do Network Communications.
>>The ones that had not been extended immediately went into Upload Pending.
>>It so happens it was coincident with Eric opening the pipe. The first two got through okay, one of the next pair failed and went into extended retry. The next started uploading and got through.
>>It continued until about half got through, and what was left were those waiting for the extended retry.
>>Okay, pick the top one and click Retry Now. One got through and the other went into extended retry.
>>Then the next set of 2 both got through.
>>Do Network Communications.
>>I presume that as the last 2 both were met with success, the uploads all went to Upload Pending and started cycling through the remaining results.
>>Of the 27 remaining, two-thirds were successful and the rest went into an extended retry.
>>As long as there was one success after a failure, it would proceed to the next. If there was no success (2 failures), it stopped. So a host with a very large number could be stuck until it gets a clear connection. Then it would still have those that had failed and extended the retry time.
>>
>>At 02:29 PM 7/22/2009 -0700, Lynn W. Taylor wrote:
>>>If there is a bug, it may be in how fast the project-wide delay grows.
>>>
>>>I've only had a couple of cycles and it's already 2.7 hours.
>>>
>>>If there are unwarranted bug reports, it is because uploads sit at "upload pending" and there is no indication in the GUI that there is a project-wide delay, or when it will finally be lifted.
>>>
>>>-- Lynn
>>>
>>>David Anderson wrote:
>>> > I just checked in the change you describe.
>>> > Actually it was added 4 years ago,
>>> > but was backed out because there were reports that it wasn't working right.
>>> >
>>> > This time I added some <file_xfer_debug>-enabled messages,
>>> > so we should be able to track down any problems.
>>> >
>>> > -- David
>>> >
>>> > Lynn W. Taylor wrote:
>>> >> Hi All,
>>> >>
>>> >> I've been watching s...@home during their current challenges, and I think I see something that can be optimized a bit.
>>> >>
>>> >> The problem:
>>> >>
>>> >> Take a very fast machine, something with lots of RAM, a couple of i7 processors and several high-end CUDA cards -- a machine that can chew through work units at an amazing rate.
>>> >>
>>> >> It has a big cache.
>>> >>
>>> >> As work is completed, each work unit goes into the transfer queue.
>>> >>
>>> >> BOINC sends each one, and if the upload server is unreachable, each work unit is retried based on the back-off algorithm.
>>> >>
>>> >> If an upload fails, that information does not affect the other running upload timers.
>>> >>
>>> >> In other words, this mega-fast machine could have a lot (hundreds) of pending uploads, and tries every one every few hours.
>>> >>
>>> >> I see two issues:
>>> >>
>>> >> 1) The most important work (the one with the earliest deadline) may be one of the ones that is tried the least (longest interval).
>>> >>
>>> >> 2) Retrying hundreds of units adds load to the servers: 180,000-odd clients trying to reach one or two machines at SETI.
>>> >>
>>> >> Optimization:
>>> >>
>>> >> On a failed upload, BOINC could basically treat that as if every upload timed out. That would reduce the number of attempted uploads from all clients, reducing the load on the servers.
>>> >>
>>> >> Of course, since the odds of a successful upload are just about zero for a work unit that isn't retried, by itself this is a bad idea.
>>> >>
>>> >> So, when any retry timer runs out, instead of retrying that WU, retry the one with the earliest deadline -- the one at the highest risk.
>>> >>
>>> >> As the load drops, work would continue to be uploaded in deadline order until everything is caught up.
>>> >>
>>> >> I know a project can have different upload servers for different applications, or for load balancing, or whatever, so this would only apply to work going to the same server.
>>> >>
>>> >> The same idea could apply to downloads as well. Does the BOINC client get the deadline from the scheduler?
>>> >>
>>> >> Now, if I can figure out how to get a BOINC development environment going, and unless it's just a stupid idea, I'll be glad to take a shot at the code.
>>> >>
>>> >> Comments?
>>> >>
>>> >> -- Lynn
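To make the proposal quoted above concrete, here is a rough sketch of the earliest-deadline-first retry selection Lynn describes. This is not the real BOINC implementation; the PendingXfer struct, its field names, and the example URL are made up purely for illustration. The idea is that when any retry timer for a given upload server expires, the client retries the queued transfer for that server whose task deadline is closest, rather than the transfer whose timer happened to fire.

// Illustrative sketch only -- hypothetical types, not BOINC client code.
#include <cstdio>
#include <string>
#include <vector>

struct PendingXfer {          // hypothetical record for a queued upload
    std::string name;         // result/file name
    std::string server_url;   // upload server this file goes to
    time_t deadline;          // report deadline of the associated task
    time_t next_retry;        // when this transfer's own timer expires
};

// When some transfer's retry timer fires, pick the most urgent transfer
// bound for the same server instead of the one whose timer expired.
PendingXfer* pick_retry(std::vector<PendingXfer>& queue,
                        const std::string& server_url) {
    PendingXfer* best = nullptr;
    for (auto& x : queue) {
        if (x.server_url != server_url) continue;   // per-server, as noted above
        if (!best || x.deadline < best->deadline) best = &x;
    }
    return best;   // may be nullptr if nothing is queued for this server
}

int main() {
    // Example queue; URLs and names are placeholders.
    std::vector<PendingXfer> queue = {
        {"result_A_0", "http://upload.example/handler", 1700000000, 0},
        {"result_B_0", "http://upload.example/handler", 1600000000, 0},
    };
    PendingXfer* next = pick_retry(queue, "http://upload.example/handler");
    if (next) std::printf("retry %s first\n", next->name.c_str());
    return 0;
}

Keeping the selection per-server matches Lynn's point that a project can run separate upload servers for different applications or for load balancing, so the deadline ordering would only apply to work going to the same server.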
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and (near bottom of page) enter your email address.