Seti Beta - NQueens (running backup project), 6.6.38, "Aries"
http://setiathome.berkeley.edu/beta/show_host_detail.php?hostid=22789

47 CUDA results stuck in "project backoff":

[s...@home Beta Test] Backing off 2 hr 6 min 34 sec on upload of 09mr09aa.11273.1299.3.13.117_1_0
If I selected one result and clicked Retry Now, two results popped into Uploading, errored out, and the retry wait was extended. I clicked Retry Now on the one below that: two popped into Uploading, one got through, and that started the next one in the queue. I clicked Retry Now on the one below that: two popped into Uploading, both timed out, and the timer was extended again.

Okay, what next? Do Network Communications. The results whose timers had not been extended immediately went into Upload Pending. It so happens this coincided with Eric opening the pipe. The first two got through okay; one of the next pair failed and went into extended retry. The next one started uploading and got through. It continued until about half had gotten through, and what was left were the results waiting out the extended retry. Okay, pick the top one and click Retry Now: one got through and the other went into extended retry. Then the next pair both got through.

Do Network Communications again. I presume that, because the last two both succeeded, the uploads all went to Upload Pending and the client started cycling through the remaining results. Of the 27 remaining, about two thirds were successful and the rest went into extended retry.

As long as there was one success after a failure, it proceeded to the next upload. If there was no success (two failures in a row), it stopped. So a host with a very large number of pending uploads could be stuck until it gets a clear connection, and it would still be left with the results that had failed and extended their individual retry times. (A rough sketch of the per-file versus project-wide backoff behavior, and of the deadline-ordered retry idea, is at the end of this message.)

At 02:29 PM 7/22/2009 -0700, Lynn W. Taylor wrote:
>If there is a bug, it may be in how fast the project-wide delay grows.
>
>I've only had a couple of cycles and it's already 2.7 hours.
>
>If there are unwarranted bug reports, it is because uploads sit at
>"upload pending" and there is no indication in the GUI that there is a
>project-wide delay, or when it will finally be lifted.
>
>-- Lynn
>
>David Anderson wrote:
> > I just checked in the change you describe.
> > Actually it was added 4 years ago,
> > but was backed out because there were reports that it wasn't working right.
> >
> > This time I added some <file_xfer_debug>-enabled messages,
> > so we should be able to track down any problems.
> >
> > -- David
> >
> > Lynn W. Taylor wrote:
> >> Hi All,
> >>
> >> I've been watching s...@home during their current challenges, and I
> >> think I see something that can be optimized a bit.
> >>
> >> The problem:
> >>
> >> Take a very fast machine, something with lots of RAM, a couple of i7
> >> processors and several high-end CUDA cards -- a machine that can chew
> >> through work units at an amazing rate.
> >>
> >> It has a big cache.
> >>
> >> As work is completed, each work unit goes into the transfer queue.
> >>
> >> BOINC sends each one, and if the upload server is unreachable, each work
> >> unit is retried based on the back-off algorithm.
> >>
> >> If an upload fails, that information does not affect the other running
> >> upload timers.
> >>
> >> In other words, this mega-fast machine could have a lot (hundreds) of
> >> pending uploads, and retries every one every few hours.
> >>
> >> I see two issues:
> >>
> >> 1) The most important work (the one with the earliest deadline) may be
> >> one of the ones that retries the least often (longest interval).
> >>
> >> 2) Retrying hundreds of units adds load to the servers: 180,000-odd
> >> clients trying to reach one or two machines at SETI.
> >>
> >> Optimization:
> >>
> >> On a failed upload, BOINC could basically treat that as if every upload
> >> timed out.
> >> That would reduce the number of attempted uploads from all clients,
> >> reducing the load on the servers.
> >>
> >> Of course, since the odds of a successful upload are just about zero for
> >> a work unit that isn't retried, by itself this is a bad idea.
> >>
> >> So, when any retry timer runs out, instead of retrying that WU, retry
> >> the one with the earliest deadline -- the one at the highest risk.
> >>
> >> As the load drops, work would continue to be uploaded in deadline order
> >> until everything is caught up.
> >>
> >> I know a project can have different upload servers for different
> >> applications, or for load balancing, or whatever, so this would only
> >> apply to work going to the same server.
> >>
> >> The same idea could apply to downloads as well. Does the BOINC client
> >> get the deadline from the scheduler?
> >>
> >> Now, if I can figure out how to get a BOINC development environment
> >> going, and unless it's just a stupid idea, I'll be glad to take a shot
> >> at the code.
> >>
> >> Comments?
> >>
> >> -- Lynn
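For what it's worth, here is a rough C++ sketch of the two behaviors as I understand them -- the existing per-file retry backoff and the project-wide upload backoff that was just checked in. This is not the actual client code; the struct names, constants, and layout are made up for illustration.

#include <algorithm>
#include <cmath>

// Per-file retry backoff (existing behavior): a failed upload extends
// only its own timer, so hundreds of pending uploads each retry on
// their own schedule.
struct FileXferBackoff {
    int failures = 0;
    double next_retry_time = 0;          // seconds since the epoch

    void on_failure(double now) {
        failures++;
        // exponential backoff with a cap; the constants are illustrative
        double delay = std::min(4.0 * 3600.0, 60.0 * std::pow(2.0, failures));
        next_retry_time = now + delay;
    }
    void on_success() { failures = 0; next_retry_time = 0; }
};

// Project-wide upload backoff (the checked-in change, as I understand it):
// one shared timer per project.  A single success clears it, so everything
// sitting in Upload Pending becomes eligible again; consecutive failures
// extend it.  That matches what I saw above: one success after a failure
// kept the batch moving, two failures in a row stopped it.
struct ProjectUploadBackoff {
    int consecutive_failures = 0;
    double next_xfer_time = 0;

    bool ok_to_transfer(double now) const { return now >= next_xfer_time; }

    void on_failure(double now) {
        consecutive_failures++;
        double delay = std::min(4.0 * 3600.0,
                                60.0 * std::pow(2.0, consecutive_failures));
        next_xfer_time = now + delay;
    }
    void on_success() { consecutive_failures = 0; next_xfer_time = 0; }
};

If the project-wide delay really does roughly double on each failure, a couple of failed cycles already puts it in the range Lynn mentions (2.7 hours), which is why some indication of it in the GUI would help.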
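And a sketch of Lynn's deadline-ordered retry idea: when any retry timer expires, instead of retrying that particular file, try the pending upload whose result has the earliest report deadline. Again, the types and names here are made up, and it assumes the client knows each result's deadline (which, per Lynn's question, may need to come from the scheduler).

#include <string>
#include <vector>

struct PendingUpload {
    std::string name;
    double report_deadline;              // seconds since the epoch
    bool uploaded = false;
};

// When any retry timer fires, return the index of the not-yet-uploaded
// result at highest risk (earliest deadline), or -1 if nothing is pending.
int next_upload_to_try(const std::vector<PendingUpload>& queue) {
    int best = -1;
    for (int i = 0; i < (int)queue.size(); i++) {
        if (queue[i].uploaded) continue;
        if (best < 0 || queue[i].report_deadline < queue[best].report_deadline) {
            best = i;
        }
    }
    return best;
}

As Lynn says, this would only apply within one upload server, and it only changes which file gets tried when a timer fires, not how often the host contacts the server.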
