Re: [boinc_dev] Spontaneous "Client detached"

Josef W. Segur Mon, 27 Jul 2009 22:05:15 -0700

Thanks, David. I'm not even sure the problem occurs more frequently
when there's high DB load; users may simply be more likely to notice
the problem at such times. In any case, the test also handles possible
client or communication errors.


Meanwhile, I had another reply which didn't make it to the list, from
Jonathan Hoser of BoincSIMAP. I've included it below for reference, and
another S@H participant linked the thread, which gives even more detail:

http://boinc.bio.wzw.tum.de/boincsimap/forum/viewtopic.php?f=3&t=853

Included in that are two logs from Scheduler processes; one ran for over
30000 seconds, the other for over 14000. Both concluded that the RPC seqno
was too low. That check comes after a successful host and user lookup, so
either those lookups or something else obviously took too long. The client
would have timed out the connection after 300 seconds and made additional
requests as needed. Perhaps the Scheduler should check how much time has
passed since it started to handle the request before deciding:

// If the seqno from the host is less than what we expect,      
// the user must have copied the state file to a different host.
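
As a rough sketch of that idea (not the actual scheduler code): the
function, enum and constant names below are hypothetical, and the 300
second limit is simply the client timeout mentioned above. A low seqno is
only treated as a copied state file if the request is still fresh.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical names; a sketch, not the actual BOINC scheduler code.
enum class SeqnoVerdict { Ok, StaleRequest, StateFileCopied };

// Assumption: the client gives up on a connection after 300 seconds.
constexpr std::chrono::seconds kClientTimeout{300};

// elapsed = time since this scheduler process began handling the request.
SeqnoVerdict classify_seqno(int request_seqno, int expected_seqno,
                            std::chrono::seconds elapsed) {
    assert(elapsed.count() >= 0);
    if (request_seqno >= expected_seqno) {
        return SeqnoVerdict::Ok;
    }
    // A low seqno on a request older than the client timeout most likely
    // means the client has already retried, not that the state file moved.
    if (elapsed > kClientTimeout) {
        return SeqnoVerdict::StaleRequest;
    }
    return SeqnoVerdict::StateFileCopied;
}
```

With something like this, a request that has been held for hours would be
dropped as stale rather than causing the host's results to be marked
"Client detached".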

I haven't checked all the details, but I suppose that when the client times
out and closes the connection, that should somehow cause the Scheduler
process to be closed. That clearly didn't happen in those cases, and
maybe that points to something unique to Macs and BOINC 6.2.18, though
that could just have been coincidence too. The most recent cases at
S@H are on Windows hosts.
-- 
                                                           Joe



On 26 Jul 2009 at 21:51, David wrote:

> I'll put in that test.
> However it's not clear that's the problem;
> high DB load would not cause host lookup by ID to fail.
> -- David



On 27 Jul 2009 at 9:54, Jonathan wrote:

> Hi!
> Just read this.. and have a few questions/ideas:
> 
> Is this happening on every kind of computer?
> I'm asking because here with SIMAP we see such an incident every
> now and then,
> with a Mac connecting to the scheduler and then getting a reattachment.
> 
> We have traced it to a *very* long-running scheduler request that seems
> to be on hold for several hours.
> During that time, the Mac makes some more scheduler requests, increasing
> the request_sequence_id.
> Then suddenly (why, we don't know or understand, and can't even guess)
> the Mac picks up the long-standing scheduler request,
> which suddenly returns and complains about out-of-order
> request_sequence_ids, effectively detaching the host.
> 
> Might this be the case on Seti too?
> 
> However, I don't see a chance for the Seti guys to track this down,
> because it took me, with our considerably smaller
> database/hostcount/workload,
> about three months to track it down. Though it only took that long
> because of our cyclic work-distribution scheme.
> 
> Best
> -Jonathan
> from the BoincSIMAP team


 
> Josef W. Segur wrote:
> > Every once in a while, a user will note in the S@H Number Crunching forum
> > that work being processed on a host has been marked "Client detached" in
> > the task lists. It happens often enough to be a familiar complaint,
> > seemingly most often when the BOINC database is most heavily loaded. As
> > a pure guess, finding a host by its hostid might fail under those
> > circumstances.
> > 
> > Looking through the authenticate_user() code in handle_request.cpp, the
> > logic when hostid is either missing or invalid but the host is found
> > based on host_cpid treats that as sufficient to conclude the host has
> > detached and reattached. I suggest adding a test for
> > (g_request->other_results.size() == 0). It should be unconditionally true
> > for an actual detach/reattach; in any other case where it's true, it does
> > no harm to mark any results "Client detached" because the host didn't know
> > of them.
> > 
> > The same test could also be used when both hostid and host_cpid have failed
> > to locate the host, but find_host_by_other() finds a close match. I don't
> > know whether that's a good idea; that matching seems adequate, but having
> > failed the "id" methods seems to imply a sufficiently bad situation that a
> > fresh start may be needed. Judgement call, I guess.
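
For reference, the other_results test suggested above might look roughly
like the sketch below. The struct and function names are hypothetical
stand-ins, not the real types in handle_request.cpp; only other_results,
hostid and host_cpid come from the discussion.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for the scheduler's request types.
struct OtherResult {
    std::string name;
};

struct SchedulerRequest {
    int hostid = 0;
    std::string host_cpid;
    std::vector<OtherResult> other_results;  // results the client still holds
};

// Decide whether a host that could not be found by hostid, but was found by
// host_cpid, should be treated as detached and reattached.
bool looks_like_reattach(const SchedulerRequest& req,
                         bool found_by_hostid, bool found_by_cpid) {
    if (found_by_hostid) {
        return false;  // normal case, nothing to decide
    }
    if (!found_by_cpid) {
        return false;  // other fallbacks are out of scope for this sketch
    }
    // A freshly attached client reports no results in progress. If the
    // client still lists results, the failed lookup was probably an error.
    return req.other_results.empty();
}
```

Only when this returns true would the host's outstanding results be
marked "Client detached".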
