Thanks, David. I'm not even sure the problem occurs more frequently when there's high DB load; users may simply be more likely to notice the problem at such times. In any case, the test also handles possible client or communication errors.
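As a rough sketch of that test (the one proposed in the quoted message further down), the guard could look something like the following. Only the other_results member name comes from the thread; the struct definitions and the helper name looks_like_reattach are hypothetical stand-ins, not the actual code in handle_request.cpp.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for the scheduler's request types; only the
// other_results member name is taken from the thread.
struct OtherResult {
    std::string name;
};

struct SchedulerRequest {
    // Results the client reports it still has on hand.
    std::vector<OtherResult> other_results;
};

// Hypothetical helper: returns true when a host that was located by
// host_cpid (rather than by hostid) reports no results at all, which is
// what a genuine detach/reattach would look like.
bool looks_like_reattach(const SchedulerRequest& req) {
    return req.other_results.empty();
}
```

If the helper returns false, the host still knows about its work, so marking those results "Client detached" would presumably be the wrong call.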
Meanwhile, I had another reply which didn't make it to the list, from Jonathan Hoser of BoincSIMAP. I've included it below for reference, and another s...@h participant linked the thread which gives even more detail:

http://boinc.bio.wzw.tum.de/boincsimap/forum/viewtopic.php?f=3&t=853

Included in that are two logs from Scheduler processes: one ran for over 30000 seconds, the other for over 14000. Both concluded that the RPC seqno was too low. That check comes after a successful host and user lookup, so either those lookups or something else obviously took too long. The client would have timed out the connection after 300 seconds and done additional requests as needed. Perhaps the Scheduler should check how much time has passed since it started to handle the request before deciding:

    // If the seqno from the host is less than what we expect,
    // the user must have copied the state file to a different host.

I haven't checked all the details, but I suppose that when the client times out and closes the connection, that should somehow cause the Scheduler process to be closed. That clearly didn't happen in those cases, and maybe that points to something unique to Macs and BOINC 6.2.18, though that could just have been coincidence too. The most recent cases at s...@h are on Windows hosts.
--
Joe

On 26 Jul 2009 at 21:51, David wrote:
> I'll put in that test.
> However it's not clear that's the problem;
> high DB load would not cause host lookup by ID to fail.
> -- David

On 27 Jul 2009 at 9:54, Jonathan wrote:
> Hi!
> Just read this.. and have a few questions/ideas:
>
> Is this happening throughout every kind of computer?
> I'm asking, because here with SIMAP, we see one such happening every
> now and then,
> with a Mac connecting to the scheduler and then getting a reattachment.
>
> We have traced it to a *very* long running scheduler request that seems
> to be on hold for several hours;
> during that time, the Mac makes some more scheduler requests, increasing
> the request_sequence_id.
> Then suddenly (why, we don't know or understand, or can even guess)
> the Mac picks up the long-standing scheduler request,
> which suddenly returns and complains about not-in-order
> request_sequence_ids, effectively detaching the host.
>
> Might this be the case on Seti too?
>
> However, I don't see a chance for the seti-guys to track this down,
> because it took me - with our considerably smaller
> database/hostcount/workload -
> about three months to track it down. Though it was only this lengthy
> because of our cyclic work-distribution scheme.
>
> Best
> -Jonathan
> from the BoincSIMAP team
>
> Josef W. Segur wrote:
> > Every once in a while, a user will note in the s...@h Number Crunching forum
> > that work being processed on a host has been marked "Client detached" in
> > the task lists. It happens often enough to be a familiar complaint,
> > seemingly most often when the BOINC database is most heavily loaded. As
> > a pure guess, finding a host by its hostid might fail under those
> > circumstances.
> >
> > Looking through the authenticate_user() code in handle_request.cpp, the
> > logic when hostid is either missing or invalid but the host is found
> > based on host_cpid assumes that's sufficient to assume the host has
> > detached and reattached. I suggest adding a test for
> > (g_request->other_results.size() == 0). It should be unconditionally true
> > for an actual detach/reattach, otherwise if it's true it does no harm to
> > mark any results "Client detached" because the host didn't know of them.
> >
> > The same test could also be used when both hostid and host_cpid have failed
> > to locate the host, but find_host_by_other() finds a close match.
> > I don't know whether that's a good idea; that matching seems adequate, but
> > having failed the "id" methods seems to imply a sufficiently bad situation
> > that a fresh start may be needed. Judgement call, I guess.

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
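
The elapsed-time check suggested earlier in this message (before the quoted replies) could be sketched roughly as below. This is only an illustration under stated assumptions: the 300-second figure is the client timeout mentioned above, while the names check_seqno, SeqnoVerdict and kClientTimeout are hypothetical and do not come from the BOINC source.

```cpp
#include <chrono>

// Assumed client-side connection timeout (the 300 seconds cited above).
constexpr std::chrono::seconds kClientTimeout{300};

// Hypothetical outcome of the sequence-number check.
enum class SeqnoVerdict { Ok, StaleRequest, CopiedStateFile };

// Hypothetical helper: classify a request given the seqno it carried,
// the seqno the server expects, and how long the scheduler has already
// spent handling this request.
SeqnoVerdict check_seqno(int request_seqno, int expected_seqno,
                         std::chrono::seconds elapsed) {
    if (request_seqno >= expected_seqno) {
        return SeqnoVerdict::Ok;
    }
    // The seqno is too low. If this request has been in flight longer
    // than the client would have waited, the client has most likely
    // timed out and sent newer requests, so a copied state file is not
    // the only explanation.
    if (elapsed > kClientTimeout) {
        return SeqnoVerdict::StaleRequest;
    }
    // Otherwise fall back to the existing interpretation.
    return SeqnoVerdict::CopiedStateFile;
}
```

In a real scheduler the elapsed value would presumably come from a timestamp taken when handling of the request began, for example by subtracting a saved std::chrono::steady_clock::now() from the current one. A StaleRequest verdict could then be dropped quietly instead of detaching the host.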
