On 6/13/10, Rod Walker <[email protected]> wrote:
> Hi,
> I submitted 100 boinc jobs to the batch system, where over 100 slots
> were free. This meant they started as soon as the SGE scheduler assigned
> WNs, and all within 2 seconds. Of the 100, 15 failed with the "Another
> scheduler instance is running" message, so I think this is consistent
> with the RPC in progress hypothesis. The contact with the project server
> seems to take around 3 seconds:
>
> 11-Jun-2010 16:05:59 [http://www.worldcommunitygrid.org/] Sending scheduler request: Project initialization.
> 11-Jun-2010 16:05:59 [http://www.worldcommunitygrid.org/] Requesting new tasks
> 11-Jun-2010 16:06:02 [World Community Grid] Scheduler request completed: got 0 new tasks
> 11-Jun-2010 16:06:02 [World Community Grid] Message from project server: Another scheduler instance is running for this host
>
> I think I can reduce the 15% failure rate to something negligible by
> adding a random sleep. Trying this, I notice that 33 of 100 fail with a
> new error:
>
> [World Community Grid] Message from project server: Not sending work - last request too recent: 5 sec
>
> Looking back at my first 100-job test, there were 5 of these too. I don't
> suppose anyone knows what the minimum time between requests is? I'll add
> a sleep and run boinc again if the first attempt fails quickly (i.e. with
> one of these messages).
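
For reference, the sleep-and-retry idea quoted above could be sketched roughly as follows. This is only an illustration: run_client is a hypothetical callable that starts the client once and returns its log text, and min_wait and jitter are guesses, since the server's real minimum interval between requests isn't known here.

```python
import random
import time

# The two refusal messages quoted above.
RETRY_MESSAGES = (
    "Another scheduler instance is running for this host",
    "Not sending work - last request too recent",
)


def should_retry(log_text):
    """Return True if the client log contains one of the refusal messages."""
    return any(msg in log_text for msg in RETRY_MESSAGES)


def run_with_retry(run_client, max_attempts=3, min_wait=30.0, jitter=30.0,
                   sleep=time.sleep):
    """Call run_client() until its log shows no refusal or attempts run out.

    run_client is a hypothetical callable (not part of BOINC) that runs the
    client once and returns its log output as a string. min_wait and jitter
    are assumed values in seconds, not documented server limits.
    Returns (number of attempts made, last log text).
    """
    log_text = ""
    for attempt in range(1, max_attempts + 1):
        log_text = run_client()
        if not should_retry(log_text):
            return attempt, log_text
        if attempt < max_attempts:
            # Random extra delay so the jobs do not all retry at the same time.
            sleep(min_wait + random.uniform(0.0, jitter))
    return max_attempts, log_text
```

Note that this only spaces the requests out; it does nothing about the clients sharing a host ID.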

Having all the clients using the same host ID is a bad idea anyway.

For example, if the project has "resend lost work" enabled (I don't
know if WCG does), the first client will get some work, and when the
second client asks for work, the server will think the client "lost"
the previous work and send the *same* work again. So you'd have all
your cluster nodes processing the same tasks!

I don't know why all your clients are getting the same host ID. Make
sure all clients are using a separate data directory, and that the
directories *don't* have a client_state.xml when the clients first
run.
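
As a rough sketch of what I mean, a wrapper in each batch job could set up its own directory before starting the client. The helper names and the boinc_job_ prefix are made up for illustration, and I'm assuming the client is pointed at its data directory with --dir; check the options of your client version.

```python
import os


def prepare_data_dir(base_dir, job_id):
    """Create a per-job data directory and refuse to reuse old client state.

    base_dir and job_id are supplied by the caller, e.g. a scratch path and
    the batch system's job identifier (both hypothetical here).
    """
    data_dir = os.path.join(base_dir, "boinc_job_%s" % job_id)
    os.makedirs(data_dir, exist_ok=True)
    state_file = os.path.join(data_dir, "client_state.xml")
    if os.path.exists(state_file):
        # A leftover state file would carry over a previous client's state.
        raise RuntimeError("stale client_state.xml in %s" % data_dir)
    return data_dir


def client_command(data_dir, client_binary="boinc"):
    """Build the client command line for the given data directory."""
    # Assumption: the client takes its data directory via --dir.
    return [client_binary, "--dir", data_dir]
```

Each job would call prepare_data_dir() with its own job ID and then launch the command returned by client_command(), so no two clients share a directory or start from an existing client_state.xml.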

@David: does the server use the hostname to decide if the machine is
the "same" and reuse host ID?

-- 
Nicolas
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
