I found the source of the problem. When I ran the upgrade, a fresh host_app_version table was created (I upgraded from a much older revision), so every host had its consecutive_valid set to 0. This happened just after a high-priority batch of tasks entered the server's queue, so almost everything in the feeder queue was high priority. Any task with priority higher than 0 is sent only to reliable hosts; since there were no reliable hosts (consecutive_valid >= 10 is required), the queue got stuck. That is also why I thought the slightly older scheduler (22488) worked: when I tested it, the server still had a small number of low-priority tasks available.
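To illustrate the effect, here is a minimal sketch in Python. It is a simplified model of the behaviour described above, not the actual scheduler code: the class and function names are made up for illustration, and only the threshold of 10 and the "priority higher than 0" rule come from what I observed.

```python
from dataclasses import dataclass

# Threshold seen in the scheduler log ("cons valid 0 < 10").
RELIABLE_MIN_CONSECUTIVE_VALID = 10


@dataclass
class HostAppVersion:
    """Hypothetical stand-in for one row of host_app_version."""
    host_id: int
    consecutive_valid: int


@dataclass
class Task:
    """Hypothetical stand-in for a task in the feeder queue."""
    name: str
    priority: int


def is_reliable(hav: HostAppVersion) -> bool:
    # A host counts as reliable once it has enough consecutive valid results.
    return hav.consecutive_valid >= RELIABLE_MIN_CONSECUTIVE_VALID


def sendable(task: Task, hav: HostAppVersion) -> bool:
    # Assumption: tasks with priority > 0 are restricted to reliable hosts,
    # while priority-0 tasks can go to any host.
    if task.priority > 0:
        return is_reliable(hav)
    return True


def pick_work(queue: list[Task], hav: HostAppVersion) -> list[Task]:
    # Return the tasks from the queue that this host is allowed to receive.
    return [task for task in queue if sendable(task, hav)]
```

With a queue that holds only high-priority tasks and a host whose consecutive_valid was just reset to 0, pick_work returns an empty list, which matches the "0 results" replies in the log. As soon as a priority-0 task is in the queue, the same host gets it, which is what made revision 22488 look fine during my test.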
/TJM
http://www.enigmaathome.net

On Sat, 23 Oct 2010 10:39:37 -0700 David Anderson <[email protected]> wrote:
>I'm not seeing any problem with the scheduler in
>the latest trunk revision (22593). Try that.
>-- David
>
>On 23-Oct-2010 8:43 AM, Slawomir Rzeznicki wrote:
>> Hello,
>>
>> Recently I upgraded my server to revision 22566 which seems to have seriously bugged scheduler.
>> I enabled most (if not all) of the sched debug logs - I can't find anything unusual there, however it replies with no work available to all work requests, just like it would with empty feeder queue.
>> I've checked the feeder already and it seems to work fine, the queue is filled with workunits and so is shmem.
>> Also, various people reported that sched does not accept any work reported back to the server, however I can't confirm it right now because I haven't seen any logs yet and I don't have tasks left to report myself.
>>
>> I'd appreciate any suggestions on how to debug this further, right now the only thing I know that the bug must be somewhere between changesets 22488 and 22566, because 22488 works for sure.
>> Below is a sample from sched log after I did a request from one of my PCs.
>>
>> 2010-10-23 10:08:05.6313 [PID=7397 ] Request: [USER#1] [HOST#3757] [IP 69.12.216.209] client 6.12.4
>> 2010-10-23 10:08:05.6319 [PID=7397 ] [send] Not using matchmaker scheduling; Not using EDF sim
>> 2010-10-23 10:08:05.6320 [PID=7397 ] [send] CPU: req 97397.08 sec, 2.00 instances; est delay 0.00
>> 2010-10-23 10:08:05.6320 [PID=7397 ] [send] CUDA: req 0.00 sec, 0.00 instances; est delay 0.00
>> 2010-10-23 10:08:05.6320 [PID=7397 ] [send] work_req_seconds: 97397.08 secs
>> 2010-10-23 10:08:05.6320 [PID=7397 ] [send] available disk 54.25 GB, work_buf_min 0
>> 2010-10-23 10:08:05.6320 [PID=7397 ] [send] active_frac 0.999270 on_frac 0.992676
>> 2010-10-23 10:08:05.6320 [PID=7397 ] Anonymous platform app versions:
>> 2010-10-23 10:08:05.6320 [PID=7397 ] app: enigma_m4_2 version 522 cpus 1.00 cudas 0.00 atis 0.00 flops 3.482478G
>> 2010-10-23 10:08:05.6324 [PID=7397 ] [send] [AV#6000002] not reliable; cons valid 0< 10
>> 2010-10-23 10:08:05.6324 [PID=7397 ] [send] set_trust: cons valid 0< 10, don't use single replication
>> 2010-10-23 10:08:05.6525 [PID=7397 ] Sending reply to [HOST#3757]: 0 results, delay req 181.80
>> 2010-10-23 10:08:05.6528 [PID=7397 ] Scheduler ran 0.026 seconds
>>
>> /TJM
>> http://www.enigmaathome.net
>>
>> _______________________________________________
>> boinc_dev mailing list
>> [email protected]
>> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>> To unsubscribe, visit the above URL and
>> (near bottom of page) enter your email address.
