Re: [boinc_dev] 6.6.36, host.active_frac = 100% means no work?

Ian Hay Tue, 23 Jun 2009 17:09:58 -0700

Jorden van der Elst wrote on 23/06/2009 21:03:

Is it possible that there's a bug in BOINC where when host.active_frac
= 1 (100%) it won't request any work?
I ask this as on DrugDiscovery we've seen the following happen to
people (and to myself);


Host asks for work and gets as a message:
23-Jun-09 21:36:02 DrugDiscovery Message from server: GROMACS with
Nvidia GPU is not available for your type of computer.
23-Jun-09 21:36:02 DrugDiscovery Message from server: (won't finish in
time) BOINC runs 98.3% of time, computation enabled 100.0% of that

I don't think the GPU message has anything to do with it as that was
something people with CUDA GPUs got as well. It's the second line.
Now, stupid me, I didn't log anything at the time, or save the
relevant lines from client_state.xml or any of the sched_reply or
sched_request files... I just reset the project and then got work.

After the last sched_request my numbers are:
    <on_frac>0.974275</on_frac>
    <connected_frac>0.967568</connected_frac>
    <active_frac>0.968222</active_frac>

So I am assuming that any number under 100% will do, but 100% will
stop work from getting in.
Trouble is that I don't see it reflected in the source code. The only
reference is when host.active_frac is above 100% (or > 1), to show a
message that this is something of an impossibility and that we're
resetting to 100% (or 1). ;-)

Any ideas?

My CPU only quad core system was having the same problem getting work from malariacontrol.net. The system had been starved of MCDN work for over 2 weeks and a new batch of work was generated yesterday.

The debug output in the attached file stdoutdae_wfd.txt covers a complete work fetch cycle when a scheduler request was sent to MCDN. Unfortunately I didn't capture the scheduler request relating to that particular work fetch cycle, but sched_request_wfd_xml.txt has the relevant information from a later request to MCDN (when I'd disabled the debug output).


Resource shares for the 4 active projects on the system are:

600 for CPDN (3 tasks in progress; 53, 73 and 1248 hours to completion)
600 for CPDN Beta (2 tasks in progress; 64 and 180 hours to completion)
100 for WCG (1 task in progress; 5 hours to completion)
100 for MCDN (work starved)

None of the 6 runnable tasks were anywhere near having deadline problems (9, 26 and 28 months for the CPDN tasks; 9 months for both CPDN Beta tasks and 9 days for the WCG task).

Both files show the core client requesting 6171.43 seconds of work from MCDN. The project's DCF would have estimated 90 minutes CPU time for a single task, so I'd have expected the scheduler to allocate 2 tasks (10800 seconds of work). MCDN has a relative resource share of just over 7%, and with 4 cores available that 10800 seconds of work ought to have been completed in less than 39000 seconds of wall time (around 11 hours) with standard round robin scheduling. That's well within the (approx) 3.5 day deadline for MCDN tasks.
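
As a back-of-the-envelope check of those figures, here is a minimal Python sketch. It is not BOINC's scheduler logic: the function name is made up for illustration, and it assumes idealised round robin in which MCDN gets exactly its 100/1400 share of all 4 cores. The 90 minute task estimate and the 3.5 day deadline are the approximate figures quoted above.

```python
import math

def wall_time_for_work(cpu_seconds, share_fraction, ncpus):
    """Wall time needed to accumulate cpu_seconds of CPU time when a project
    receives share_fraction of ncpus cores (idealised round robin)."""
    return cpu_seconds / (share_fraction * ncpus)

requested = 6171.43        # seconds requested by the client (from the log)
task_estimate = 90 * 60    # assumed estimate for one MCDN task, in seconds
tasks = math.ceil(requested / task_estimate)  # tasks needed to cover the request
work = tasks * task_estimate                  # total CPU seconds allocated
share = 100 / 1400         # MCDN share among the 4 active projects
wall = wall_time_for_work(work, share, ncpus=4)
deadline = 3.5 * 86400     # approximate MCDN deadline, in seconds

print(tasks, work, round(wall), wall < deadline)
```

Under these assumptions the request rounds up to 2 tasks, and the wall time needed comes out far below the deadline, which is why the refusal looks wrong from the client's side.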

Assuming the MCDN scheduler hasn't been set up to do workload simulation the problem is likely to be down to the <estimated_delay> value included in the scheduler request. In my case that would have been 346143.49 from line 48 of the attached debug.
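
To illustrate how <estimated_delay> alone could trigger "won't finish in time", here is a rough Python sketch. It assumes the scheduler applies a simple test of the form "estimated delay plus estimated run time exceeds the deadline window". The function name and the exact form of the test are assumptions for illustration and are not taken from the server source. The 90 minute run time and 3.5 day deadline are the approximate figures mentioned above.

```python
def wont_finish_in_time(estimated_delay, est_runtime, delay_bound):
    """Hypothetical feasibility test: the task is rejected if it cannot start
    until estimated_delay seconds from now and would then overrun the
    deadline window (delay_bound seconds)."""
    return estimated_delay + est_runtime > delay_bound

delay_bound = 3.5 * 86400   # approx. MCDN deadline window: 302400 seconds
est_runtime = 90 * 60       # assumed estimate for one MCDN task: 5400 seconds

# Value reported by the client in the debug log.
print(wont_finish_in_time(346143.49, est_runtime, delay_bound))      # True
# Value from the later scheduler request.
print(wont_finish_in_time(329866.255926, est_runtime, delay_bound))  # True
# With no estimated delay the same task would be accepted.
print(wont_finish_in_time(0.0, est_runtime, delay_bound))            # False
```

Both reported delays are already longer than the whole 3.5 day window on their own, so under this assumed test no MCDN task could ever be judged feasible, whatever its run time.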

I eventually managed to force download of work from MCDN by the simple act of suspending the CPDN task which has over 2 years to complete an estimated 52 days of work at a relative resource share of just under 43%.


--
Ian

23-Jun-2009 13:21:47 [climateprediction.net] [debt] CPU debt -158784.24 delta 
43.25 share frac 0.43 (600.00/1400.00) secs 242.19 rsc_secs 60.55
23-Jun-2009 13:21:47 [CPDN Beta] [debt] CPU debt -17.30 delta -17.30 share frac 
0.43 (600.00/1400.00) secs 242.19 rsc_secs 121.09
23-Jun-2009 13:21:47 [or...@home] [debt] CPU ineligible; debt -0.00
23-Jun-2009 13:21:47 [...@home] [debt] CPU ineligible; debt -0.00
23-Jun-2009 13:21:47 [malariacontrol.net] [debt] CPU debt 17.30 delta 17.30 
share frac 0.07 (100.00/1400.00) secs 242.19 rsc_secs 0.00
23-Jun-2009 13:21:47 [World Community Grid] [debt] CPU debt -90392.21 delta 
-43.25 share frac 0.07 (100.00/1400.00) secs 242.19 rsc_secs 60.55
23-Jun-2009 13:21:47 [---] [debt] CPU debt: adding offset -17.30
23-Jun-2009 13:21:47 [---] [cpu_sched_debug] enforce_schedule(): start
23-Jun-2009 13:21:47 [---] [cpu_sched_debug] preliminary job list:
23-Jun-2009 13:21:47 [CPDN Beta] [cpu_sched_debug] 0: 
hadcm3l_cobg_2000_2_000013347_4
23-Jun-2009 13:21:47 [climateprediction.net] [cpu_sched_debug] 1: 
hadam3p_monx_1994_2_006133799_1
23-Jun-2009 13:21:47 [CPDN Beta] [cpu_sched_debug] 2: 
hadcm3l_ckgv_2000_2_000013319_0
23-Jun-2009 13:21:47 [climateprediction.net] [cpu_sched_debug] 3: 
hadcm3istd_cq3s_1920_160_06017964_6
23-Jun-2009 13:21:47 [---] [cpu_sched_debug] final job list:
23-Jun-2009 13:21:47 [CPDN Beta] [cpu_sched_debug] 0: 
hadcm3l_cobg_2000_2_000013347_4
23-Jun-2009 13:21:47 [World Community Grid] [cpu_sched_debug] 1: 
R00295_9e09e8ff49b4b9130f99f1075fb37832_00_7
23-Jun-2009 13:21:47 [climateprediction.net] [cpu_sched_debug] 2: 
hadam3p_monx_1994_2_006133799_1
23-Jun-2009 13:21:47 [CPDN Beta] [cpu_sched_debug] 3: 
hadcm3l_ckgv_2000_2_000013319_0
23-Jun-2009 13:21:47 [climateprediction.net] [cpu_sched_debug] 4: 
hadcm3istd_cq3s_1920_160_06017964_6
23-Jun-2009 13:21:47 [CPDN Beta] [cpu_sched_debug] scheduling 
hadcm3l_cobg_2000_2_000013347_4
23-Jun-2009 13:21:47 [World Community Grid] [cpu_sched_debug] scheduling 
R00295_9e09e8ff49b4b9130f99f1075fb37832_00_7
23-Jun-2009 13:21:47 [climateprediction.net] [cpu_sched_debug] scheduling 
hadam3p_monx_1994_2_006133799_1
23-Jun-2009 13:21:47 [CPDN Beta] [cpu_sched_debug] scheduling 
hadcm3l_ckgv_2000_2_000013319_0
23-Jun-2009 13:21:47 [climateprediction.net] [cpu_sched_debug] all CPUs used, 
skipping hadcm3istd_cq3s_1920_160_06017964_6
23-Jun-2009 13:21:47 [climateprediction.net] [cpu_sched_debug] 
hadcm3istd_cq3s_1920_160_06017964_6 sched state 1 next 1 task state 9
23-Jun-2009 13:21:47 [climateprediction.net] [cpu_sched_debug] 
hadcm3istd_crbg_1920_160_06019536_0 sched state 1 next 1 task state 9
23-Jun-2009 13:21:47 [CPDN Beta] [cpu_sched_debug] 
hadcm3l_cobg_2000_2_000013347_4 sched state 2 next 2 task state 1
23-Jun-2009 13:21:47 [CPDN Beta] [cpu_sched_debug] 
hadcm3l_ckgv_2000_2_000013319_0 sched state 2 next 2 task state 1
23-Jun-2009 13:21:47 [climateprediction.net] [cpu_sched_debug] 
hadam3p_monx_1994_2_006133799_1 sched state 2 next 2 task state 1
23-Jun-2009 13:21:47 [World Community Grid] [cpu_sched_debug] 
R00295_9e09e8ff49b4b9130f99f1075fb37832_00_7 sched state 2 next 2 task state 1
23-Jun-2009 13:21:47 [---] [cpu_sched_debug] enforce_schedule: end
23-Jun-2009 13:21:48 [---] [rr_sim] rr_sim start: work_buf_total 86400.00
23-Jun-2009 13:21:48 [climateprediction.net] [rr_sim] 0.00: starting 
hadcm3istd_cq3s_1920_160_06017964_6
23-Jun-2009 13:21:48 [climateprediction.net] [rr_sim] 0.00: starting 
hadcm3istd_crbg_1920_160_06019536_0
23-Jun-2009 13:21:48 [climateprediction.net] [rr_sim] 0.00: starting 
hadam3p_monx_1994_2_006133799_1
23-Jun-2009 13:21:48 [CPDN Beta] [rr_sim] 0.00: starting 
hadcm3l_cobg_2000_2_000013347_4
23-Jun-2009 13:21:48 [CPDN Beta] [rr_sim] 0.00: starting 
hadcm3l_ckgv_2000_2_000013319_0
23-Jun-2009 13:21:48 [World Community Grid] [rr_sim] 0.00: starting 
R00295_9e09e8ff49b4b9130f99f1075fb37832_00_7
23-Jun-2009 13:21:48 [World Community Grid] [rr_sim] 0.00: 
R00295_9e09e8ff49b4b9130f99f1075fb37832_00_7 finishes after 76402.24 
(54704.91G/0.72G)
23-Jun-2009 13:21:48 [CPDN Beta] [rr_sim] 76402.24: 
hadcm3l_cobg_2000_2_000013347_4 finishes after 195268.08 (416577.42G/2.13G)
23-Jun-2009 13:21:48 [climateprediction.net] [rr_sim] 271670.32: 
hadam3p_monx_1994_2_006133799_1 finishes after 74473.16 (105918.79G/1.42G)
23-Jun-2009 13:21:48 [climateprediction.net] [rr_sim] 346143.49: 
hadcm3istd_cq3s_1920_160_06017964_6 finishes after 64766.51 (137785.64G/2.13G)
23-Jun-2009 13:21:48 [CPDN Beta] [rr_sim] 410909.99: 
hadcm3l_ckgv_2000_2_000013319_0 finishes after 273751.27 (632677.98G/2.31G)
23-Jun-2009 13:21:48 [climateprediction.net] [rr_sim] 684661.27: 
hadcm3istd_crbg_1920_160_06019536_0 finishes after 3984050.02 
(9182065.32G/2.30G)
23-Jun-2009 13:21:48 [malariacontrol.net] chosen: CPU starved
23-Jun-2009 13:21:48 [---] [wfd] ------- start work fetch state -------
23-Jun-2009 13:21:48 [---] [wfd] target work buffer: 43200.00 + 43200.00 sec
23-Jun-2009 13:21:48 [---] [wfd] CPU: shortfall 0.00 nidle 0.00 est. delay 
346143.49 RS fetchable 1400.00 runnable 1300.00
23-Jun-2009 13:21:48 [climateprediction.net] [wfd] CPU: fetch share 0.43 debt 
-158801.54 backoff dt 0.00 int 0.00 (overworked)
23-Jun-2009 13:21:48 [CPDN Beta] [wfd] CPU: fetch share 0.43 debt -34.60 
backoff dt 0.00 int 0.00
23-Jun-2009 13:21:48 [or...@home] [wfd] CPU: fetch share 0.00 debt -0.00 
backoff dt 0.00 int 0.00 (no new tasks)
23-Jun-2009 13:21:48 [...@home] [wfd] CPU: fetch share 0.00 debt -0.00 backoff 
dt 0.00 int 0.00 (no new tasks)
23-Jun-2009 13:21:48 [malariacontrol.net] [wfd] CPU: fetch share 0.07 debt 0.00 
backoff dt 0.00 int 120.00
23-Jun-2009 13:21:48 [World Community Grid] [wfd] CPU: fetch share 0.07 debt 
-90409.51 backoff dt 0.00 int 0.00 (overworked)
23-Jun-2009 13:21:48 [climateprediction.net] [wfd] overall_debt -158802
23-Jun-2009 13:21:48 [CPDN Beta] [wfd] overall_debt -35
23-Jun-2009 13:21:48 [or...@home] [wfd] overall_debt -0
23-Jun-2009 13:21:48 [...@home] [wfd] overall_debt -0
23-Jun-2009 13:21:48 [malariacontrol.net] [wfd] overall_debt 0
23-Jun-2009 13:21:48 [World Community Grid] [wfd] overall_debt -90410
23-Jun-2009 13:21:48 [---] [wfd] ------- end work fetch state -------
23-Jun-2009 13:21:48 [malariacontrol.net] [wfd] request: CPU (6171.43 sec, 0) 
CUDA (0.00 sec, 0)
23-Jun-2009 13:21:48 [malariacontrol.net] [sched_op_debug] Starting scheduler 
request
23-Jun-2009 13:21:48 [malariacontrol.net] Sending scheduler request: To fetch 
work.
23-Jun-2009 13:21:48 [malariacontrol.net] Requesting new tasks
23-Jun-2009 13:21:48 [malariacontrol.net] [sched_op_debug] CPU work request: 
6171.43 seconds; 0 idle CPUs
23-Jun-2009 13:21:53 [malariacontrol.net] Scheduler request completed: got 0 
new tasks
23-Jun-2009 13:21:53 [malariacontrol.net] [sched_op_debug] Server version 607
23-Jun-2009 13:21:53 [malariacontrol.net] Message from server: No work sent
23-Jun-2009 13:21:53 [malariacontrol.net] Message from server: No work is 
available for malariacontrol.net test version
23-Jun-2009 13:21:53 [malariacontrol.net] Message from server: No work is 
available for Prediction of Malaria Prevalence
23-Jun-2009 13:21:53 [malariacontrol.net] Message from server: No work is 
available for Estimation of parameters of infection dynamics (variable 
duration, max 4h)
23-Jun-2009 13:21:53 [malariacontrol.net] Message from server: (won't finish in 
time) BOINC runs 98.8% of time, computation enabled 100.0% of that
23-Jun-2009 13:21:53 [malariacontrol.net] Project requested delay of 11 seconds
23-Jun-2009 13:21:53 [malariacontrol.net] [wfd] backing off CPU 158 sec
23-Jun-2009 13:21:53 [malariacontrol.net] [sched_op_debug] Deferring 
communication for 11 sec
23-Jun-2009 13:21:53 [malariacontrol.net] [sched_op_debug] Reason: requested by 
project
23-Jun-2009 13:21:53 [---] [work_fetch_debug] Request work fetch: RPC complete
23-Jun-2009 13:21:58 [---] [rr_sim] rr_sim start: work_buf_total 86400.00
23-Jun-2009 13:21:58 [climateprediction.net] [rr_sim] 0.00: starting 
hadcm3istd_cq3s_1920_160_06017964_6
23-Jun-2009 13:21:58 [climateprediction.net] [rr_sim] 0.00: starting 
hadcm3istd_crbg_1920_160_06019536_0
23-Jun-2009 13:21:58 [climateprediction.net] [rr_sim] 0.00: starting 
hadam3p_monx_1994_2_006133799_1
23-Jun-2009 13:21:58 [CPDN Beta] [rr_sim] 0.00: starting 
hadcm3l_cobg_2000_2_000013347_4
23-Jun-2009 13:21:58 [CPDN Beta] [rr_sim] 0.00: starting 
hadcm3l_ckgv_2000_2_000013319_0
23-Jun-2009 13:21:58 [World Community Grid] [rr_sim] 0.00: starting 
R00295_9e09e8ff49b4b9130f99f1075fb37832_00_7
23-Jun-2009 13:21:58 [World Community Grid] [rr_sim] 0.00: 
R00295_9e09e8ff49b4b9130f99f1075fb37832_00_7 finishes after 76460.15 
(54746.39G/0.72G)
23-Jun-2009 13:21:58 [CPDN Beta] [rr_sim] 76460.15: 
hadcm3l_cobg_2000_2_000013347_4 finishes after 195198.43 (416428.90G/2.13G)
23-Jun-2009 13:21:58 [climateprediction.net] [rr_sim] 271658.59: 
hadam3p_monx_1994_2_006133799_1 finishes after 74454.54 (105892.31G/1.42G)
23-Jun-2009 13:21:58 [climateprediction.net] [rr_sim] 346113.13: 
hadcm3istd_cq3s_1920_160_06017964_6 finishes after 64786.70 (137828.63G/2.13G)
23-Jun-2009 13:21:58 [CPDN Beta] [rr_sim] 410899.83: 
hadcm3l_ckgv_2000_2_000013319_0 finishes after 273750.90 (632677.21G/2.31G)
23-Jun-2009 13:21:58 [climateprediction.net] [rr_sim] 684650.73: 
hadcm3istd_crbg_1920_160_06019536_0 finishes after 3984049.78 
(9182066.09G/2.30G)
23-Jun-2009 13:21:58 [---] [wfd] ------- start work fetch state -------
23-Jun-2009 13:21:58 [---] [wfd] target work buffer: 43200.00 + 43200.00 sec
23-Jun-2009 13:21:58 [---] [wfd] CPU: shortfall 0.00 nidle 0.00 est. delay 
346113.13 RS fetchable 1300.00 runnable 1300.00
23-Jun-2009 13:21:58 [climateprediction.net] [wfd] CPU: fetch share 0.46 debt 
-158801.54 backoff dt 0.00 int 0.00 (overworked)
23-Jun-2009 13:21:58 [CPDN Beta] [wfd] CPU: fetch share 0.46 debt -34.60 
backoff dt 0.00 int 0.00
23-Jun-2009 13:21:58 [or...@home] [wfd] CPU: fetch share 0.00 debt -0.00 
backoff dt 0.00 int 0.00 (no new tasks)
23-Jun-2009 13:21:58 [...@home] [wfd] CPU: fetch share 0.00 debt -0.00 backoff 
dt 0.00 int 0.00 (no new tasks)
23-Jun-2009 13:21:58 [malariacontrol.net] [wfd] CPU: fetch share 0.00 debt 0.00 
backoff dt 153.06 int 240.00 (comm deferred)
23-Jun-2009 13:21:58 [World Community Grid] [wfd] CPU: fetch share 0.08 debt 
-90409.51 backoff dt 0.00 int 0.00 (overworked)
23-Jun-2009 13:21:58 [climateprediction.net] [wfd] overall_debt -158802
23-Jun-2009 13:21:58 [CPDN Beta] [wfd] overall_debt -35
23-Jun-2009 13:21:58 [or...@home] [wfd] overall_debt -0
23-Jun-2009 13:21:58 [...@home] [wfd] overall_debt -0
23-Jun-2009 13:21:58 [malariacontrol.net] [wfd] overall_debt 0
23-Jun-2009 13:21:58 [World Community Grid] [wfd] overall_debt -90410
23-Jun-2009 13:21:58 [---] [wfd] ------- end work fetch state -------
23-Jun-2009 13:21:58 [---] [wfd] No project chosen for work fetch
23-Jun-2009 13:22:04 [---] [work_fetch_debug] Request work fetch: Backoff ended 
for malariacontrol.net

<scheduler_request>
    <core_client_major_version>6</core_client_major_version>
    <core_client_minor_version>6</core_client_minor_version>
    <core_client_release>36</core_client_release>
    <resource_share_fraction>0.062500</resource_share_fraction>
    <rrs_fraction>0.076923</rrs_fraction>
    <prrs_fraction>0.071429</prrs_fraction>
    <duration_correction_factor>1.041276</duration_correction_factor>
    <sandbox>1</sandbox>
    <work_req_seconds>6171.428571</work_req_seconds>
    <cpu_req_secs>6171.428571</cpu_req_secs>
    <cpu_req_instances>0</cpu_req_instances>
    <estimated_delay>329866.255926</estimated_delay>
<time_stats>
    <on_frac>0.988470</on_frac>
    <connected_frac>0.981387</connected_frac>
    <active_frac>0.999552</active_frac>
</time_stats>
<in_progress_results>
    <ip_result>
        <name>hadcm3istd_cq3s_1920_160_06017964_6</name>
        <report_deadline>1314757562.000000</report_deadline>
        <cpu_time_remaining>260807.591926</cpu_time_remaining>
    </ip_result>
    <ip_result>
        <name>hadcm3istd_crbg_1920_160_06019536_0</name>
        <report_deadline>1319369584.000000</report_deadline>
        <cpu_time_remaining>4473213.968069</cpu_time_remaining>
    </ip_result>
    <ip_result>
        <name>hadam3p_monx_1994_2_006133799_1</name>
        <report_deadline>1274999042.000000</report_deadline>
        <cpu_time_remaining>200564.208541</cpu_time_remaining>
    </ip_result>
    <ip_result>
        <name>hadcm3l_cobg_2000_2_000013347_4</name>
        <report_deadline>1274740194.000000</report_deadline>
        <cpu_time_remaining>236606.380053</cpu_time_remaining>
    </ip_result>
    <ip_result>
        <name>hadcm3l_ckgv_2000_2_000013319_0</name>
        <report_deadline>1275093201.000000</report_deadline>
        <cpu_time_remaining>648154.723334</cpu_time_remaining>
    </ip_result>
    <ip_result>
        <name>R00295_9e09e8ff49b4b9130f99f1075fb37832_00_7</name>
        <report_deadline>1246620982.000000</report_deadline>
        <cpu_time_remaining>18115.231243</cpu_time_remaining>
    </ip_result>
</in_progress_results>
</scheduler_request>

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
