The 12-second delay might be related to DNS name resolution.
On Windows, we disabled async name resolution in libcurl because of random
crashes in the async code.
----- Rom
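One rough way to check the DNS theory on an affected host is to time a blocking name lookup. The sketch below is only an illustration: it uses Python's standard socket.getaddrinfo as a stand-in for a synchronous resolver and is not how libcurl or the BOINC client resolves names. The helper name timed_resolve and its resolver parameter are made up for this example.

```python
import socket
import time


def timed_resolve(host, resolver=socket.getaddrinfo):
    """Run one blocking name lookup and return (result, elapsed_seconds).

    `resolver` is a hypothetical hook so a fake resolver can be swapped in;
    by default the standard library's blocking getaddrinfo is used.
    """
    start = time.monotonic()
    result = resolver(host, None)  # blocks the calling thread until done
    elapsed = time.monotonic() - start
    return result, elapsed
```

If a lookup of a project's scheduler host regularly took several seconds when called this way, that would be consistent with the stall; a fast lookup would point elsewhere.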
-----Original Message-----
From: boinc_dev [mailto:boinc_dev-boun...@ssl.berkeley.edu] On Behalf Of Jacob
Klein
Sent: Monday, May 18, 2015 2:12 PM
To: David Anderson; Richard Haselgrove; Seke Rob
Cc: BOINC Development
Subject: Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories without
ensuring they're empty
Process Monitor can be used to "watch the things a process does" (you have to
set up the correct filters, etc.), but I'm not sure whether that includes sleeps. If the
process is waiting on a file or something, though, it should be able to tell you. Worth
looking into.
https://technet.microsoft.com/en-us/library/bb896645.aspx
Regards,
Jacob
Date: Mon, 18 May 2015 10:41:16 -0700
From: da...@ssl.berkeley.edu
To: r.haselgr...@btopenworld.com; onec...@hotmail.com; jacob_w_kl...@msn.com
CC: boinc_dev@ssl.berkeley.edu
Subject: Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories without
ensuring they're empty
I looked at this and couldn't figure out the source of the 12-sec delay.
In general, delays could happen because
1) the client does something that takes a long time (like copying a 5 GB file)
2) the client sleeps (i.e. calls boinc_sleep()).
   It does this in a few situations,
   like backing off and retrying a file system operation.
But there's no indication that either of these is happening here.
Does Windows have a way of logging the system calls that a process makes
(like strace on Unix)?
If so that might reveal what the client is doing during those 12 seconds.
-- David
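To illustrate how "backing off and retrying a file system operation" can add up to a multi-second stall, here is a minimal sketch. It is not the client's code: the function name retry_with_backoff, the attempt count and the delays are all assumptions chosen for the example, and time.sleep stands in for boinc_sleep().

```python
import time


def retry_with_backoff(op, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call op() until it succeeds, doubling the pause after each OSError.

    All parameters are hypothetical; `sleep` is injectable so the pauses
    can be recorded instead of actually waited for.
    """
    delay = base_delay
    for attempt in range(attempts):
        try:
            return op()
        except OSError:
            if attempt == attempts - 1:
                raise      # out of retries: let the caller see the error
            sleep(delay)   # the calling thread does nothing else meanwhile
            delay *= 2
```

With these made-up defaults, an operation that fails four times before succeeding would sleep 0.5 + 1 + 2 + 4 = 7.5 seconds in total, during which a single-threaded caller could not log heartbeats or answer GUI RPCs.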
On 16-May-2015 8:01 AM, Richard Haselgrove wrote:
Here is the message log file for a GPUGrid task finish. The 12-second delay
appears again between 14:26:35 and 14:26:47 - that's after the slot directory
has been cleared, and the exiting task has changed state from 'running' to
'uploading'. Two new tasks have been assigned to the GPU, but their (small)
startup files have not yet been linked to their respective slot directories.
I also attach directory listings for the slot and GPUGrid project folders
at various stages of the cleanup: the slot held 34 files totalling
44,186,727 bytes, which doesn't sound excessive: the largest file deletion
(94,783,960 bytes) occurred several minutes later, when that file finished
uploading.
I'll enable similar logging and watch what happens when the next GPUGrid task
starts up, but from memory, the disruption to BOINC is less severe at startup.
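Gaps like the 12-second one described above can be picked out of a saved message log mechanically. The sketch below assumes the line layout of the log excerpt quoted in this thread ("DD/MM/YYYY HH:MM:SS | project | message"); the function name find_gaps and the 10-second default threshold are made up for the example.

```python
from datetime import datetime


def find_gaps(lines, threshold_seconds=10):
    """Return (previous, current) timestamp pairs separated by a long gap.

    Assumes each log line starts with 'DD/MM/YYYY HH:MM:SS |'. Lines that
    do not start with a timestamp (wrapped continuations) are skipped.
    """
    gaps = []
    previous = None
    for line in lines:
        stamp = line.split("|", 1)[0].strip()
        try:
            current = datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")
        except ValueError:
            continue  # not a timestamped line
        if previous is not None:
            if (current - previous).total_seconds() >= threshold_seconds:
                gaps.append((previous, current))
        previous = current
    return gaps
```

Run over the 10 May excerpt further down the thread, this would flag the jump from 20:12:56 to 20:14:25 (89 seconds), which matches the "about 90 seconds" copy reported there.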
On Tuesday, 12 May 2015, 23:29, David Anderson <da...@ssl.berkeley.edu> wrote:
BTW: the client isn't completely single-threaded;
it uses a separate thread to do CPU throttling.
It would be feasible to also use separate threads
for serving GUI RPC connections,
which would allow the client to remain responsive even while
e.g. copying thousands of files to a slot dir.
-- David
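The general idea of keeping a loop responsive while slow file work runs elsewhere can be sketched as follows. This is a toy in Python, not the client's design: here the slow job is moved to a worker thread while the main loop keeps polling, whereas the message above suggests giving GUI RPC serving its own thread. The name run_with_responsive_loop and the polling interval are assumptions for the example.

```python
import threading
import time


def run_with_responsive_loop(slow_task, poll_interval=0.01):
    """Run slow_task() in a worker thread while the caller keeps looping.

    Returns (task_result, polls), where `polls` counts how many times the
    main loop got to run while the task was still busy. Each poll stands
    in for serving one GUI RPC request or logging one heartbeat.
    """
    result = {}

    def worker():
        result["value"] = slow_task()

    thread = threading.Thread(target=worker)
    thread.start()
    polls = 0
    while thread.is_alive():
        polls += 1                  # placeholder for real main-loop work
        thread.join(poll_interval)  # wait briefly, then loop again
    return result["value"], polls
```

A slow_task that copies files for 90 seconds would still leave the loop running roughly once per poll interval, instead of stopping all Manager updates for the duration.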
On 12-May-2015 2:40 AM, Seke Rob wrote:
> Reminds me of the Clean Energy Project, Phase 2 and why we have app_config and
> <max_concurrent> and a default control of allowing 1 'In Progress' on a host. This
> project sets up its slot by copying nearly 6700 files [symlinking proposed long ago,
> as is done on several other WCG projects for the static files]. If more than one CEP2
> is started the machine feels at times like a snail, responsiveness of the BOINC
> manager is poor, and many a time the less powerful systems incur 'exited with zero
> status' errors or fail totally. On an 8-core machine I observed it could take over an
> hour before actual computing commenced [CPU time logged]. A boot cycle requires
> manually starting tasks one by one. Kevin Reed raised a ticket a few years ago for
> staggered starting, as the models can reach several GB and will get bigger in
> future. At any rate, as much as these 6700 files are copied, they also then
> need deleting at completion [physical or symlink references]. The effect of
> starting 1 CEP2 and finishing / packaging / zipping and transmitting can easily
> lead to several minutes of there not being any computing, just whirring, with
> only elapsed time being logged. The more that run, the more the issue compounds,
> with the effect that many incur: the 'exited with zero status' series, resetting to
> start or last checkpoint, often with hours of computing time lost.
>
> Maybe you'd like to get in touch with your confederates at WCG [Keith Uplinger]
> to discuss the issue further, as this is now nearing a 5-year continuous frustration
> [June 2010 launch, and a huge limitation on the speed of progress on this project].
>
> --SekeRob.
>
> On 12-5-2015 1:55, David Anderson wrote:
>> That delay looks like it's caused by deleting files or by process cleanup.
>> Does GPUGrid make lots of (non-output) files in the slot dir?
>>
>> Please try to repro it with slot_debug, task_debug, and heartbeat_debug set
>> (gui_rpc_debug not needed).
>>
>> -- David
>>
>> On 11-May-2015 10:54 AM, Richard Haselgrove wrote:
>>> Here's another example of a case where BOINC finds that it can't walk and chew
>>> gum at the same time. The event of interest is
>>>
>>> 11/05/2015 18:35:34 | GPUGRID | Computation for task e10s9_e7s6f4-GERARD_FXCXCL12_LIG_6282622-0-1-RND7898_0 finished
>>>
>>> Following that, there's a 12-second interval where neither heartbeats nor GUI
>>> RPC traffic was logged: during that time, the Task tab of the Manager was
>>> unchanging, not showing the regular update of elapsed time for running tasks.
>>>
>>> async_file_debug was active at the time, but found no events to log.
>>>
>>> These particular GPUGrid tasks generate around 90 MB of upload files, but I
>>> think they are generated directly in the project folder and don't need to be
>>> copied anywhere.
>>>
>>> Main log as attached file only.
>>>
>>> I'll catch a CMS-dev log later this evening, but after that, I'll be away for a
>>> few days and I'll have to leave the bug-chase until the weekend.
>>>
>>> On Monday, 11 May 2015, 9:42, Jacob Klein <jacob_w_kl...@msn.com> wrote:
>>>
>>> I have seen this problem before, where the UI becomes unresponsive. If I
>>> recall, it happens when a T4T task is being set up (ie: after everything was
>>> downloaded). For me, I don't recall the problem ever "screwing over other
>>> tasks", though.
>>>
>>> Try this to reproduce it: Attach to T4T, and get a task. It may take a while
>>> to do that download, so you can "step away" for a bit. Then, once that task
>>> is going, abort it. Downloading the 2nd task should be instantaneous
>>> (nothing really to download), but instantiation of that 2nd task should
>>> cause the UI to hang (showing the "Please wait" messagebox in the manager).
>>>
>>> Does that help?
>>> > Date: Sun, 10 May 2015 23:19:24 -0700
>>> > From: da...@ssl.berkeley.edu
>>> > To: r.haselgr...@btopenworld.com; onec...@hotmail.com
>>> > CC: boinc_al...@ssl.berkeley.edu
>>> > Subject: Re: [boinc_alpha] BOINC re-using slot directories without
>>> > ensuring they're empty
>>> >
>>> > I did some initial testing and couldn't repro this;
>>> > the client remains responsive while copying a 5 GB file to a slot dir.
>>> > Does anyone else see this behavior?
>>> >
>>> > While testing this, please set "async_file_debug" log flag.
>>> > This says when asynchronous file operations start and end.
>>> >
>>> > -- David
>>> >
>>> > On 10-May-2015 12:31 PM, Richard Haselgrove wrote:
>>> > > One thing that may need attention if very large files become the norm is the
>>> > > single-threaded nature of some parts of the core client. My 1-hour CMS test has
>>> > > just finished, and a new 24-hour test started.
>>> > >
>>> > > I watched this happening, and part of the process is copying a 1.33 GB initial
>>> > > .vmi image file (downloaded previously by BOINC from CERN) from the project
>>> > > directory to the slot directory. This took about 90 seconds: during that time, all
>>> > > Manager updating stopped. I'm sure it's the copying process which inhibited
>>> > > updates: I was watching the slot directory, and the .vmi image file had appeared,
>>> > > but other essential startup files hadn't.
>>> > >
>>> > > When BOINC regained its ability to communicate, three running tasks had exited
>>> > > with the dreaded (and false) 'you may need to reset the project' advice. Inline
>>> > > log follows: because my last log got mangled by my ISP's new mail interface, I'll
>>> > > attach it as a text file as well.
>>> > >
>>> > > 10/05/2015 20:12:56 | LHC@home 1.0 | Computation for task sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1 finished
>>> > > 10/05/2015 20:12:56 | CMS-dev | Starting task CMS_31107_1427806626.783437_0
>>> > > 10/05/2015 20:12:56 | CMS-dev | [cpu_sched] Starting task CMS_31107_1427806626.783437_0 using CMS version 4615 (vbox64) in slot 7
>>> > > 10/05/2015 20:14:25 | climateprediction.net | Task hadam3p_anz_e3g7_2013_1_009760406_0 exited with zero status but no 'finished' file
>>> > > 10/05/2015 20:14:25 | climateprediction.net | If this happens repeatedly you may need to reset the project.
>>> > > 10/05/2015 20:14:25 | NumberFields@home | Task wu_sf3_DS-10x271_Grp503196of682667_0 exited with zero status but no 'finished' file
>>> > > 10/05/2015 20:14:25 | NumberFields@home | If this happens repeatedly you may need to reset the project.
>>> > > 10/05/2015 20:14:25 | SETI@home | Task 05jl12ab.3911.10292.438086664199.12.207_1 exited with zero status but no 'finished' file
>>> > > 10/05/2015 20:14:25 | SETI@home | If this happens repeatedly you may need to reset the project.
>>> > > 10/05/2015 20:14:25 | climateprediction.net | [cpu_sched] Restarting task hadam3p_anz_e3g7_2013_1_009760406_0 using hadam3p_anz version 610 in slot 5
>>> > > 10/05/2015 20:14:25 | NumberFields@home | [cpu_sched] Restarting task wu_sf3_DS-10x271_Grp503196of682667_0 using GetDecics version 200 in slot 0
>>> > > 10/05/2015 20:14:25 | SETI@home | [cpu_sched] Restarting task 05jl12ab.3911.10292.438086664199.12.207_1 using setiathome_v7 version 700 (cuda42) in slot 2
>>> > > 10/05/2015 20:14:27 | LHC@home 1.0 | Started upload of sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1_0
>>> > > 10/05/2015 20:14:30 | LHC@home 1.0 | Finished upload of sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1_0
>>> > >
>>> > > On Sunday, 10 May 2015, 19:59, Seke Rob <onec...@hotmail.com> wrote:
>>> > >
>>> > > Excellent this is all fixed and tested. Interest is/was that WCG's Clean
>>> > > Energy at some point in time was to run very large models, talk of 4-8GB IIRC.
>>> > >
>>> > > --SekeRob
>>> > >
>>> > > On May 10, 2015 20:27, Richard Haselgrove <r.haselgr...@btopenworld.com> wrote:
>>> > > CMS only has stock applications configured for delivery to 64-bit platforms.
>>> > > I've made an anonymous platform configuration using the 32-bit VBox Windows
>>> > > wrapper: it has downloaded and is running its first 1-hour task. If that
>>> > > completes successfully (it seems to have reached the fully-operational stage),
>>> > > I'll try a full 24-hour task, which under current operational circumstances
>>> > > should generate a >4 GB file locally.
>>> > >
>>> > > On Sunday, 10 May 2015, 18:28, David Anderson <da...@ssl.berkeley.edu> wrote:
>>> > >
>>> > > NTFS handles > 4GB files, even if the hardware and/or OS is only 32-bit.
>>> > > 32-bit versions of Windows have APIs (like _stat64()) for handling > 4GB files.
>>> > > BOINC needs to use these; we fixed one place where it wasn't.
>>> > >
>>> > > On Unix (Linux and Mac), BOINC uses the regular APIs (like lseek()) but is
>>> > > built with a -D_FILE_OFFSET_BITS=64 flag that causes these functions to use
>>> > > 64-bit file offsets.
>>> > > However, it's possible that BOINC has bugs involving > 4GB files on Unix too.
>>> > > If anyone has a 32-bit Linux system, please test with the CMS project.
>>> > >
>>> > > -- David
>>> > >
>>> > > On 10-May-2015 3:58 AM, --SekeRob wrote:
>>> > > >
>>> > > > Just wondering, with files over 4GB and a 64 bit lib introduced, is it not a CMS
>>> > > > project requirement to run on a 64 bit OS?
>>> > > >
>>> > >
>>> > > _______________________________________________
>>> > > boinc_alpha mailing list
>>> > > boinc_al...@ssl.berkeley.edu
>>> > > http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_alpha
>>> > > To unsubscribe, visit the above URL and
>>> > > (near bottom of page) enter your email address.
_______________________________________________
boinc_dev mailing list
boinc_dev@ssl.berkeley.edu
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.