Process Monitor can be used to "watch the things a process does" (you have to set
up the correct filters, etc.), but I'm not sure whether that includes sleeps. If the
process is waiting on a file or something similar, though, it should be able to tell you.
Worth looking into.
https://technet.microsoft.com/en-us/library/bb896645.aspx
Regards,
Jacob
------------------------------------------------------------------------------------
Date: Mon, 18 May 2015 10:41:16 -0700
From: da...@ssl.berkeley.edu
To: r.haselgr...@btopenworld.com; onec...@hotmail.com; jacob_w_kl...@msn.com
CC: boinc_dev@ssl.berkeley.edu
Subject: Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories without
ensuring they're empty
I looked at this and couldn't figure out the source of the 12-sec delay.
In general, delays could happen because
1) the client does something that takes a long time (like copying a 5 GB file)
2) the client sleeps (i.e. calls boinc_sleep()).
It does this in a few situations,
like backing off and retrying a file system operation.
But there's no indication that either of these is happening here.
Does Windows have a way of logging the system calls that a process makes
(like strace on Unix)?
If so that might reveal what the client is doing during those 12 seconds.
-- David
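
To make the second case concrete, here is a minimal Python sketch of backing off and retrying a file system operation. It is not the BOINC client's code (the client calls boinc_sleep()); the helper name retry_with_backoff, the delay values, and the injectable sleep parameter are all assumptions made for illustration. It only shows how a handful of retries can add up to a multi-second stall when the sleeping thread is also the one serving everything else.

```python
import time


def retry_with_backoff(op, attempts=5, first_delay=0.5, sleep=time.sleep):
    """Call op() until it succeeds, doubling the delay after each OSError.

    Hypothetical helper for illustration only.
    Returns (result, total_seconds_slept).
    """
    delay = first_delay
    slept = 0.0
    for attempt in range(attempts):
        try:
            return op(), slept
        except OSError:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller see the error
            sleep(delay)  # the calling thread is blocked for this long
            slept += delay
            delay *= 2
```

With these assumed defaults, four consecutive failures would block the caller for 0.5 + 1 + 2 + 4 = 7.5 seconds before the final attempt.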
On 16-May-2015 8:01 AM, Richard Haselgrove wrote:
Here is the message log file for a GPUGrid task finish. The 12-second delay
appears again between 14:26:35 and 14:26:47 - that's after the slot directory
has been cleared, and the exiting task has changed state from 'running' to
'uploading'. Two new tasks have been assigned to the GPU, but their (small)
startup files have not yet been linked to their respective slot directories.
I also attach directory listings for the slot and GPUGrid project folders at
various stages of the cleanup: the slot held 34 files totalling 44,186,727
bytes, which doesn't sound excessive: the largest file deletion (94,783,960
bytes) occurred several minutes later, when that file finished uploading.
I'll enable similar logging and watch what happens when the next GPUGrid task
starts up, but from memory, the disruption to BOINC is less severe at startup.
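
Silent intervals like the 12-second one described above can be located mechanically rather than by eye. The following is a rough Python sketch, assuming each log line starts with a 'DD/MM/YYYY HH:MM:SS |' timestamp as in the excerpts quoted in this thread (the date format depends on locale, so this is an assumption). The function name find_gaps and the 10-second threshold are made up for the example.

```python
from datetime import datetime


def find_gaps(lines, threshold_seconds=10):
    """Return (start, end, seconds) for each silence of at least threshold_seconds.

    Assumes lines look like 'DD/MM/YYYY HH:MM:SS | project | message';
    lines whose first field does not parse as a timestamp are skipped.
    """
    gaps = []
    previous = None
    for line in lines:
        stamp = line.split("|", 1)[0].strip()
        try:
            current = datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")
        except ValueError:
            continue  # continuation line or a different date format
        if previous is not None:
            delta = (current - previous).total_seconds()
            if delta >= threshold_seconds:
                gaps.append((previous, current, delta))
        previous = current
    return gaps
```

A gap is only meaningful if something would normally have been logged in that interval, for example with heartbeat_debug enabled; otherwise a quiet log may simply mean nothing happened.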
On Tuesday, 12 May 2015, 23:29, David Anderson <da...@ssl.berkeley.edu> wrote:
BTW: the client isn't completely single-threaded;
it uses a separate thread to do CPU throttling.
It would be feasible to also use separate threads
for serving GUI RPC connections,
which would allow the client to remain responsive even while
e.g. copying thousands of files to a slot dir.
-- David
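
The general idea of keeping a main loop responsive during a long copy can be sketched in Python as follows. This is only an illustration, not a proposal for the client's C++ code: it moves the copy onto a worker thread rather than the RPC serving, and handle_request is a hypothetical stand-in for answering one GUI RPC. The function names and the polling interval are assumptions.

```python
import shutil
import threading


def copy_in_background(src, dst):
    """Start copying src to dst on a worker thread and return the thread."""
    worker = threading.Thread(target=shutil.copyfile, args=(src, dst))
    worker.start()
    return worker


def serve_while_copying(src, dst, handle_request, poll_seconds=0.1):
    """Keep calling handle_request() until the copy finishes.

    handle_request is a hypothetical callback standing in for one GUI RPC.
    Returns how many requests were served while the copy was running.
    """
    worker = copy_in_background(src, dst)
    served = 0
    while worker.is_alive():
        handle_request()
        served += 1
        worker.join(timeout=poll_seconds)  # wait briefly, then loop again
    return served
```

Note that an exception raised inside the worker (for example a full disk) is not propagated to the caller in this sketch; a real implementation would need to report it.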
On 12-May-2015 2:40 AM, Seke Rob wrote:
> Reminds me of the Clean Energy Project, Phase 2 and why we have app_config and
> <max_concurrent> and a default control of allowing 1 'In Progress' on a host. This
> project sets up in slot copying near 6700 files [symlinking proposed long ago as
> is done on several other WCG projects for the static files]. If more than one CEP2
> is started the machine feels at times like a snail, responsiveness of the BOINC
> manager is poor, many a time the less powerful systems incurring error zero status
> exits or total fail. On an 8 core observed it could take over an hour before
> actual computing commenced [CPU time logged]. Boot cycle requires manually
> starting of tasks one by one. Kevin Reed few years ago raised a ticket for
> staggered starting, where the models can reach several GB and bigger in the
> coming. At any rate, as much as these 6700 files are copied, they also then are
> needing of deletion at completion [physical or symlink references]. The effect of
> starting 1 CEP2 and finishing / packaging / zipping and transmitting can easily
> lead to several minutes of there not being any computing, just whirring, for
> minutes, just elapsed being logged. The more run the more the issue compounds,
> with the effect of what many incur, the exit zero status series, resetting to
> start or last checkpoint with often hours of computing time lost.
>
> Maybe you'd like to get in touch with your confederates at WCG [Keith Uplinger],
> to discuss the issue further as this is now nearing a 5 year continuous frustration
> [June 2010 launch, and a huge limitation on the speed of progress on this project].
>
> --SekeRob.
>
> On 12-5-2015 1:55, David Anderson wrote:
>> That delay looks like it's caused by deleting files or by process cleanup.
>> Does GPUGrid make lots of (non-output) files in the slot dir?
>>
>> Please try to repro it with slot_debug, task_debug, and heartbeat_debug set
>> (gui_rpc_debug not needed).
>>
>> -- David
>>
>> On 11-May-2015 10:54 AM, Richard Haselgrove wrote:
>>> Here's another example of a case where BOINC finds that it can't walk and chew
>>> gum at the same time. The event of interest is
>>>
>>> 11/05/2015 18:35:34 | GPUGRID | Computation for task
>>> e10s9_e7s6f4-GERARD_FXCXCL12_LIG_6282622-0-1-RND7898_0 finished
>>>
>>> Following that, there's a 12-second interval where neither heartbeats nor GUI
>>> RPC traffic was logged: during that time, the Task tab of the Manager was
>>> unchanging, not showing the regular update of elapsed time for running tasks.
>>>
>>> async_file_debug was active at the time, but found no events to log.
>>>
>>> These particular GPUGrid tasks generate around 90 MB of upload files, but I
>>> think they are generated directly in the project folder and don't need to be
>>> copied anywhere.
>>>
>>> Main log as attached file only.
>>>
>>> I'll catch a CMS-dev log later this evening, but after that, I'll be away for a
>>> few days and I'll have to leave the bug-chase until the weekend.
>>>
>>>
>>>
>>>
>>> On Monday, 11 May 2015, 9:42, Jacob Klein <jacob_w_kl...@msn.com> wrote:
>>>
>>>
>>>
>>> I have seen this problem before, where the UI becomes unresponsive. If I
>>> recall, it happens when a T4T task is being set up (i.e. after everything was
>>> downloaded). For me, I don't recall the problem ever "screwing over other
>>> tasks", though.
>>>
>>> Try this to reproduce it: Attach to T4T, and get a task. It may take a while
>>> to do that download, so you can "step away" for a bit. Then, once that task
>>> is going, abort it. Downloading the 2nd task should be instantaneous
>>> (nothing really to download), but instantiation of that 2nd task should
>>> cause the UI to hang (showing the "Please wait" messagebox in the manager).
>>>
>>> Does that help?
>>> > Date: Sun, 10 May 2015 23:19:24 -0700
>>> > From: da...@ssl.berkeley.edu
>>> > To: r.haselgr...@btopenworld.com; onec...@hotmail.com
>>> > CC: boinc_al...@ssl.berkeley.edu
>>> > Subject: Re: [boinc_alpha] BOINC re-using slot directories without
>>> > ensuring they're empty
>>> >
>>> > I did some initial testing and couldn't repro this;
>>> > the client remains responsive while copying a 5 GB file to a slot dir.
>>> > Does anyone else see this behavior?
>>> >
>>> > While testing this, please set "async_file_debug" log flag.
>>> > This says when asynchronous file operations start and end.
>>> >
>>> > -- David
>>> >
>>> > On 10-May-2015 12:31 PM, Richard Haselgrove wrote:
>>> > > One thing that may need attention if very large files become the norm is the
>>> > > single-threaded nature of some parts of the core client. My 1-hour CMS test has
>>> > > just finished, and a new 24-hour test started.
>>> > >
>>> > >
>>> > > I watched this happening, and part of the process is copying a 1.33 GB initial
>>> > > .vmi image file (downloaded previously by BOINC from CERN) from the project
>>> > > directory to the slot directory. This took about 90 seconds: during that time, all
>>> > > Manager updating stopped. I'm sure it's the copying process which inhibited
>>> > > updates: I was watching the slot directory, and the .vmi image file had appeared,
>>> > > but other essential startup files hadn't.
>>> > >
>>> > >
>>> > > When BOINC regained its ability to communicate, three running tasks had exited
>>> > > with the dreaded (and false) 'you may need to reset the project' advice. Inline
>>> > > log follows: because my last log got mangled by my ISP's new mail interface, I'll
>>> > > attach it as a text file as well.
>>> > >
>>> > >
>>> > > 10/05/2015 20:12:56 | LHC@home 1.0 | Computation for task sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1 finished
>>> > > 10/05/2015 20:12:56 | CMS-dev | Starting task CMS_31107_1427806626.783437_0
>>> > > 10/05/2015 20:12:56 | CMS-dev | [cpu_sched] Starting task CMS_31107_1427806626.783437_0 using CMS version 4615 (vbox64) in slot 7
>>> > > 10/05/2015 20:14:25 | climateprediction.net | Task hadam3p_anz_e3g7_2013_1_009760406_0 exited with zero status but no 'finished' file
>>> > > 10/05/2015 20:14:25 | climateprediction.net | If this happens repeatedly you may need to reset the project.
>>> > > 10/05/2015 20:14:25 | NumberFields@home | Task wu_sf3_DS-10x271_Grp503196of682667_0 exited with zero status but no 'finished' file
>>> > > 10/05/2015 20:14:25 | NumberFields@home | If this happens repeatedly you may need to reset the project.
>>> > > 10/05/2015 20:14:25 | SETI@home | Task 05jl12ab.3911.10292.438086664199.12.207_1 exited with zero status but no 'finished' file
>>> > > 10/05/2015 20:14:25 | SETI@home | If this happens repeatedly you may need to reset the project.
>>> > > 10/05/2015 20:14:25 | climateprediction.net | [cpu_sched] Restarting task hadam3p_anz_e3g7_2013_1_009760406_0 using hadam3p_anz version 610 in slot 5
>>> > > 10/05/2015 20:14:25 | NumberFields@home | [cpu_sched] Restarting task wu_sf3_DS-10x271_Grp503196of682667_0 using GetDecics version 200 in slot 0
>>> > > 10/05/2015 20:14:25 | SETI@home | [cpu_sched] Restarting task 05jl12ab.3911.10292.438086664199.12.207_1 using setiathome_v7 version 700 (cuda42) in slot 2
>>> > > 10/05/2015 20:14:27 | LHC@home 1.0 | Started upload of sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1_0
>>> > > 10/05/2015 20:14:30 | LHC@home 1.0 | Finished upload of sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1_0
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > On Sunday, 10 May 2015, 19:59, Seke Rob <onec...@hotmail.com> wrote:
>>> > >
>>> > >
>>> > >
>>> > > Excellent this is all fixed and tested. Interest is/was that WCG's Clean
>>> > > Energy at some point in time was to run very large models, talk of 4-8GB IIRC.
>>> > >
>>> > > --SekeRob
>>> > >
>>> > > On May 10, 2015 20:27, Richard Haselgrove <r.haselgr...@btopenworld.com> wrote:
>>> > > CMS only has stock applications configured for delivery to 64-bit platforms.
>>> > > I've made an anonymous platform configuration using the 32-bit VBox Windows
>>> > > wrapper: it has downloaded and is running its first 1-hour task. If that
>>> > > completes successfully (it seems to have reached the fully-operational stage),
>>> > > I'll try a full 24-hour task, which under current operational circumstances
>>> > > should generate a >4 GB file locally.
>>> > >
>>> > >
>>> > > On Sunday, 10 May 2015, 18:28, David Anderson <da...@ssl.berkeley.edu> wrote:
>>> > >
>>> > >
>>> > >
>>> > > NTFS handles > 4GB files, even if the hardware and/or OS is only 32-bit.
>>> > > 32-bit versions of Windows have APIs (like _stat64()) for handling > 4GB files.
>>> > > BOINC needs to use these; we fixed one place where it wasn't.
>>> > >
>>> > > On Unix (Linux and Mac), BOINC uses the regular APIs (like lseek()) but is
>>> > > built with a -D_FILE_OFFSET_BITS=64 flag that causes these functions to use
>>> > > 64-bit file sizes and offsets.
>>> > > However, it's possible that BOINC has bugs involving > 4GB files on Unix too.
>>> > > If anyone has a 32-bit Linux system, please test with the CMS project.
>>> > >
>>> > > -- David
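
To see why 64-bit sizes and offsets matter here, the arithmetic can be sketched in a few lines of Python. This is an illustration only, not BOINC code: it assumes a file size or offset is stored in a signed 32-bit integer, which is the usual situation on a 32-bit Unix build without the flag mentioned above, and the helper name as_signed_32 is made up for the example.

```python
def as_signed_32(value):
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    low = value & 0xFFFFFFFF
    return low - 0x100000000 if low >= 0x80000000 else low


GIB = 1024 ** 3
print(as_signed_32(1 * GIB))  # still fits: 1073741824
print(as_signed_32(3 * GIB))  # wraps to a negative value: -1073741824
print(as_signed_32(5 * GIB))  # silently wraps back to 1 GiB: 1073741824
```

Under that assumption a 5 GB file would be reported with the same size as a 1 GB one, which is the kind of bug that only shows up when testing with genuinely large files.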
>>> > >
>>> > > On 10-May-2015 3:58 AM, --SekeRob wrote:
>>> > > >
>>> > > > Just wondering, with files over 4GB and a 64 bit lib introduced, is it not a CMS
>>> > > > project requirement to run on a 64 bit OS?
>>> > > >
>>> > > >
>>> > >
>>> > > _______________________________________________
>>> > > boinc_alpha mailing list
>>> > > boinc_al...@ssl.berkeley.edu
>>> > > http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_alpha
>>> > > To unsubscribe, visit the above URL and
>>> > > (near bottom of page) enter your email address.
>>>
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> >
>>>
>>>
>>>
>>
>
>
>
>
_______________________________________________
boinc_dev mailing list
boinc_dev@ssl.berkeley.edu
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.