Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories without ensuring they're empty

David Anderson Mon, 18 May 2015 12:44:58 -0700

Maybe, though there wasn't network communication after the delay.
What about the system-call tracing idea?


On 18-May-2015 12:07 PM, Rom Walton wrote:

The 12-second delay might be related to DNS name resolution.

On Windows, we disabled async name resolution in libcurl because of random 
crashes in the async code.

----- Rom

-----Original Message-----
From: boinc_dev [mailto:boinc_dev-boun...@ssl.berkeley.edu] On Behalf Of Jacob Klein
Sent: Monday, May 18, 2015 2:12 PM
To: David Anderson; Richard Haselgrove; Seke Rob
Cc: BOINC Development
Subject: Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories without ensuring they're empty

Process Monitor can be used to "watch the things a process does" (you have to
set up the correct filters, etc.), but I'm not sure whether that includes sleeps.
If the process is waiting on a file or something, though, it should be able to
tell you. Worth looking into.

https://technet.microsoft.com/en-us/library/bb896645.aspx

Regards,
Jacob

Date: Mon, 18 May 2015 10:41:16 -0700
From: da...@ssl.berkeley.edu
To: r.haselgr...@btopenworld.com; onec...@hotmail.com; jacob_w_kl...@msn.com
CC: boinc_dev@ssl.berkeley.edu
Subject: Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories without ensuring they're empty

I looked at this and couldn't figure out the source of the 12-sec delay.

In general, delays could happen because

1) the client does something that takes a long time (like copying a 5 GB file)

2) the client sleeps (i.e. calls boinc_sleep()). It does this in a few
   situations, like backing off and retrying a file system operation.

But there's no indication that either of these is happening here.
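
As a rough illustration of point 2, a retry loop that sleeps between attempts can add up to a multi-second stall by itself. The sketch below is not BOINC's code: `retry_with_backoff`, its parameters, and the doubling delay are all hypothetical, and the real client calls boinc_sleep() with its own retry policy. The sleep function is passed in so the example can be run without actually waiting.

```python
import time


def retry_with_backoff(op, max_tries=5, first_delay=1.0, factor=2.0,
                       sleep=time.sleep):
    """Call op() until it succeeds or max_tries attempts have failed.

    Returns (result, total_seconds_slept). Hypothetical helper, not BOINC code.
    """
    delay = first_delay
    slept = 0.0
    for attempt in range(1, max_tries + 1):
        try:
            return op(), slept
        except OSError:
            if attempt == max_tries:
                raise              # give up: let the caller see the error
            sleep(delay)           # back off before the next attempt
            slept += delay
            delay *= factor        # wait longer each time


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_delete():
        # Stand-in for a file operation that fails three times, then works.
        state["calls"] += 1
        if state["calls"] <= 3:
            raise OSError("file is locked")
        return "deleted"

    naps = []                      # record requested sleeps instead of sleeping
    result, slept = retry_with_backoff(flaky_delete, sleep=naps.append)
    print(result, naps, slept)
```

With these assumed settings, three failed attempts already cost 1 + 2 + 4 = 7 seconds of sleeping, so a handful of retries is enough to produce a stall of the size being discussed.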

Does Windows have a way of logging the system calls that a process makes
(like strace on Unix)? If so, that might reveal what the client is doing
during those 12 seconds.

-- David

On 16-May-2015 8:01 AM, Richard Haselgrove wrote:

Here is the message log file for a GPUGrid task finish. The 12-second delay
appears again between 14:26:35 and 14:26:47 - that's after the slot directory
has been cleared, and the exiting task has changed state from 'running' to
'uploading'. Two new tasks have been assigned to the GPU, but their (small)
startup files have not yet been linked to their respective slot directories.
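
Silent intervals like this can be found mechanically rather than by eye. The following is a minimal sketch, assuming the message log uses the `DD/MM/YYYY HH:MM:SS | project | message` layout seen in the log excerpts quoted further down this thread. The `find_gaps` name and the 10-second threshold are illustrative choices, not part of BOINC.

```python
from datetime import datetime


def find_gaps(lines, threshold_s=10):
    """Return (previous_time, next_time, seconds) for every silent interval
    longer than threshold_s between consecutive timestamped log lines."""
    gaps = []
    prev = None
    for line in lines:
        stamp = line.split("|", 1)[0].strip()
        try:
            now = datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")
        except ValueError:
            continue  # not a timestamped line (e.g. a wrapped continuation)
        if prev is not None:
            delta = (now - prev).total_seconds()
            if delta > threshold_s:
                gaps.append((prev, now, delta))
        prev = now
    return gaps


# Three lines taken from the log quoted later in this thread.
sample = [
    "10/05/2015 20:12:56 | CMS-dev | Starting task CMS_31107_1427806626.783437_0",
    "10/05/2015 20:14:25 | climateprediction.net | If this happens repeatedly you may need to reset the project.",
    "10/05/2015 20:14:25 | SETI@home | If this happens repeatedly you may need to reset the project.",
]

if __name__ == "__main__":
    for prev, now, delta in find_gaps(sample):
        print(f"{delta:.0f} s silent between {prev:%H:%M:%S} and {now:%H:%M:%S}")
    # prints: 89 s silent between 20:12:56 and 20:14:25
```

A gap found this way only shows that nothing was logged. It does not say whether the client was sleeping, blocked in a system call, or busy with unlogged work.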

I also attach directory listings for the slot and GPUGrid project folders at
various stages of the cleanup: the slot held 34 files totalling 44,186,727
bytes, which doesn't sound excessive: the largest file deletion (94,783,960
bytes) occurred several minutes later, when that file finished uploading.

I'll enable similar logging and watch what happens when the next GPUGrid task
starts up, but from memory, the disruption to BOINC is less severe at startup.

On Tuesday, 12 May 2015, 23:29, David Anderson <da...@ssl.berkeley.edu> wrote:

BTW: the client isn't completely single-threaded;
it uses a separate thread to do CPU throttling.
It would be feasible to also use separate threads
for serving GUI RPC connections,
which would allow the client to remain responsive even while
e.g. copying thousands of files to a slot dir.
-- David
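
To illustrate the idea only (this is not the BOINC client's actual structure), here is a toy Python sketch in which the main thread does slow slot setup while a second thread keeps answering status requests. `ToyClient`, `serve_status` and `copy_many` are hypothetical names, and the file copy is simulated with short sleeps.

```python
import threading
import time


class ToyClient:
    """Toy model: main thread copies files, a second thread serves status."""

    def __init__(self):
        self.lock = threading.Lock()    # guards files_copied
        self.files_copied = 0
        self.done = threading.Event()   # set when the copy has finished

    def serve_status(self, replies, poll_s=0.005):
        # Stand-in for a GUI RPC thread: report progress until work finishes.
        while True:
            with self.lock:
                replies.append(self.files_copied)
            if self.done.wait(poll_s):
                break

    def copy_many(self, n_files, per_file_s=0.001):
        # Stand-in for copying n_files into a slot directory.
        for _ in range(n_files):
            time.sleep(per_file_s)
            with self.lock:
                self.files_copied += 1
        self.done.set()


def run(n_files=50):
    client = ToyClient()
    replies = []
    rpc = threading.Thread(target=client.serve_status, args=(replies,))
    rpc.start()
    client.copy_many(n_files)   # slow work stays on the main thread
    rpc.join()
    return client.files_copied, replies


if __name__ == "__main__":
    copied, replies = run()
    print(f"copied {copied} files, answered {len(replies)} status requests")
```

The status thread keeps replying while the copy is in progress, at the cost of needing a lock around any state both threads touch.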

On 12-May-2015 2:40 AM, Seke Rob wrote:

                   > Reminds me of the Clean Energy Project, Phase 2, and why we have app_config
                   > and <max_concurrent> and a default control of allowing 1 'In Progress' on a
                   > host. During slot setup this project copies nearly 6700 files [symlinking
                   > was proposed long ago, as is done on several other WCG projects for the
                   > static files]. If more than one CEP2 is started, the machine feels at times
                   > like a snail, the responsiveness of the BOINC manager is poor, and many a
                   > time the less powerful systems incur 'exit zero status' errors or fail
                   > totally. On an 8-core machine I observed it could take over an hour before
                   > actual computing commenced [CPU time logged]. A boot cycle requires manually
                   > starting tasks one by one. Kevin Reed raised a ticket a few years ago for
                   > staggered starting, since the models can reach several GB and will get
                   > bigger in future. At any rate, just as these 6700 files are copied, they
                   > also need deleting at completion [physical files or symlink references].
                   > The effect of starting 1 CEP2 and finishing / packaging / zipping and
                   > transmitting can easily be several minutes without any computing, just
                   > whirring, with only elapsed time being logged. The more tasks run, the more
                   > the issue compounds, with the effect many incur: the series of 'exit zero
                   > status' errors, resetting to the start or the last checkpoint, often with
                   > hours of computing time lost.
                   >
                   > Maybe you'd like to get in touch with your confederates at WCG [Keith
                   > Uplinger] to discuss the issue further, as this is now nearing a 5-year
                   > continuous frustration [June 2010 launch, and a huge limitation on the
                   > speed of progress on this project].
                   >
                   > --SekeRob.
                   >
                   > On 12-5-2015 1:55, David Anderson wrote:
                   >> That delay looks like it's caused by deleting files or by process cleanup.
                   >> Does GPUGrid make lots of (non-output) files in the slot dir?
                   >>
                   >> Please try to repro it with slot_debug, task_debug, and heartbeat_debug set
                   >> (gui_rpc_debug not needed).
                   >>
                   >> -- David
                   >>
                   >> On 11-May-2015 10:54 AM, Richard Haselgrove wrote:
                   >>> Here's another example of a case where BOINC finds that it can't walk and
                   >>> chew gum at the same time. The event of interest is
                   >>>
                   >>> 11/05/2015 18:35:34 | GPUGRID | Computation for task e10s9_e7s6f4-GERARD_FXCXCL12_LIG_6282622-0-1-RND7898_0 finished
                   >>>
                   >>> Following that, there's a 12-second interval where neither heartbeats nor
                   >>> GUI RPC traffic was logged: during that time, the Task tab of the Manager
                   >>> was unchanging, not showing the regular update of elapsed time for running
                   >>> tasks.
                   >>>
                   >>> async_file_debug was active at the time, but found no events to log.
                   >>>
                   >>> These particular GPUGrid tasks generate around 90 MB of upload files, but I
                   >>> think they are generated directly in the project folder and don't need to
                   >>> be copied anywhere.
                   >>>
                   >>> Main log as attached file only.
                   >>>
                   >>> I'll catch a CMS-dev log later this evening, but after that, I'll be away
                   >>> for a few days and I'll have to leave the bug-chase until the weekend.
                   >>>
                   >>>
                   >>>
                   >>>
                   >>> On Monday, 11 May 2015, 9:42, Jacob Klein <jacob_w_kl...@msn.com> wrote:
                   >>>
                   >>>    I have seen this problem before, where the UI becomes unresponsive. If I
                   >>>    recall, it happens when a T4T task is being set up (i.e. after everything
                   >>>    was downloaded). For me, I don't recall the problem ever "screwing over
                   >>>    other tasks", though.
                   >>>
                   >>>    Try this to reproduce it: Attach to T4T, and get a task. It may take a
                   >>>    while to do that download, so you can "step away" for a bit. Then, once
                   >>>    that task is going, abort it. Downloading the 2nd task should be
                   >>>    instantaneous (nothing really to download), but instantiation of that
                   >>>    2nd task should cause the UI to hang (showing the "Please wait"
                   >>>    messagebox in the manager).
                   >>>
                   >>>    Does that help?
                   >>>    > Date: Sun, 10 May 2015 23:19:24 -0700
                   >>>    > From: da...@ssl.berkeley.edu
                   >>>    > To: r.haselgr...@btopenworld.com; onec...@hotmail.com
                   >>>    > CC: boinc_al...@ssl.berkeley.edu
                   >>>    > Subject: Re: [boinc_alpha] BOINC re-using slot directories without ensuring they're empty
                   >>>    >
                   >>>    > I did some initial testing and couldn't repro this;
                   >>>    > the client remains responsive while copying a 5 GB file to a slot dir.
                   >>>    > Does anyone else see this behavior?
                   >>>    >
                   >>>    > While testing this, please set the "async_file_debug" log flag.
                   >>>    > This says when asynchronous file operations start and end.
                   >>>    >
                   >>>    > -- David
                   >>>    >
                   >>>    > On 10-May-2015 12:31 PM, Richard Haselgrove wrote:
                   >>>    > > One thing that may need attention if very large files become the norm
                   >>>    > > is the single-threaded nature of some parts of the core client. My
                   >>>    > > 1-hour CMS test has just finished, and a new 24-hour test started.
                   >>>    > >
                   >>>    > > I watched this happening, and part of the process is copying a 1.33 GB
                   >>>    > > initial .vmi image file (downloaded previously by BOINC from CERN) from
                   >>>    > > the project directory to the slot directory. This took about 90
                   >>>    > > seconds: during that time, all Manager updating stopped. I'm sure it's
                   >>>    > > the copying process which inhibited updates: I was watching the slot
                   >>>    > > directory, and the .vmi image file had appeared, but other essential
                   >>>    > > startup files hadn't.
                   >>>    > >
                   >>>    > > When BOINC regained its ability to communicate, three running tasks had
                   >>>    > > exited with the dreaded (and false) 'you may need to reset the project'
                   >>>    > > advice. Inline log follows: because my last log got mangled by my ISP's
                   >>>    > > new mail interface, I'll attach it as a text file as well.
                   >>>    > >
                   >>>    > >
                   >>>    > > 10/05/2015 20:12:56 | LHC@home 1.0 | Computation for task sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1 finished
                   >>>    > > 10/05/2015 20:12:56 | CMS-dev | Starting task CMS_31107_1427806626.783437_0
                   >>>    > > 10/05/2015 20:12:56 | CMS-dev | [cpu_sched] Starting task CMS_31107_1427806626.783437_0 using CMS version 4615 (vbox64) in slot 7
                   >>>    > > 10/05/2015 20:14:25 | climateprediction.net | Task hadam3p_anz_e3g7_2013_1_009760406_0 exited with zero status but no 'finished' file
                   >>>    > > 10/05/2015 20:14:25 | climateprediction.net | If this happens repeatedly you may need to reset the project.
                   >>>    > > 10/05/2015 20:14:25 | NumberFields@home | Task wu_sf3_DS-10x271_Grp503196of682667_0 exited with zero status but no 'finished' file
                   >>>    > > 10/05/2015 20:14:25 | NumberFields@home | If this happens repeatedly you may need to reset the project.
                   >>>    > > 10/05/2015 20:14:25 | SETI@home | Task 05jl12ab.3911.10292.438086664199.12.207_1 exited with zero status but no 'finished' file
                   >>>    > > 10/05/2015 20:14:25 | SETI@home | If this happens repeatedly you may need to reset the project.
                   >>>    > > 10/05/2015 20:14:25 | climateprediction.net | [cpu_sched] Restarting task hadam3p_anz_e3g7_2013_1_009760406_0 using hadam3p_anz version 610 in slot 5
                   >>>    > > 10/05/2015 20:14:25 | NumberFields@home | [cpu_sched] Restarting task wu_sf3_DS-10x271_Grp503196of682667_0 using GetDecics version 200 in slot 0
                   >>>    > > 10/05/2015 20:14:25 | SETI@home | [cpu_sched] Restarting task 05jl12ab.3911.10292.438086664199.12.207_1 using setiathome_v7 version 700 (cuda42) in slot 2
                   >>>    > > 10/05/2015 20:14:27 | LHC@home 1.0 | Started upload of sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1_0
                   >>>    > > 10/05/2015 20:14:30 | LHC@home 1.0 | Finished upload of sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1_0
                   >>>    > >
                   >>>    > >
                   >>>    > >
                   >>>    > >
                   >>>    > >
                   >>>    > > On Sunday, 10 May 2015, 19:59, Seke Rob <onec...@hotmail.com> wrote:
                   >>>    > >
                   >>>    > >    Excellent this is all fixed and tested. Interest is/was that WCG's
                   >>>    > >    Clean Energy at some point in time was to run very large models,
                   >>>    > >    talk of 4-8GB IIRC.
                   >>>    > >
                   >>>    > >    --SekeRob
                   >>>    > >
                   >>>    > >    On May 10, 2015 20:27, Richard Haselgrove <r.haselgr...@btopenworld.com> wrote:
                   >>>    > >    CMS only has stock applications configured for delivery to 64-bit
                   >>>    > >    platforms. I've made an anonymous platform configuration using the
                   >>>    > >    32-bit VBox Windows wrapper: it has downloaded and is running its
                   >>>    > >    first 1-hour task. If that completes successfully (it seems to have
                   >>>    > >    reached the fully-operational stage), I'll try a full 24-hour task,
                   >>>    > >    which under current operational circumstances should generate a
                   >>>    > >    >4 GB file locally.
                   >>>    > >
                   >>>    > >
                   >>>    > >        On Sunday, 10 May 2015, 18:28, David Anderson <da...@ssl.berkeley.edu> wrote:
                   >>>    > >
                   >>>    > >
                   >>>    > >
                   >>>    > >    NTFS handles > 4GB files, even if the hardware and/or OS is only 32-bit.
                   >>>    > >    32-bit versions of Windows have APIs (like _stat64()) for handling
                   >>>    > >    > 4GB files. BOINC needs to use these; we fixed one place where it wasn't.
                   >>>    > >
                   >>>    > >    On Unix (Linux and Mac), BOINC uses the regular APIs (like lseek())
                   >>>    > >    but is built with a -D_FILE_OFFSET_BITS=64 flag that causes these
                   >>>    > >    functions to use 64-bit file offsets.
                   >>>    > >    However, it's possible that BOINC has bugs involving > 4GB files on
                   >>>    > >    Unix too.
                   >>>    > >    If anyone has a 32-bit Linux system, please test with the CMS project.
                   >>>    > >
                   >>>    > >    -- David
                   >>>    > >
                   >>>    > >    On 10-May-2015 3:58 AM, --SekeRob wrote:
                   >>>    > >    >
                   >>>    > >    > Just wondering, with files over 4GB and a 64 bit lib introduced, is
                   >>>    > >    > it not a CMS project requirement to run on a 64 bit OS?
                   >>>    > >    >
                   >>>    > >    >
                   >>>    > >
                   >>>    > >
                   >>>    > >    _______________________________________________
                   >>>    > >    boinc_alpha mailing list
                   >>>    > >    boinc_al...@ssl.berkeley.edu
                   >>>    > >    http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_alpha
                   >>>    > >    To unsubscribe, visit the above URL and
                   >>>    > >    (near bottom of page) enter your email address.

_______________________________________________
boinc_dev mailing list
boinc_dev@ssl.berkeley.edu
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
