Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories without ensuring they're empty

David Anderson Mon, 18 May 2015 12:58:50 -0700

That looks like what's needed.
Richard, if you can repro the inter-job delay,
you could try using Process Monitor to capture as much
as possible from the client during that period.
-- David


On 18-May-2015 11:12 AM, Jacob Klein wrote:

Process Monitor can be used to "watch the things a process does" (you have to set up the correct filters, etc.), but I'm not sure whether that includes sleeps. If the process is waiting on a file or something, though, it should be able to tell you. Worth looking into.


https://technet.microsoft.com/en-us/library/bb896645.aspx

Regards,
Jacob


------------------------------------------------------------------------------------
Date: Mon, 18 May 2015 10:41:16 -0700
From: da...@ssl.berkeley.edu
To: r.haselgr...@btopenworld.com; onec...@hotmail.com; jacob_w_kl...@msn.com
CC: boinc_dev@ssl.berkeley.edu

Subject: Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories without ensuring they're empty


I looked at this and couldn't figure out the source of the 12-sec delay.
In general, delays could happen because
1) the client does something that takes a long time (like copying a 5 GB file)
2) the client sleeps (i.e. calls boinc_sleep()).
   It does this in a few situations,
   like backing off and retrying a file system operation.
But there's no indication that either of these is happening here.
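
To illustrate case 2, here is a minimal Python sketch of how backing off and retrying a file system operation can add up to several seconds of sleeping. This is not the BOINC client's code: retry_with_backoff, its parameters, and the doubling delay are all assumptions made for illustration.

```python
import time


def retry_with_backoff(operation, max_attempts=5, initial_delay=0.5, sleep=time.sleep):
    """Call operation() until it succeeds, sleeping longer after each OSError.

    Returns (result, total_seconds_slept). All names and defaults here are
    hypothetical; they are not taken from the BOINC source.
    """
    delay = initial_delay
    total_slept = 0.0
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), total_slept
        except OSError:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            sleep(delay)
            total_slept += delay
            delay *= 2  # back off: wait twice as long next time
```

With initial_delay=0.5, an operation that fails three times before succeeding sleeps for 0.5 + 1.0 + 2.0 = 3.5 seconds in total, so a handful of retries is enough to produce a pause of this size.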

Does Windows have a way of logging the system calls that a process makes
(like strace on Unix)?
If so that might reveal what the client is doing during those 12 seconds.

-- David

On 16-May-2015 8:01 AM, Richard Haselgrove wrote:

    Here is the message log file for a GPUGrid task finish. The 12-second delay
    appears again between 14:26:35 and 14:26:47 - that's after the slot directory
    has been cleared, and the exiting task has changed state from 'running' to
    'uploading'. Two new tasks have been assigned to the GPU, but their (small)
    startup files have not yet been linked to their respective slot directories.
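
Pauses like this can be located mechanically. The following is a minimal Python sketch, assuming the message log uses the "DD/MM/YYYY HH:MM:SS | project | message" layout seen in the log excerpts quoted in this thread; find_gaps and its threshold are hypothetical names, not part of BOINC.

```python
from datetime import datetime


def find_gaps(lines, threshold_seconds=5.0):
    """Return (start, end, seconds) for each pause longer than the threshold.

    Assumes each log line begins with a 'DD/MM/YYYY HH:MM:SS' timestamp
    followed by ' | '. Lines without such a timestamp are skipped.
    """
    gaps = []
    previous = None
    for line in lines:
        stamp_text = line.split(" | ", 1)[0].strip()
        try:
            stamp = datetime.strptime(stamp_text, "%d/%m/%Y %H:%M:%S")
        except ValueError:
            continue  # continuation line or other text without a timestamp
        if previous is not None:
            seconds = (stamp - previous).total_seconds()
            if seconds > threshold_seconds:
                gaps.append((previous, stamp, seconds))
        previous = stamp
    return gaps
```

Running it over a saved log (for example, find_gaps(open(path)) with a path of your own) lists every interval in which the client logged nothing for longer than the threshold.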

    I also attach directory listings for the slot and GPUGrid project folders at
    various stages of the cleanup: the slot held 34 files totalling 44,186,727
    bytes, which doesn't sound excessive: the largest file deletion (94,783,960
    bytes) occurred several minutes later, when that file finished uploading.

    I'll enable similar logging and watch what happens when the next GPUGrid task
    starts up, but from memory, the disruption to BOINC is less severe at startup.



    On Tuesday, 12 May 2015, 23:29, David Anderson <da...@ssl.berkeley.edu> wrote:



        BTW: the client isn't completely single-threaded;
        it uses a separate thread to do CPU throttling.
        It would be feasible to also use separate threads
        for serving GUI RPC connections,
        which would allow the client to remain responsive even while
        e.g. copying thousands of files to a slot dir.
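
The idea of keeping one thread responsive while another does slow work can be sketched in Python as follows. This is only an illustration of the general technique, not the BOINC client's design: run_with_responsive_loop is a hypothetical name, and the tick counter stands in for serving GUI RPC requests.

```python
import threading


def run_with_responsive_loop(slow_task, poll_interval=0.01):
    """Run slow_task() on a worker thread while the caller's loop keeps ticking.

    Returns (task_result, ticks), where ticks counts how many times the
    calling thread got to do other work while the task was running.
    """
    result = {}

    def worker():
        result["value"] = slow_task()

    thread = threading.Thread(target=worker)
    thread.start()
    ticks = 0
    while thread.is_alive():
        ticks += 1                   # stand-in for serving one GUI RPC request
        thread.join(poll_interval)   # wait briefly, then check again
    return result["value"], ticks
```

Here slow_task would be something like a function that copies the files into the slot directory. The sketch does not handle a task that raises an exception; a real implementation would need to report that back to the calling thread.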
        -- David

        On 12-May-2015 2:40 AM, Seke Rob wrote:
        > Reminds me of the Clean Energy Project, Phase 2, and why we have app_config
        > and <max_concurrent> and a default control allowing 1 'In Progress' task per
        > host. This project sets up its slot by copying nearly 6700 files [symlinking
        > for the static files was proposed long ago, as is done on several other WCG
        > projects]. If more than one CEP2 task is started, the machine feels at times
        > like a snail and the BOINC manager responds poorly, with less powerful systems
        > often incurring 'exit zero status' errors or total failure. On an 8-core
        > machine I observed that it could take over an hour before actual computing
        > commenced [CPU time logged]. A boot cycle requires starting tasks manually,
        > one by one. Kevin Reed raised a ticket a few years ago for staggered starting,
        > since the models can reach several GB and will get bigger in future. At any
        > rate, since these 6700 files are copied, they also need deleting at
        > completion [physical files or symlink references]. Starting 1 CEP2 task and
        > finishing / packaging / zipping and transmitting can easily lead to several
        > minutes with no computing at all, just whirring, with only elapsed time being
        > logged. The more tasks run, the more the issue compounds, and many users
        > incur the series of 'exit zero status' errors, with tasks resetting to the
        > start or the last checkpoint and often losing hours of computing time.
        >
        > Maybe you'd like to get in touch with your confederates at WCG [Keith
        > Uplinger] to discuss the issue further, as this is now nearing 5 years of
        > continuous frustration [June 2010 launch, and a huge limitation on the speed
        > of progress on this project].
        >
        > --SekeRob.
        >
        > On 12-5-2015 1:55, David Anderson wrote:
        >> That delay looks like it's caused by deleting files or by process cleanup.
        >> Does GPUGrid make lots of (non-output) files in the slot dir?
        >>
        >> Please try to repro it with slot_debug, task_debug, and heartbeat_debug set
        >> (gui_rpc_debug not needed).
        >>
        >> -- David
        >>
        >> On 11-May-2015 10:54 AM, Richard Haselgrove wrote:
        >>> Here's another example of a case where BOINC finds that it can't walk and
        >>> chew gum at the same time. The event of interest is
        >>>
        >>> 11/05/2015 18:35:34 | GPUGRID | Computation for task
        >>> e10s9_e7s6f4-GERARD_FXCXCL12_LIG_6282622-0-1-RND7898_0 finished
        >>>
        >>> Following that, there's a 12-second interval where neither heartbeats nor
        >>> GUI RPC traffic was logged: during that time, the Task tab of the Manager
        >>> was unchanging, not showing the regular update of elapsed time for running
        >>> tasks.
        >>>
        >>> async_file_debug was active at the time, but found no events to log.
        >>>
        >>> These particular GPUGrid tasks generate around 90 MB of upload files, but I
        >>> think they are generated directly in the project folder and don't need to
        >>> be copied anywhere.
        >>>
        >>> Main log as attached file only.
        >>>
        >>> I'll catch a CMS-dev log later this evening, but after that, I'll be away
        >>> for a few days and I'll have to leave the bug-chase until the weekend.
        >>>
        >>>
        >>>
        >>>
        >>> On Monday, 11 May 2015, 9:42, Jacob Klein <jacob_w_kl...@msn.com> wrote:
        >>>
        >>>
        >>>
        >>>    I have seen this problem before, where the UI becomes unresponsive. If I
        >>>    recall, it happens when a T4T task is being set up (i.e. after everything
        >>>    was downloaded). For me, I don't recall the problem ever "screwing over
        >>>    other tasks", though.
        >>>
        >>>    Try this to reproduce it: Attach to T4T, and get a task. It may take a
        >>>    while to do that download, so you can "step away" for a bit. Then, once
        >>>    that task is going, abort it. Downloading the 2nd task should be
        >>>    instantaneous (nothing really to download), but instantiation of that 2nd
        >>>    task should cause the UI to hang (showing the "Please wait" messagebox in
        >>>    the manager).
        >>>
        >>>    Does that help?
        >>>    > Date: Sun, 10 May 2015 23:19:24 -0700
        >>>    > From: da...@ssl.berkeley.edu
        >>>    > To: r.haselgr...@btopenworld.com; onec...@hotmail.com
        >>>    > CC: boinc_al...@ssl.berkeley.edu
        >>>    > Subject: Re: [boinc_alpha] BOINC re-using slot directories without ensuring they're empty
        >>>    >
        >>>    > I did some initial testing and couldn't repro this;
        >>>    > the client remains responsive while copying a 5 GB file to a slot dir.
        >>>    > Does anyone else see this behavior?
        >>>    >
        >>>    > While testing this, please set "async_file_debug" log flag.
        >>>    > This says when asynchronous file operations start and end.
        >>>    >
        >>>    > -- David
        >>>    >
        >>>    > On 10-May-2015 12:31 PM, Richard Haselgrove wrote:
        >>>    > > One thing that may need attention if very large files become the norm
        >>>    > > is the single-threaded nature of some parts of the core client. My
        >>>    > > 1-hour CMS test has just finished, and a new 24-hour test started.
        >>>    > >
        >>>    > >
        >>>    > > I watched this happening, and part of the process is copying a 1.33 GB
        >>>    > > initial .vmi image file (downloaded previously by BOINC from CERN) from
        >>>    > > the project directory to the slot directory. This took about 90
        >>>    > > seconds: during that time, all Manager updating stopped. I'm sure it's
        >>>    > > the copying process which inhibited updates: I was watching the slot
        >>>    > > directory, and the .vmi image file had appeared, but other essential
        >>>    > > startup files hadn't.
        >>>    > >
        >>>    > >
        >>>    > > When BOINC regained its ability to communicate, three running tasks had
        >>>    > > exited with the dreaded (and false) 'you may need to reset the project'
        >>>    > > advice. Inline log follows: because my last log got mangled by my ISP's
        >>>    > > new mail interface, I'll attach it as a text file as well.
        >>>    > >
        >>>    > >
        >>>    > > 10/05/2015 20:12:56 | LHC@home 1.0 | Computation for task sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1 finished
        >>>    > > 10/05/2015 20:12:56 | CMS-dev | Starting task CMS_31107_1427806626.783437_0
        >>>    > > 10/05/2015 20:12:56 | CMS-dev | [cpu_sched] Starting task CMS_31107_1427806626.783437_0 using CMS version 4615 (vbox64) in slot 7
        >>>    > > 10/05/2015 20:14:25 | climateprediction.net | Task hadam3p_anz_e3g7_2013_1_009760406_0 exited with zero status but no 'finished' file
        >>>    > > 10/05/2015 20:14:25 | climateprediction.net | If this happens repeatedly you may need to reset the project.
        >>>    > > 10/05/2015 20:14:25 | NumberFields@home | Task wu_sf3_DS-10x271_Grp503196of682667_0 exited with zero status but no 'finished' file
        >>>    > > 10/05/2015 20:14:25 | NumberFields@home | If this happens repeatedly you may need to reset the project.
        >>>    > > 10/05/2015 20:14:25 | SETI@home | Task 05jl12ab.3911.10292.438086664199.12.207_1 exited with zero status but no 'finished' file
        >>>    > > 10/05/2015 20:14:25 | SETI@home | If this happens repeatedly you may need to reset the project.
        >>>    > > 10/05/2015 20:14:25 | climateprediction.net | [cpu_sched] Restarting task hadam3p_anz_e3g7_2013_1_009760406_0 using hadam3p_anz version 610 in slot 5
        >>>    > > 10/05/2015 20:14:25 | NumberFields@home | [cpu_sched] Restarting task wu_sf3_DS-10x271_Grp503196of682667_0 using GetDecics version 200 in slot 0
        >>>    > > 10/05/2015 20:14:25 | SETI@home | [cpu_sched] Restarting task 05jl12ab.3911.10292.438086664199.12.207_1 using setiathome_v7 version 700 (cuda42) in slot 2
        >>>    > > 10/05/2015 20:14:27 | LHC@home 1.0 | Started upload of sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1_0
        >>>    > > 10/05/2015 20:14:30 | LHC@home 1.0 | Finished upload of sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1_0
        >>>    > >
        >>>    > >
        >>>    > >
        >>>    > >
        >>>    > >
        >>>    > > On Sunday, 10 May 2015, 19:59, Seke Rob <onec...@hotmail.com> wrote:
        >>>    > >
        >>>    > >
        >>>    > >
        >>>    > >    Excellent this is all fixed and tested. Interest is/was that WCG's
        >>>    > >    Clean Energy at some point in time was to run very large models,
        >>>    > >    talk of 4-8GB IIRC.
        >>>    > >
        >>>    > >    --SekeRob
        >>>    > >
        >>>    > >    On May 10, 2015 20:27, Richard Haselgrove
        >>>    > >    <r.haselgr...@btopenworld.com> wrote:
        >>>    > >    CMS only has stock applications configured for delivery to 64-bit
        >>>    > >    platforms. I've made an anonymous platform configuration using the
        >>>    > >    32-bit VBox Windows wrapper: it has downloaded and is running its
        >>>    > >    first 1-hour task. If that completes successfully (it seems to have
        >>>    > >    reached the fully-operational stage), I'll try a full 24-hour task,
        >>>    > >    which under current operational circumstances should generate a
        >>>    > >    >4 GB file locally.
        >>>    > >
        >>>    > >
        >>>    > >        On Sunday, 10 May 2015, 18:28, David Anderson
        >>>    > >        <da...@ssl.berkeley.edu> wrote:
        >>>    > >
        >>>    > >
        >>>    > >
        >>>    > >    NTFS handles > 4GB files, even if the hardware and/or OS is only
        >>>    > >    32-bit. 32-bit versions of Windows have APIs (like _stat64()) for
        >>>    > >    handling > 4GB files. BOINC needs to use these; we fixed one place
        >>>    > >    where it wasn't.
        >>>    > >
        >>>    > >    On Unix (Linux and Mac), BOINC uses the regular APIs (like lseek())
        >>>    > >    but is built with a -D_FILE_OFFSET_BITS=64 flag that causes these
        >>>    > >    functions to use 64-bit sizes. However, it's possible that BOINC
        >>>    > >    has bugs involving > 4GB files on Unix too.
        >>>    > >    If anyone has a 32-bit Linux system, please test with the CMS
        >>>    > >    project.
        >>>    > >
        >>>    > >    -- David
        >>>    > >
        >>>    > >    On 10-May-2015 3:58 AM, --SekeRob wrote:
        >>>    > >    >
        >>>    > >    > Just wondering, with files over 4GB and a 64 bit lib introduced,
        >>>    > >    > is it not a CMS project requirement to run on a 64 bit OS?
        >>>    > >    >
        >>>    > >    >
        >>>    > >
        >>>    > > _______________________________________________
        >>>    > >    boinc_alpha mailing list
        >>>    > > boinc_al...@ssl.berkeley.edu
        >>>    > > http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_alpha
        >>>    > >    To unsubscribe, visit the above URL and
        >>>    > >    (near bottom of page) enter your email address.
        >>>
        >>>    > >
        >>>    > >
        >>>    > >
        >>>    > >
        >>>    > >
        >>>    > >
        >>>    >
        >>>
        >>>
        >>>
        >>
        >
        >
        >
        >
        

        _______________________________________________
        boinc_dev mailing list
        boinc_dev@ssl.berkeley.edu
        http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev

        To unsubscribe, visit the above URL and
        (near bottom of page) enter your email address.


