Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories without ensuring they're empty

Richard Haselgrove Sat, 16 May 2015 08:05:07 -0700

Here is the message log file for a GPUGrid task finish. The 12-second delay 
appears again between 14:26:35 and 14:26:47 - that's after the slot directory 
has been cleared, and the exiting task has changed state from 'running' to 
'uploading'. Two new tasks have been assigned to the GPU, but their (small) 
startup files have not yet been linked to their respective slot directories.

I also attach directory listings for the slot and GPUGrid project folders at 
various stages of the cleanup: the slot held 34 files totalling 44,186,727 
bytes, which doesn't sound excessive. The largest file deletion (94,783,960 
bytes) occurred several minutes later, when that file finished uploading.

I'll enable similar logging and watch what happens when the next GPUGrid task 
starts up, but from memory, the disruption to BOINC is less severe at startup.
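
A stall of this kind can also be located mechanically. What follows is a
minimal Python sketch, not part of BOINC, that scans event-log lines in the
'DD/MM/YYYY HH:MM:SS | project | message' layout quoted in this thread and
reports any interval between consecutive timestamps that exceeds a threshold.
The function name find_gaps and the 10-second default threshold are
assumptions made for illustration; the sample lines are taken from the
10 May log quoted further down.

```python
from datetime import datetime


def find_gaps(lines, threshold_seconds=10):
    """Return (previous_time, next_time, gap_seconds) for each long pause.

    find_gaps and its default threshold are illustrative, not BOINC names.
    """
    gaps = []
    previous = None
    for line in lines:
        # The timestamp is everything before the first " | " separator.
        stamp = line.split(" | ", 1)[0].strip()
        try:
            current = datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")
        except ValueError:
            continue  # wrapped or continuation line without a timestamp
        if previous is not None:
            gap = (current - previous).total_seconds()
            if gap > threshold_seconds:
                gaps.append((previous, current, gap))
        previous = current
    return gaps


# Sample lines copied from the 10 May 2015 log in this thread.
sample = [
    "10/05/2015 20:12:56 | CMS-dev | Starting task CMS_31107_1427806626.783437_0",
    "10/05/2015 20:14:25 | climateprediction.net | Task "
    "hadam3p_anz_e3g7_2013_1_009760406_0 exited with zero status but no "
    "'finished' file",
    "10/05/2015 20:14:25 | climateprediction.net | If this happens repeatedly "
    "you may need to reset the project.",
]

for before, after, seconds in find_gaps(sample):
    print(f"{before:%H:%M:%S} -> {after:%H:%M:%S}: {seconds:.0f} s")
```

On the sample lines this reports a single gap of 89 seconds, between 20:12:56
and 20:14:25, which matches the "about 90 seconds" copy described in the
10 May message.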



On Tuesday, 12 May 2015, 23:29, David Anderson <da...@ssl.berkeley.edu> wrote:

 BTW: the client isn't completely single-threaded;
it uses a separate thread to do CPU throttling.
It would be feasible to also use separate threads
for serving GUI RPC connections,
which would allow the client to remain responsive even while
e.g. copying thousands of files to a slot dir.
-- David
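
The idea can be illustrated with a small Python sketch. It is an assumption-
laden toy, not the BOINC client's C++ code: one thread answers queued
requests (standing in for GUI RPC traffic) while the main thread copies many
small files into a temporary "slot" directory. All names here
(serve_requests, copy_to_slot, demo, and the ping_1 / ping_2 request labels)
are hypothetical.

```python
import queue
import shutil
import tempfile
import threading
from pathlib import Path


def serve_requests(requests, replies, stop):
    """Answer queued requests until asked to stop (stand-in for an RPC thread)."""
    while not stop.is_set():
        try:
            request = requests.get(timeout=0.05)
        except queue.Empty:
            continue
        replies.append(f"reply:{request}")
        requests.task_done()


def copy_to_slot(src_dir, slot_dir):
    """Copy every regular file from src_dir into slot_dir; return the count."""
    copied = 0
    for src in sorted(Path(src_dir).iterdir()):
        if src.is_file():
            shutil.copy2(src, Path(slot_dir) / src.name)
            copied += 1
    return copied


def demo(file_count=50):
    requests, replies, stop = queue.Queue(), [], threading.Event()
    server = threading.Thread(target=serve_requests, args=(requests, replies, stop))
    server.start()
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as slot:
        for i in range(file_count):
            (Path(src) / f"input_{i}.dat").write_bytes(b"x" * 1024)
        # Requests arrive just before the main thread starts its long copy.
        for name in ("ping_1", "ping_2"):
            requests.put(name)
        copied = copy_to_slot(src, slot)
    requests.join()   # wait until every queued request has been answered
    stop.set()
    server.join()
    return copied, replies


if __name__ == "__main__":
    print(demo())
```

The server thread never waits on the copy, so requests are answered
independently of how long the main thread spends on file operations. The same
separation is what a dedicated GUI RPC thread would provide, although a real
client would also need to protect any state shared between the threads.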

On 12-May-2015 2:40 AM, Seke Rob wrote:
> Reminds me of the Clean Energy Project, Phase 2, and why we have app_config
> and <max_concurrent> and a default control allowing 1 'In Progress' on a host.
> This project sets up its slot by copying nearly 6700 files [symlinking was
> proposed long ago, as is done on several other WCG projects for the static
> files]. If more than one CEP2 is started, the machine feels at times like a
> snail and the responsiveness of the BOINC manager is poor, with the less
> powerful systems many a time incurring 'exit zero status' errors or total
> failure. On an 8-core machine it has been observed to take over an hour before
> actual computing commenced [CPU time logged]. A boot cycle requires manually
> starting tasks one by one. Kevin Reed raised a ticket a few years ago for
> staggered starting, as the models can reach several GB and will get bigger in
> future. At any rate, just as these 6700 files are copied, they also need
> deleting at completion [physical or symlink references]. Starting 1 CEP2 and
> finishing / packaging / zipping and transmitting can easily lead to several
> minutes without any computing, just whirring, with only elapsed time being
> logged. The more that run, the more the issue compounds, producing what many
> incur: a series of exit-zero-status errors, resetting to the start or the last
> checkpoint, often with hours of computing time lost.
>
> Maybe you'd like to get in touch with your confederates at WCG [Keith
> Uplinger] to discuss the issue further, as this is now nearing a 5-year
> continuous frustration [June 2010 launch, and a huge limitation on the speed
> of progress on this project].
>
> --SekeRob.
>
> On 12-5-2015 1:55, David Anderson wrote:
>> That delay looks like it's caused by deleting files or by process cleanup.
>> Does GPUGrid make lots of (non-output) files in the slot dir?
>>
>> Please try to repro it with slot_debug, task_debug, and heartbeat_debug set
>> (gui_rpc_debug not needed).
>>
>> -- David
>>
>> On 11-May-2015 10:54 AM, Richard Haselgrove wrote:
>>> Here's another example of a case where BOINC finds that it can't walk and chew
>>> gum at the same time. The event of interest is
>>>
>>> 11/05/2015 18:35:34 | GPUGRID | Computation for task e10s9_e7s6f4-GERARD_FXCXCL12_LIG_6282622-0-1-RND7898_0 finished
>>>
>>> Following that, there's a 12-second interval where neither heartbeats nor GUI
>>> RPC traffic was logged: during that time, the Task tab of the Manager was
>>> unchanging, not showing the regular update of elapsed time for running tasks.
>>>
>>> async_file_debug was active at the time, but found no events to log.
>>>
>>> These particular GPUGrid tasks generate around 90 MB of upload files, but I
>>> think they are generated directly in the project folder and don't need to be
>>> copied anywhere.
>>>
>>> Main log as attached file only.
>>>
>>> I'll catch a CMS-dev log later this evening, but after that, I'll be away for a
>>> few days and I'll have to leave the bug-chase until the weekend.
>>>
>>>
>>>
>>>
>>> On Monday, 11 May 2015, 9:42, Jacob Klein <jacob_w_kl...@msn.com> wrote:
>>>
>>>
>>>
>>>    I have seen this problem before, where the UI becomes unresponsive. If I
>>>    recall, it happens when a T4T task is being set up (ie: after everything was
>>>    downloaded). For me, I don't recall the problem ever "screwing over other
>>>    tasks", though.
>>>
>>>    Try this to reproduce it: Attach to T4T, and get a task. It may take a while
>>>    to do that download, so you can "step away" for a bit. Then, once that task
>>>    is going, abort it. Downloading the 2nd task should be instantaneous
>>>    (nothing really to download), but instantiation of that 2nd task should
>>>    cause the UI to hang (showing the "Please wait" messagebox in the manager).
>>>
>>>    Does that help?
>>>    > Date: Sun, 10 May 2015 23:19:24 -0700
>>>    > From: da...@ssl.berkeley.edu
>>>    > To: r.haselgr...@btopenworld.com; onec...@hotmail.com
>>>    > CC: boinc_al...@ssl.berkeley.edu
>>>    > Subject: Re: [boinc_alpha] BOINC re-using slot directories without ensuring they're empty
>>>    >
>>>    > I did some initial testing and couldn't repro this;
>>>    > the client remains responsive while copying a 5 GB file to a slot dir.
>>>    > Does anyone else see this behavior?
>>>    >
>>>    > While testing this, please set "async_file_debug" log flag.
>>>    > This says when asynchronous file operations start and end.
>>>    >
>>>    > -- David
>>>    >
>>>    > On 10-May-2015 12:31 PM, Richard Haselgrove wrote:
>>>    > > One thing that may need attention if very large files become the norm is the
>>>    > > single-threaded nature of some parts of the core client. My 1-hour CMS test has
>>>    > > just finished, and a new 24-hour test started.
>>>    > >
>>>    > >
>>>    > > I watched this happening, and part of the process is copying a 1.33 GB initial
>>>    > > .vmi image file (downloaded previously by BOINC from CERN) from the project
>>>    > > directory to the slot directory. This took about 90 seconds: during that time, all
>>>    > > Manager updating stopped. I'm sure it's the copying process which inhibited
>>>    > > updates: I was watching the slot directory, and the .vmi image file had appeared,
>>>    > > but other essential startup files hadn't.
>>>    > >
>>>    > >
>>>    > > When BOINC regained its ability to communicate, three running tasks had exited
>>>    > > with the dreaded (and false) 'you may need to reset the project' advice. Inline
>>>    > > log follows: because my last log got mangled by my ISP's new mail interface, I'll
>>>    > > attach it as a text file as well.
>>>    > >
>>>    > >
>>>    > > 10/05/2015 20:12:56 | LHC@home 1.0 | Computation for task sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1 finished
>>>    > > 10/05/2015 20:12:56 | CMS-dev | Starting task CMS_31107_1427806626.783437_0
>>>    > > 10/05/2015 20:12:56 | CMS-dev | [cpu_sched] Starting task CMS_31107_1427806626.783437_0 using CMS version 4615 (vbox64) in slot 7
>>>    > > 10/05/2015 20:14:25 | climateprediction.net | Task hadam3p_anz_e3g7_2013_1_009760406_0 exited with zero status but no 'finished' file
>>>    > > 10/05/2015 20:14:25 | climateprediction.net | If this happens repeatedly you may need to reset the project.
>>>    > > 10/05/2015 20:14:25 | NumberFields@home | Task wu_sf3_DS-10x271_Grp503196of682667_0 exited with zero status but no 'finished' file
>>>    > > 10/05/2015 20:14:25 | NumberFields@home | If this happens repeatedly you may need to reset the project.
>>>    > > 10/05/2015 20:14:25 | SETI@home | Task 05jl12ab.3911.10292.438086664199.12.207_1 exited with zero status but no 'finished' file
>>>    > > 10/05/2015 20:14:25 | SETI@home | If this happens repeatedly you may need to reset the project.
>>>    > > 10/05/2015 20:14:25 | climateprediction.net | [cpu_sched] Restarting task hadam3p_anz_e3g7_2013_1_009760406_0 using hadam3p_anz version 610 in slot 5
>>>    > > 10/05/2015 20:14:25 | NumberFields@home | [cpu_sched] Restarting task wu_sf3_DS-10x271_Grp503196of682667_0 using GetDecics version 200 in slot 0
>>>    > > 10/05/2015 20:14:25 | SETI@home | [cpu_sched] Restarting task 05jl12ab.3911.10292.438086664199.12.207_1 using setiathome_v7 version 700 (cuda42) in slot 2
>>>    > > 10/05/2015 20:14:27 | LHC@home 1.0 | Started upload of sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1_0
>>>    > > 10/05/2015 20:14:30 | LHC@home 1.0 | Finished upload of sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1_0
>>>    > >
>>>    > >
>>>    > >
>>>    > >
>>>    > >
>>>    > > On Sunday, 10 May 2015, 19:59, Seke Rob <onec...@hotmail.com> wrote:
>>>    > >
>>>    > >
>>>    > >
>>>    > >    Excellent that this is all fixed and tested. Of interest is/was that WCG's
>>>    > >    Clean Energy was at some point in time to run very large models, talk of
>>>    > >    4-8 GB IIRC.
>>>    > >
>>>    > >    --SekeRob
>>>    > >
>>>    > >    On May 10, 2015 20:27, Richard Haselgrove <r.haselgr...@btopenworld.com> wrote:
>>>    > >    CMS only has stock applications configured for delivery to 64-bit platforms.
>>>    > >    I've made an anonymous platform configuration using the 32-bit VBox Windows
>>>    > >    wrapper: it has downloaded and is running its first 1-hour task. If that
>>>    > >    completes successfully (it seems to have reached the fully-operational stage),
>>>    > >    I'll try a full 24-hour task, which under current operational circumstances
>>>    > >    should generate a >4 GB file locally.
>>>    > >
>>>    > >
>>>    > >        On Sunday, 10 May 2015, 18:28, David Anderson <da...@ssl.berkeley.edu> wrote:
>>>    > >
>>>    > >
>>>    > >
>>>    > >    NTFS handles > 4GB files, even if the hardware and/or OS is only 32-bit.
>>>    > >    32-bit versions of Windows have APIs (like _stat64()) for handling > 4GB files.
>>>    > >    BOINC needs to use these; we fixed one place where it wasn't.
>>>    > >
>>>    > >    On Unix (Linux and Mac), BOINC uses the regular APIs (like lseek()) but is
>>>    > >    built with a -D_FILE_OFFSET_BITS=64 flag that causes these functions to use
>>>    > >    64-bit file sizes and offsets.
>>>    > >    However, it's possible that BOINC has bugs involving > 4GB files on Unix too.
>>>    > >    If anyone has a 32-bit Linux system, please test with the CMS project.
>>>    > >    -- David
>>>    > >
>>>    > >    On 10-May-2015 3:58 AM, --SekeRob wrote:
>>>    > >    >
>>>    > >    > Just wondering, with files over 4GB and a 64 bit lib introduced, is it not a CMS
>>>    > >    > project requirement to run on a 64 bit OS?
>>>    > >    >
>>>    > >    >
>>>    > >
>>>    > > _______________________________________________
>>>    > >    boinc_alpha mailing list
>>>    > > boinc_al...@ssl.berkeley.edu
>>>    > > http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_alpha
>>>    > >    To unsubscribe, visit the above URL and
>>>    > >    (near bottom of page) enter your email address.
>>>
>>>    > >
>>>    > >
>>>    > >
>>>    > >
>>>    > >
>>>    > >
>>>    >
>>>
>>>
>>>
>>
>
>
>
>
>

_______________________________________________
boinc_dev mailing list
boinc_dev@ssl.berkeley.edu
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.


 
  