Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories without ensuring they're empty

Richard Haselgrove Mon, 18 May 2015 13:31:57 -0700

I'll give that a try. The job delay seems reproducible - I was expecting it 
from previous experience, which was why I set up logging in the first place. 
Two problems: with these big 11-hour to 17-hour jobs, they don't always finish 
at convenient times - the next one is due around 2 in the morning. And I'm not 
experienced enough with the process tools to be sure of getting the right 
filter at the first attempt. If any lurkers on this thread have any suggestions 
on what capture to enable, that would help - my personal suspicion is that we 
may have to watch the CUDA/driver runtime processes too, but maybe we'll try 
that later.
(I don't personally suspect DNS/libcurl, because it doesn't happen with jobs 
from all projects)



     On Monday, 18 May 2015, 20:57, David Anderson <da...@ssl.berkeley.edu> wrote:
   
 

  That looks like what's needed.
 Richard, if you can repro the inter-job delay,
 you could try using Process Monitor to capture as much
 as possible from the client during that period.
 -- David
 
 On 18-May-2015 11:12 AM, Jacob Klein wrote:
  
 Process Monitor can be used to "watch the things a process does" (you have to
 set up the correct filters, etc.), but I'm not sure whether that includes
 sleeps. If the process is waiting on a file or something, though, it should be
 able to tell you. Worth looking into.
 
 https://technet.microsoft.com/en-us/library/bb896645.aspx
 
 Regards,
 Jacob
 
 
  Date: Mon, 18 May 2015 10:41:16 -0700
 From: da...@ssl.berkeley.edu
 To: r.haselgr...@btopenworld.com; onec...@hotmail.com; jacob_w_kl...@msn.com
 CC: boinc_dev@ssl.berkeley.edu
 Subject: Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories without ensuring they're empty
 
 I looked at this and couldn't figure out the source of the 12-sec delay.
 In general, delays could happen because
 1) the client does something that takes a long time (like copying a 5 GB file)
 2) the client sleeps (i.e. calls boinc_sleep()).
    It does this in a few situations,
    like backing off and retrying a file system operation.
 But there's no indication that either of these is happening here.
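
 To illustrate case 2, here is a minimal Python sketch of how a back-off-and-retry
 loop around a file system operation can add up to a multi-second stall. It is
 not the BOINC client's code: retry_with_backoff, max_tries and first_delay are
 hypothetical names chosen for this example, and the doubling schedule is an
 assumption.

```python
import time


def retry_with_backoff(op, max_tries=5, first_delay=0.5, sleep=time.sleep):
    """Call op() until it succeeds, sleeping between failed attempts.

    Returns (result, total_seconds_slept). The delay doubles after each
    failure; this schedule is an assumption, not BOINC's actual policy.
    """
    delay = first_delay
    slept = 0.0
    for attempt in range(max_tries):
        try:
            return op(), slept
        except OSError:
            if attempt == max_tries - 1:
                raise  # out of attempts: let the caller see the error
            sleep(delay)  # the calling thread does nothing else meanwhile
            slept += delay
            delay *= 2
```

 With these assumed defaults, an operation that fails three times before
 succeeding keeps the calling thread asleep for 0.5 + 1 + 2 = 3.5 seconds, so a
 handful of retries is enough to account for a pause of several seconds.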
 
 Does Windows have a way of logging the system calls that a process makes
 (like strace on Unix)?
 If so that might reveal what the client is doing during those 12 seconds.
 
 -- David
 
 On 16-May-2015 8:01 AM, Richard Haselgrove wrote:
  
  Here is the message log file for a GPUGrid task finish. The 12-second delay 
appears again between 14:26:35 and 14:26:47 - that's after the slot directory 
has been cleared, and the exiting task has changed state from 'running' to 
'uploading'. Two new tasks have been assigned to the GPU, but their (small) 
startup files have not yet been linked to their respective slot directories. 
  I also attach directory listings for the slot and GPUGrid project folders at 
various stages of the cleanup: the slot held 34 files totalling 44,186,727 
bytes, which doesn't sound excessive: the largest file deletion (94,783,960 
bytes) occurred several minutes later, when that file finished uploading. 
  I'll enable similar logging and watch what happens when the next GPUGrid task 
starts up, but from memory, the disruption to BOINC is less severe at startup.  
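
 A gap like the one between 14:26:35 and 14:26:47 can also be found mechanically.
 The following is a minimal Python sketch, assuming the message log uses the
 "DD/MM/YYYY HH:MM:SS | project | message" layout seen in the logs quoted in
 this thread; find_gaps and threshold_s are hypothetical names for this example.

```python
from datetime import datetime


def find_gaps(lines, threshold_s=10):
    """Return (previous, current, seconds) for each pair of consecutive
    timestamped log lines that are at least threshold_s seconds apart."""
    gaps = []
    prev = None
    for line in lines:
        stamp = line.split(" | ", 1)[0].strip()
        try:
            t = datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")
        except ValueError:
            continue  # wrapped continuation line or other non-log text
        if prev is not None:
            delta = (t - prev).total_seconds()
            if delta >= threshold_s:
                gaps.append((prev, t, delta))
        prev = t
    return gaps
```

 Called as find_gaps(open("log.txt")), where log.txt is a placeholder for a
 saved copy of the message log, it lists every silent interval of 10 seconds or
 more, which makes it easier to compare task finish and task startup.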
 
 
       On Tuesday, 12 May 2015, 23:29, David Anderson <da...@ssl.berkeley.edu> wrote:
   
 
 
 BTW: the client isn't completely single-threaded;
 it uses a separate thread to do CPU throttling.
 It would be feasible to also use separate threads
 for serving GUI RPC connections,
 which would allow the client to remain responsive even while
 e.g. copying thousands of files to a slot dir.
 -- David
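
 The idea of serving GUI RPCs from a separate thread can be sketched in a few
 lines of Python. This is only an illustration of the threading pattern, not
 the client's design: rpc_server and copy_with_responsive_rpc are hypothetical
 names, and time.sleep() stands in for both answering a request and copying a
 file.

```python
import threading
import time


def rpc_server(stop, answered):
    """Stand-in for a GUI RPC thread: 'answers' a request every 5 ms."""
    while not stop.is_set():
        answered.append(time.monotonic())
        time.sleep(0.005)


def copy_with_responsive_rpc(n_files=20):
    """Run a slow copy on the calling thread while RPCs are served on another.

    Returns the number of requests answered during the copy.
    """
    stop = threading.Event()
    answered = []
    server = threading.Thread(target=rpc_server, args=(stop, answered))
    server.start()
    for _ in range(n_files):
        time.sleep(0.005)  # stand-in for copying one file to the slot dir
    stop.set()
    server.join()
    return len(answered)
```

 Because the copy loop no longer has to return before requests are answered,
 the count comes back non-zero even though the calling thread is busy for the
 whole period.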
 
 On 12-May-2015 2:40 AM, Seke Rob wrote:
 > Reminds me of the Clean Energy Project, Phase 2, and why we have app_config
 > and <max_concurrent>, and a default control allowing 1 'In Progress' task on
 > a host. This project sets up its slot by copying nearly 6700 files
 > [symlinking the static files was proposed long ago, as is done on several
 > other WCG projects]. If more than one CEP2 task is started, the machine at
 > times feels like a snail and the BOINC Manager responds poorly; on less
 > powerful systems, tasks often incur 'exit zero status' errors or fail
 > outright. On an 8-core machine I observed that it could take over an hour
 > before actual computing commenced [CPU time logged]. A boot cycle requires
 > starting tasks manually, one by one. Kevin Reed raised a ticket a few years
 > ago for staggered starting, since the models can reach several GB and will
 > grow bigger in future. At any rate, just as these 6700 files are copied,
 > they also need deleting at completion [physical files or symlink
 > references]. The effect of starting one CEP2 task, and of finishing /
 > packaging / zipping and transmitting one, can easily be several minutes
 > without any computing, just whirring, with only elapsed time being logged.
 > The more tasks run, the more the issue compounds, with the effect many users
 > incur: the series of 'exit zero status' errors, resetting to the start or
 > the last checkpoint, often with hours of computing time lost.
 >
 > Maybe you'd like to get in touch with your confederates at WCG [Keith
 > Uplinger] to discuss the issue further, as this is now nearing 5 years of
 > continuous frustration [June 2010 launch, and a huge limitation on the speed
 > of progress on this project].
 >
 > --SekeRob.
 >
 > On 12-5-2015 1:55, David Anderson wrote:
 >> That delay looks like it's caused by deleting files or by process cleanup.
 >> Does GPUGrid make lots of (non-output) files in the slot dir?
 >>
 >> Please try to repro it with slot_debug, task_debug, and heartbeat_debug set
 >> (gui_rpc_debug not needed).
 >>
 >> -- David
 >>
 >> On 11-May-2015 10:54 AM, Richard Haselgrove wrote:
 >>> Here's another example of a case where BOINC finds that it can't walk and
 >>> chew gum at the same time. The event of interest is
 >>>
 >>> 11/05/2015 18:35:34 | GPUGRID | Computation for task e10s9_e7s6f4-GERARD_FXCXCL12_LIG_6282622-0-1-RND7898_0 finished
 >>>
 >>> Following that, there's a 12-second interval where neither heartbeats nor
 >>> GUI RPC traffic was logged: during that time, the Task tab of the Manager
 >>> was unchanging, not showing the regular update of elapsed time for running
 >>> tasks.
 >>>
 >>> async_file_debug was active at the time, but found no events to log.
 >>>
 >>> These particular GPUGrid tasks generate around 90 MB of upload files, but I
 >>> think they are generated directly in the project folder and don't need to
 >>> be copied anywhere.
 >>>
 >>> Main log as attached file only.
 >>>
 >>> I'll catch a CMS-dev log later this evening, but after that, I'll be away
 >>> for a few days and I'll have to leave the bug-chase until the weekend.
 >>>
 >>>
 >>>
 >>>
 >>> On Monday, 11 May 2015, 9:42, Jacob Klein <jacob_w_kl...@msn.com> wrote:
 >>>
 >>>
 >>>
 >>>    I have seen this problem before, where the UI becomes unresponsive. If I
 >>>    recall, it happens when a T4T task is being set up (ie: after everything
 >>>    was downloaded). For me, I don't recall the problem ever "screwing over
 >>>    other tasks", though.
 >>>
 >>>    Try this to reproduce it: Attach to T4T, and get a task. It may take a
 >>>    while to do that download, so you can "step away" for a bit. Then, once
 >>>    that task is going, abort it. Downloading the 2nd task should be
 >>>    instantaneous (nothing really to download), but instantiation of that 2nd
 >>>    task should cause the UI to hang (showing the "Please wait" messagebox in
 >>>    the manager).
 >>>
 >>>    Does that help?
 >>>    > Date: Sun, 10 May 2015 23:19:24 -0700
 >>>    > From: da...@ssl.berkeley.edu
 >>>    > To: r.haselgr...@btopenworld.com; onec...@hotmail.com
 >>>    > CC: boinc_al...@ssl.berkeley.edu
 >>>    > Subject: Re: [boinc_alpha] BOINC re-using slot directories without ensuring they're empty
 >>>    >
 >>>    > I did some initial testing and couldn't repro this;
 >>>    > the client remains responsive while copying a 5 GB file to a slot dir.
 >>>    > Does anyone else see this behavior?
 >>>    >
 >>>    > While testing this, please set "async_file_debug" log flag.
 >>>    > This says when asynchronous file operations start and end.
 >>>    >
 >>>    > -- David
 >>>    >
 >>>    > On 10-May-2015 12:31 PM, Richard Haselgrove wrote:
 >>>    > > One thing that may need attention if very large files become the norm
 >>>    > > is the single-threaded nature of some parts of the core client. My
 >>>    > > 1-hour CMS test has just finished, and a new 24-hour test started.
 >>>    > >
 >>>    > >
 >>>    > > I watched this happening, and part of the process is copying a 1.33 GB
 >>>    > > initial .vmi image file (downloaded previously by BOINC from CERN) from
 >>>    > > the project directory to the slot directory. This took about 90
 >>>    > > seconds: during that time, all Manager updating stopped. I'm sure it's
 >>>    > > the copying process which inhibited updates: I was watching the slot
 >>>    > > directory, and the .vmi image file had appeared, but other essential
 >>>    > > startup files hadn't.
 >>>    > >
 >>>    > >
 >>>    > > When BOINC regained its ability to communicate, three running tasks had
 >>>    > > exited with the dreaded (and false) 'you may need to reset the project'
 >>>    > > advice. Inline log follows: because my last log got mangled by my ISP's
 >>>    > > new mail interface, I'll attach it as a text file as well.
 >>>    > >
 >>>    > >
 >>>    > > 10/05/2015 20:12:56 | LHC@home 1.0 | Computation for task sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1 finished
 >>>    > > 10/05/2015 20:12:56 | CMS-dev | Starting task CMS_31107_1427806626.783437_0
 >>>    > > 10/05/2015 20:12:56 | CMS-dev | [cpu_sched] Starting task CMS_31107_1427806626.783437_0 using CMS version 4615 (vbox64) in slot 7
 >>>    > > 10/05/2015 20:14:25 | climateprediction.net | Task hadam3p_anz_e3g7_2013_1_009760406_0 exited with zero status but no 'finished' file
 >>>    > > 10/05/2015 20:14:25 | climateprediction.net | If this happens repeatedly you may need to reset the project.
 >>>    > > 10/05/2015 20:14:25 | NumberFields@home | Task wu_sf3_DS-10x271_Grp503196of682667_0 exited with zero status but no 'finished' file
 >>>    > > 10/05/2015 20:14:25 | NumberFields@home | If this happens repeatedly you may need to reset the project.
 >>>    > > 10/05/2015 20:14:25 | SETI@home | Task 05jl12ab.3911.10292.438086664199.12.207_1 exited with zero status but no 'finished' file
 >>>    > > 10/05/2015 20:14:25 | SETI@home | If this happens repeatedly you may need to reset the project.
 >>>    > > 10/05/2015 20:14:25 | climateprediction.net | [cpu_sched] Restarting task hadam3p_anz_e3g7_2013_1_009760406_0 using hadam3p_anz version 610 in slot 5
 >>>    > > 10/05/2015 20:14:25 | NumberFields@home | [cpu_sched] Restarting task wu_sf3_DS-10x271_Grp503196of682667_0 using GetDecics version 200 in slot 0
 >>>    > > 10/05/2015 20:14:25 | SETI@home | [cpu_sched] Restarting task 05jl12ab.3911.10292.438086664199.12.207_1 using setiathome_v7 version 700 (cuda42) in slot 2
 >>>    > > 10/05/2015 20:14:27 | LHC@home 1.0 | Started upload of sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1_0
 >>>    > > 10/05/2015 20:14:30 | LHC@home 1.0 | Finished upload of sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1_0
 >>>    > >
 >>>    > >
 >>>    > >
 >>>    > >
 >>>    > >
 >>>    > > On Sunday, 10 May 2015, 19:59, Seke Rob <onec...@hotmail.com> wrote:
 >>>    > >
 >>>    > >
 >>>    > >
 >>>    > >    Excellent this is all fixed and tested. Interest is/was that WCG's
 >>>    > >    Clean Energy at some point in time was to run very large models,
 >>>    > >    talk of 4-8GB IIRC.
 >>>    > >
 >>>    > >    --SekeRob
 >>>    > >
 >>>    > >    On May 10, 2015 20:27, Richard Haselgrove <r.haselgr...@btopenworld.com> wrote:
 >>>    > >    CMS only has stock applications configured for delivery to 64-bit
 >>>    > >    platforms. I've made an anonymous platform configuration using the
 >>>    > >    32-bit VBox Windows wrapper: it has downloaded and is running its
 >>>    > >    first 1-hour task. If that completes successfully (it seems to have
 >>>    > >    reached the fully-operational stage), I'll try a full 24-hour task,
 >>>    > >    which under current operational circumstances should generate a >4 GB
 >>>    > >    file locally.
 >>>    > >
 >>>    > >
 >>>    > >        On Sunday, 10 May 2015, 18:28, David Anderson <da...@ssl.berkeley.edu> wrote:
 >>>    > >
 >>>    > >
 >>>    > >
 >>>    > >    NTFS handles > 4GB files, even if the hardware and/or OS is only 32-bit.
 >>>    > >    32-bit versions of Windows have APIs (like _stat64()) for handling > 4GB files.
 >>>    > >    BOINC needs to use these; we fixed one place where it wasn't.
 >>>    > >
 >>>    > >    On Unix (Linux and Mac), BOINC uses the regular APIs (like lseek())
 >>>    > >    but is built with a -D_FILE_OFFSET_BITS=64 flag that causes these
 >>>    > >    functions to use 64-bit file sizes and offsets.
 >>>    > >    However, it's possible that BOINC has bugs involving > 4GB files on Unix too.
 >>>    > >    If anyone has a 32-bit Linux system, please test with the CMS project.
 >>>    > >
 >>>    > >    -- David
 >>>    > >
 >>>    > >    On 10-May-2015 3:58 AM, --SekeRob wrote:
 >>>    > >    >
 >>>    > >    > Just wondering, with files over 4GB and a 64 bit lib introduced, is
 >>>    > >    > it not a CMS project requirement to run on a 64 bit OS?
 >>>    > >    >
 >>>    > >    >
 >>>    > >
 >>>    > > _______________________________________________
 >>>    > >    boinc_alpha mailing list
 >>>    > > boinc_al...@ssl.berkeley.edu
 >>>    > > http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_alpha
 >>>    > >    To unsubscribe, visit the above URL and
 >>>    > >    (near bottom of page) enter your email address.
 >>>
 >>>    > >
 >>>    > >
 >>>    > >
 >>>    > >
 >>>    > >
 >>>    > >
 >>>    >
 >>>
 >>>
 >>>
 >>
 >
 >
 >
 
 _______________________________________________
 boinc_dev mailing list
 boinc_dev@ssl.berkeley.edu
 http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev 
 To unsubscribe, visit the above URL and
 (near bottom of page) enter your email address.
  
 
  
     
 
   
 
 

 
  