> On Mar 8, 2022, at 3:35 PM, Ole Tange <o...@tange.dk> wrote:
>
> On Tue, Mar 8, 2022 at 11:22 PM Anderson, Stuart B. <s...@caltech.edu> wrote:
>>
>> parallel version 20190922 from EPEL 8, running on a Rocky Linux 8 system,
>> occasionally gets into an I/O spin loop, writing 8193 bytes of "x" to a
>> deleted TMPDIR file and then immediately truncating it, e.g.,
>
> This is by design:
>
> https://www.gnu.org/software/parallel/parallel_design.html#disk-full
Is there any reason not to call statvfs() to check whether there is free disk space?
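For example (a minimal sketch, assuming the CPAN module Filesys::Df,
which wraps statvfs(), would be an acceptable dependency; the threshold
and names are just illustrative):

    use Filesys::Df;   # CPAN wrapper around statvfs()

    # Return true if $TMPDIR has fewer than $min_bytes available,
    # without generating any write traffic at all.
    sub tmpdir_full {
        my ($min_bytes) = @_;
        my $df = df($ENV{TMPDIR} || "/tmp", 1);   # block size 1 => bytes
        return 0 if not defined $df;              # statvfs failed: assume OK
        return $df->{bavail} < $min_bytes;
    }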
> GNU Parallel does an exponential back off:
>
> https://www.gnu.org/software/parallel/parallel_design.html#exponentially-back-off
There seems to be a problem with this back-off for some long-running
jobs, perhaps limited to nested parallel (see below).
>
> So for long running jobs, this test is run once per second. For short
> jobs (e.g. parallel true ::: {1..1000}) it will be run much more
> often.
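For reference, my reading of the scheme described above is roughly the
following (a sketch of the documented behavior, not the actual GNU
Parallel source; the names are mine):

    use Time::HiRes qw(time);

    my $check_interval = 0.001;   # seconds between disk-full checks
    my $next_check     = 0;

    sub maybe_exit_if_disk_full {
        return if time() < $next_check;
        exit_if_disk_full();
        # Double the interval after each check, capped at one second,
        # so long-running jobs end up being checked at most once per second.
        $check_interval *= 2;
        $check_interval = 1 if $check_interval > 1;
        $next_check = time() + $check_interval;
    }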
>
> I find it unlikely that the CPU usage is solely caused by the checking
> of disk full. You can verify this by changing:
>
> sub exit_if_disk_full() {
>     return;
>     ...
> }
>
> This will, however, remove the protection that checks if $TMPDIR runs
> full during a run (and in this case output may be wrong).
>
> I ran strace -tt:
>
> [pid 1000762] 23:54:08.486547 wait4(-1, [{WIFEXITED(s) &&
> WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 1000775
> [pid 1000762] 23:54:08.486782 write(7,
> "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"..., 8192) = 8192
> [pid 1000762] 23:54:08.486875 write(7, "b", 1) = 1
> [pid 1000762] 23:54:08.486956 ftruncate(7, 0) = 0
>
> so the write+truncate+seek takes at most 406 microseconds. When run
> on my system, `vmstat 1` shows that the write never reaches the disk,
> which is the goal.
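For anyone following along, my reading of the probe from the straces is
roughly the following (a hypothetical reconstruction, not the actual GNU
Parallel source; the paired 8192+1 write()s in the traces presumably
come from buffered I/O flushing in two pieces, whereas this sketch uses
a single syswrite()):

    use Fcntl qw(SEEK_SET);

    # $fh is a handle on an already-deleted file in $TMPDIR.
    sub exit_if_disk_full {
        my ($fh) = @_;
        # Write 8 KB + 1 byte; on a full filesystem the write fails or
        # comes up short even when the page cache absorbs the data.
        my $written = syswrite($fh, "x" x 8193);
        if (not defined $written or $written < 8193) {
            warn "parallel: output dir full\n";
            exit(255);
        }
        truncate($fh, 0);            # give the blocks back immediately
        sysseek($fh, 0, SEEK_SET);   # rewind for the next probe
    }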
>
>> Any thoughts on how to avoid this seemingly useless spin loop with high CPU
>> and I/O resource consumption?
>
> Can you show that there really is an I/O consumption? Do the writes
> actually reach your disk?
I can confirm that the check is not throttled to once per second, and
that it maxes out the I/O on a fast NVMe device with very high CPU
utilization. The following strace shows the probe running about twice
per millisecond.
[root@zfs1 ~]# ps auxww | grep 836133
root 836133 98.2 0.0 55488 17672 pts/6 R+ Mar08 1277:14
/usr/bin/perl /usr/bin/parallel --delay 2 -P /tmp/parallel.zfs.backup3 parallel
-I// syncoid {} // ::: $LIST
[root@zfs1 ~]# top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
741105 root 20 0 55508 17676 4388 R 98.7 0.0 1276:10 parallel
836133 root 20 0 55488 17672 4388 R 98.4 0.0 1278:08 parallel
[root@zfs1 ~]# strace -tt -p 836133 |& head -20
strace: Process 836133 attached
10:17:53.974396 write(10, "x", 1) = 1
10:17:53.974599 ftruncate(10, 0) = 0
10:17:53.974620 lseek(10, 0, SEEK_SET) = 0
10:17:53.974637 lseek(10, 0, SEEK_CUR) = 0
10:17:53.974654 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
10:17:53.974679 rt_sigaction(SIGCHLD, {sa_handler=0x7fe81f000dc0, sa_mask=[],
sa_flags=SA_RESTORER, sa_restorer=0x7fe81ed29c20}, {sa_handler=SIG_DFL,
sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fe81ed29c20}, 8) = 0
10:17:53.974701 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
10:17:53.974722 select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=0}) = 0 (Timeout)
10:17:53.974746 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
10:17:53.974763 rt_sigaction(SIGCHLD, {sa_handler=SIG_DFL, sa_mask=[],
sa_flags=SA_RESTORER, sa_restorer=0x7fe81ed29c20}, {sa_handler=0x7fe81f000dc0,
sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fe81ed29c20}, 8) = 0
10:17:53.974783 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
10:17:53.974810 wait4(-1, 0x7ffc539e8a14, WNOHANG, NULL) = 0
10:17:53.974829 write(10, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
10:17:53.974849 write(10, "x", 1) = 1
10:17:53.974867 ftruncate(10, 0) = 0
10:17:53.974884 lseek(10, 0, SEEK_SET) = 0
10:17:53.974898 lseek(10, 0, SEEK_CUR) = 0
10:17:53.974915 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
10:17:53.974932 rt_sigaction(SIGCHLD, {sa_handler=0x7fe81f000dc0, sa_mask=[],
sa_flags=SA_RESTORER, sa_restorer=0x7fe81ed29c20}, {sa_handler=SIG_DFL,
sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fe81ed29c20}, 8) = 0
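At 8193 bytes per probe and roughly 2000 probes per second, that works
out to about 16 MB/s of write() traffic per spinning parallel process,
before counting the ftruncate()/lseek() and signal calls around each
probe.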
Thanks.
--
Stuart Anderson
s...@caltech.edu