Package: apt-cacher-ng
Version: 3.7.4-1

Since quite many months I see occasionally (~ once a month) this cron job 
warning:
/etc/cron.daily/apt-cacher-ng:
Aborted
run-parts: /etc/cron.daily/apt-cacher-ng exited with return code 134

I originally assumed it was because of missing large-file-support (see bug 
#1020909),
but it turns out to be some kind of time-out.

To find the cause of this error I modified /etc/cron.daily/apt-cacher-ng like 
this
and had to wait some weeks until the error occured:
----------------------
echo "Debug info added to /etc/cron.daily/apt-cacher-ng"
echo "Helge added debug info to /etc/cron.daily/apt-cacher-ng to find cause for 
this: /etc/cron.daily/apt-cacher-ng exited with return code 134"

if [ -d /var/log/apt-cacher-ng ]; then
        echo "clean dir:  /var/log/apt-cacher-ng/"
        ls -la /var/log/apt-cacher-ng/
        echo "INODES"
        ls -li /var/log/apt-cacher-ng/
        echo "FIND 1:"
                # rm older than 10 days
                find /var/log/apt-cacher-ng -maxdepth 1 -name 
'maint_*.log.html.xz' -or -name 'maint_*.log.html' -type f -user apt-cacher-ng 
-mtime +10 -delete
        echo "XZ 1:"
                # compress older than 5 days
                find /var/log/apt-cacher-ng -maxdepth 1 -name 
'maint_*.log.html' -type f -user apt-cacher-ng -mtime +5 -print0 | xargs -r0 xz
        echo "clean dir done"
fi

rpage=$(/usr/lib/apt-cacher-ng/acngtool -c /etc/apt-cacher-ng 2>/dev/null 
printvar ReportPage)
echo "rpage is:   $rpage"
if test 0 = $(echo "$rpage" | wc -w)
then
        echo Warning: ReportPage setting of apt-cacher-ng was disabled by 
administrator. >&2
        echo See /usr/share/doc/apt-cacher-ng/README.Debian for details. >&2
        exit 42
fi

echo "Now start scngtool maint:"
strace -f /usr/lib/apt-cacher-ng/acngtool maint -c /etc/apt-cacher-ng 
SocketPath=/var/run/apt-cacher-ng/socket  2>&1
echo "done."
------------------------

The important part is the added "strace" at the end to see what 
"/usr/lib/apt-cacher-ng/acngtool maint" is doing.
I've added the full output of this faulty run to this bug report.
I think the interesting part is this strace output:
...
[pid  4533] read(9, "<b>Bringing index files up to da"..., 30000) = 46
[pid  4533] pselect6(10, [7 9], [], NULL, {tv_sec=17, tv_nsec=23000}, NULL) = 1 
(in [9], left {tv_sec=17, tv_nsec=23000})
[pid  4533] read(9, "\r\n", 30000)      = 2
[pid  4533] pselect6(10, [7 9], [], NULL, {tv_sec=17, tv_nsec=23000}, NULL) = 0 
(Timeout)
[pid  4533] shutdown(9, SHUT_RDWR)      = 0
[pid  4533] close(9)                    = 0
...
It seems acngtool receives "Bringing index files up to date" (?) and then it
times out ofter ~17 seconds and kills the connection and afterwards is killed
with SIGABRT:
...
[pid  4554] tgkill(4526, 4554, SIGABRT) = 0
[pid  4554] --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=4526, 
si_uid=0} ---
[pid  4526] <... futex resumed>)        = ?
[pid  4554] +++ killed by SIGABRT +++
+++ killed by SIGABRT +++
...

My assumption is, that the ~ 17 seconds wait time is too short.
This machine is a debian buildd server for the hppa architecture.
It's much slower than x86 machines and sometimes has a high CPU load
when building packages. This can lead to slowly running cron jobs which then
probably triggers this kind of errors.

I briefly looked at the source code of apt-cacher-ng, and it seems many timeouts
can be configured in /etc/apt-cacher-ng/acng.conf.
Do you have a suggestion which of those is relevant for this cron job?
Could it be the nettimeout(17) as from src/acfg_defaults.cc, which can
be configured with the "NetworkTimeout" parameter?
If so, does it make sense to increase that value by default?

Another question would be:
If it really turns out to be a time-out issue, would it be possible to print
some better warning or error message instead of just SIGABRT ?

Thanks,
Helge

Attachment: apt-cacher-ng-trace.log.gz
Description: application/gzip

Reply via email to