Bug#986749: Bad file descriptor / failed to move stale items / storage error / hash mismatch and other

2021-05-30 Thread Eduard Bloch
notfound 986749 3.6.4-1
notfound 986749 3.7-1
severity 986749 serious
thanks

On Thu, 15 Apr 2021 13:49:22 + FUSTE Emmanuel wrote:
> Hello,
>
> I have similar problems, exacerbated by three levels of "forcemanaged=1"
> apt-cacher servers behind a Blue Coat proxy.
> Some are VMs, some are physical. All machines / OS are ok.
> My config uses VfileUseRangeOps:-1 and ReuseConnections:0
> Even trashing all the caches on all the servers does not completely cure
> the problem, which reappears shortly afterwards.
> Concurrent client activity worsens/triggers the problem very quickly.
> Smells like threading problems.
> Will activate Debug: 7 and report here if I see something interesting.

After extensive testing, I am pretty sure that the root cause of this
problem was solved in the commit
https://salsa.debian.org/blade/apt-cacher-ng/-/commit/c333cf3829e6373bcad07c831436317a7c34fac1
or, for Sid (hopefully unblocked...):
https://salsa.debian.org/blade/apt-cacher-ng/-/commit/2afc3d384b2c051f2754730ed392ea5381f854f1

The other aspects with stale storage items (file recreation) were
already tackled in versions 3.6.2 and 3.6.3.

Your guess was not bad: the Bad-File-Descriptor problem was related to
concurrency issues, but the error path was not trivial.

First, there was buggy usage of a RAII helper (unique_fd) which was
added as an afterthought in the commit:
0c02c1a0 (Eduard Bloch 2019-11-23 11:46:20 +0100
It was never used correctly, though. The extra member in the class was
there only for "design beauty" (uniformity) and is basically unused, but
it interfered with the existing method for graceful connection shutdown
(see destructor). So after that change the socket was closed ASAP and
NOT gracefully (risking loss of the final bytes of the active TCP
stream), which is an issue of its own. Then the delayed closer code (see
sockio.cc) came along and tried to close this socket again, which killed
random streams depending on the timing, because the descriptor number
could already have been reused for another connection. This was not
obvious with a fast server and a few clients, but under some load it
becomes a real problem.
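
To illustrate why a second close is dangerous, here is a minimal sketch. It
is not the apt-cacher-ng code: UniqueFd, hand_over_for_graceful_close and
fd_number_is_reused are hypothetical names made up for this example. It
shows that a closed descriptor number is handed out again by the next
allocation, and that an RAII owner has to give up ownership before some
other component (such as a delayed closer) closes the descriptor.

```cpp
#include <cassert>
#include <unistd.h>

// Hypothetical minimal RAII owner of a descriptor (not the acng unique_fd).
class UniqueFd {
    int fd_ = -1;
public:
    explicit UniqueFd(int fd = -1) : fd_(fd) {}
    UniqueFd(const UniqueFd&) = delete;
    UniqueFd& operator=(const UniqueFd&) = delete;
    ~UniqueFd() { reset(); }
    int get() const { return fd_; }
    // Give up ownership WITHOUT closing; the caller must close exactly once.
    int release() { int r = fd_; fd_ = -1; return r; }
    // Close if owned, then mark as invalid so a second call does nothing.
    void reset() { if (fd_ >= 0) { ::close(fd_); fd_ = -1; } }
};

// Hand the descriptor to a (hypothetical) delayed closer. After this call
// the RAII owner will no longer close it in its destructor.
int hand_over_for_graceful_close(UniqueFd& owner) {
    return owner.release();
}

// Demonstrates descriptor number reuse: after close(), dup() returns the
// lowest free number, which is the one that was just closed.
bool fd_number_is_reused() {
    int p[2];
    if (::pipe(p) != 0) return false;
    const int stale = p[0];
    ::close(p[0]);                  // the premature close
    const int fresh = ::dup(p[1]);  // an unrelated "new connection"
    const bool reused = (fresh == stale);
    // A second close(stale) at this point would close 'fresh' instead.
    if (fresh >= 0) ::close(fresh);
    ::close(p[1]);
    return reused;
}
```

In a single-threaded run fd_number_is_reused() returns true, since dup()
is specified to return the lowest available descriptor number. In a
multi-threaded server the reused number can belong to any other
connection, which matches the "random streams" symptom described above.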

Then, another problem was the graceful-closing code itself. It was not
thread-safe, but it was called from a multi-threaded context via the
FinishConnection method in conserver.cc. This is now fixed by posting
the scheduling task into the IO thread. I'd also consider the old code
inefficient and error-prone because it was using a hashmap for a purpose
where simply allocating the metadata nodes and releasing them is totally
sufficient and probably cheaper. So I rewrote this mess in sockio.cc
some weeks ago, and the current code seems to behave stably, i.e. no
socket or memory leaks have been spotted since then.
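
The general idea of "posting into the IO thread" can be sketched as
follows. This is only an illustration under my own assumptions, not the
code from sockio.cc: IoThread and schedule_from_many_threads are
hypothetical names, and the "pending" list stands in for whatever
bookkeeping the delayed closer needs. The point is that only one thread
ever touches that bookkeeping, so it needs no locking of its own.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical single-threaded executor: tasks posted from any thread are
// run one after another on the worker thread.
class IoThread {
    std::mutex mx_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> tasks_;
    bool stop_ = false;
    std::thread worker_;  // declared last, so it starts after the members above

    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(mx_);
                cv_.wait(lk, [this] { return stop_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stop requested, queue drained
                task = std::move(tasks_.front());
                tasks_.pop_front();
            }
            task();  // executed outside the lock
        }
    }

public:
    IoThread() : worker_([this] { run(); }) {}
    ~IoThread() {
        { std::lock_guard<std::mutex> lk(mx_); stop_ = true; }
        cv_.notify_one();
        worker_.join();  // remaining tasks are processed before the join returns
    }
    void post(std::function<void()> task) {
        { std::lock_guard<std::mutex> lk(mx_); tasks_.push_back(std::move(task)); }
        cv_.notify_one();
    }
};

// Several "connection threads" request a delayed close; the list of
// pending descriptors is modified only on the IO thread.
std::size_t schedule_from_many_threads(int nthreads, int per_thread) {
    std::vector<int> pending;  // touched only by the IO thread while it runs
    {
        IoThread io;
        std::vector<std::thread> conns;
        for (int t = 0; t < nthreads; ++t) {
            conns.emplace_back([&io, &pending, t, per_thread] {
                for (int i = 0; i < per_thread; ++i) {
                    const int fake_fd = t * per_thread + i;  // placeholder value
                    io.post([&pending, fake_fd] { pending.push_back(fake_fd); });
                }
            });
        }
        for (auto& c : conns) c.join();
    }  // IoThread destructor drains the queue and joins the worker
    return pending.size();
}
```

For example, schedule_from_many_threads(4, 250) should report 1000
entries, without any lock around the list itself.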

Another minor issue which caught my eye was the forceclose() helper
method, which was written in a sloppy way many years ago and which
might call close(-1) once in a while. That is not a drama, but it is
pointless. The method is now dropped in the Unstable commit (see above);
it was hardly used anyway.
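
For reference, a call like close(-1) fails with EBADF, which is the errno
value behind the "Bad file descriptor" message. A guarded helper avoids
the pointless call. The sketch below uses hypothetical names
(close_if_valid, errno_of_close_minus_one) and is not the removed
forceclose() method.

```cpp
#include <cassert>
#include <cerrno>
#include <unistd.h>

// Hypothetical helper: closes the descriptor only if it looks valid and
// marks it invalid afterwards, so repeated calls do nothing.
// Returns true if close() was actually called.
inline bool close_if_valid(int& fd) {
    if (fd < 0) return false;
    ::close(fd);
    fd = -1;
    return true;
}

// Returns the errno value produced by close(-1), or 0 if the call
// unexpectedly succeeded.
inline int errno_of_close_minus_one() {
    errno = 0;
    const int rc = ::close(-1);
    return rc == -1 ? errno : 0;
}
```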

Best regards,
Eduard.

--
Only when the last programmer has been locked up and the last
idea patented will you realize that lawyers cannot program



Bug#986749: Bad file descriptor / failed to move stale items / storage error / hash mismatch and other

2021-04-15 Thread FUSTE Emmanuel
Hello,

I have similar problems, exacerbated by three levels of "forcemanaged=1"
apt-cacher servers behind a Blue Coat proxy.
Some are VMs, some are physical. All machines / OS are ok.
My config uses VfileUseRangeOps:-1 and ReuseConnections:0
Even trashing all the caches on all the servers does not completely cure
the problem, which reappears shortly afterwards.
Concurrent client activity worsens/triggers the problem very quickly.
Smells like threading problems.
Will activate Debug: 7 and report here if I see something interesting.

Emmanuel.