On 2020-12-27 at 11:44 +0000, Nikolaus Rath wrote:
> <snip>

Thank you for the reply, Nikolaus.

> What does your kernel log say at this time (dmesg)?
> 
> Could it be that you're running out of memory, and the OOM killer is
> killing mount.s3ql to free up memory?

Kernel log is silent. It's definitely not an OOM (an OOM would have
SIGKILLed s3ql anyway).

> The TERM signal does not make sense to me; this is a non-fatal signal
> that should result in S3QL gracefully exiting.
> 
> Could you try what happens when you manually send SIGTERM to a running
> mount.s3ql process? Does it terminate properly with full logging until
> the end?

Nope. It dies immediately. Which is sort of expected, because I
actually see no SIGTERM handler in s3ql.
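
For illustration, here is a minimal sketch of what a graceful SIGTERM
handler could look like in Python. This is not s3ql code;
`shutdown_requested`, `request_shutdown()` and
`install_sigterm_handler()` are names I made up. Without a handler like
this, the default action for SIGTERM applies and the process simply dies:

```python
import signal
import threading

# Hypothetical flag that a main loop would poll to start a clean shutdown.
shutdown_requested = threading.Event()


def request_shutdown(signum, frame):
    """Signal handler: only record the request, do the real work elsewhere."""
    shutdown_requested.set()


def install_sigterm_handler():
    """Replace the default SIGTERM action (terminate) with request_shutdown.

    Must be called from the main thread; returns the previous handler.
    """
    return signal.signal(signal.SIGTERM, request_shutdown)
```

A main loop would then check `shutdown_requested` and flush and unmount
cleanly instead of being killed on the spot.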

And, on that matter, I see where it comes from. See below.

> So, in summary:
> 
> - Run standalone under gdb (and not as a systemd service)
> - Check kernel logs
> - Check memory usage
> - Try to send SIGTERM to a non-problematic mount

OK, so I did not yet try to run s3ql under gdb, but I think I
(partially) know what happens.

Running mount.s3ql in a plain shell session:

-- 8< --
mount.s3ql b2://<mybucket> /mnt/b2/files -o fg,log=none,authfile=/etc/s3ql/authinfo2,cachedir=/var/tmp/s3ql,debug,allow-other,compress=none,cachesize=10485760,threads=8,keep-cache,backend-options=disable-versions
-- 8< --

Produces this log:

-- 8< --
2020-12-27 19:04:33.819 211867 DEBUG    Thread-1 s3ql.backends.b2.b2_backend._do_request: RESPONSE: POST 400  97
2020-12-27 19:04:33.820 211867 DEBUG    MainThread s3ql.block_cache.with_event_loop: upload of 8652 failed
NoneType: None
2020-12-27 19:04:33.827 211867 DEBUG    Thread-1 s3ql.mount.exchook: recording exception 400 : bad_request - Checksum did not match data received
zsh: terminated  mount.s3ql b2://<mybucket> /mnt/b2/files -o
-- 8< --

Leaving out the question of why journald eats the last line, the
situation is pretty clear. The backend (B2Backend._do_request) raises
an exception (B2Error) which is not considered a "temporary failure".
It bubbles all the way through ObjectW.close(),
AbstractBackend.perform_write(), BlockCache._do_upload(),
BlockCache._upload_loop() and is never caught.

Finally, exchook() from mount.py:setup_exchook() gets called and sends
SIGTERM to the mount process (mount.py:687).
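
If I read it correctly, the mechanism boils down to something like the
following. This is a simplified sketch, not the actual code from
mount.py: `make_thread_exchook()` and its `notify` parameter are
hypothetical, and I use `threading.excepthook` (Python 3.8+) for
brevity, which may well differ from how s3ql installs its hook.

```python
import os
import signal
import threading


def _sigterm_self():
    """Default notification: send SIGTERM to our own process."""
    os.kill(os.getpid(), signal.SIGTERM)


def make_thread_exchook(recorded, notify=_sigterm_self):
    """Build a hook that records an uncaught thread exception, then notifies.

    `recorded` is a list that receives (exc_type, exc_value) tuples.
    `notify` is called after recording; by default it signals the process.
    """
    def hook(args):
        recorded.append((args.exc_type, args.exc_value))
        notify()

    return hook


# Usage (assumption: installed once at startup, before worker threads run):
#   errors = []
#   threading.excepthook = make_thread_exchook(errors)
```

With no SIGTERM handler installed, the default `notify` here terminates
the process right after the exception is recorded, which matches the
"zsh: terminated" line above.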

Does that sound plausible?

I have just patched up error handling in the B2 backend to consider the
checksum mismatch a transient failure (testing now). But I take it the
whole SIGTERM thing is also unexpected?
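
The idea behind the patch, as a standalone sketch rather than the actual
diff: `B2ErrorStandIn`, `is_temp_failure()` and `call_with_retry()` are
stand-ins I wrote for this mail, and the attribute names (`status`,
`code`, `message`) are assumptions modelled on the log line above.

```python
import time


class B2ErrorStandIn(Exception):
    """Hypothetical stand-in for the backend's error type (not the s3ql class)."""

    def __init__(self, status, code, message):
        super().__init__(f"{status} : {code} - {message}")
        self.status = status
        self.code = code
        self.message = message


def is_temp_failure(exc):
    """Treat a checksum mismatch reported by the server as retryable."""
    return (
        isinstance(exc, B2ErrorStandIn)
        and exc.status == 400
        and exc.code == "bad_request"
        and "Checksum did not match" in exc.message
    )


def call_with_retry(fn, attempts=3, delay=0.0):
    """Call fn(), retrying only on failures classified as temporary."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up on the last attempt or on any non-transient error.
            if attempt == attempts or not is_temp_failure(exc):
                raise
            time.sleep(delay)
```

Any other 400 response is still re-raised immediately, so only the
checksum mismatch is retried.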

-- 
Ivan Shapovalov / intelfx /
