On Dec 28 2020, Ivan Shapovalov <[email protected]> wrote:
> On 2020-12-28 at 13:41 +0000, Nikolaus Rath wrote:
>> On Dec 28 2020, Ivan Shapovalov <[email protected]> wrote:
>> > 2020-12-27 19:04:33.819 211867 DEBUG    Thread-1
>> > s3ql.backends.b2.b2_backend._do_request: RESPONSE: POST 400  97
>> > 2020-12-27 19:04:33.820 211867 DEBUG    MainThread
>> > s3ql.block_cache.with_event_loop: upload of 8652 failed
>> > NoneType: None
>> > 2020-12-27 19:04:33.827 211867 DEBUG Thread-1 s3ql.mount.exchook:
>> > recording exception 400
>> > : bad_request - Checksum did not match data received
>> > zsh: terminated  mount.s3ql b2://<mybucket> /mnt/b2/files -o
>> > -- 8< --
>> > 
>> > Leaving out the question of why journald eats the last line, the
>> > situation is pretty clear. The backend (B2Backend._do_request)
>> > raises
>> > an exception (B2Error) which is not considered a "temporary
>> > failure".
>> > 
>> > I have just patched up error handling in the B2 backend to consider
>> > the
>> > checksum mismatch a transient failure (testing now).
>> 
>> Is B2 not using SSL for its data connection? That should make sure
>> that
>> there are no checksum errors....
>
> Indeed it does. I have added some proper exception logging and found
> the actual problem, which is — unsurprisingly — combination of user
> error, unclear system requirements and broken logging.

Great, thanks for your help!

Could you tell what exactly you changed to make the exception
information appear?


>
> The B2 backend creates a temporary file for each object that is being
> uploaded. My s3ql instance has object size = 1 GiB, and with threads=8,
> that means at most 8 GiB worth of temporary files at once. Thing is,
> temporary files are created in /tmp, which is a tmpfs and has a size
> limit.
>
> -- 8< --
> 2020-12-28 16:39:12.924 340652 ERROR    Thread-3 s3ql.mount.exchook: 
> Unhandled exception in thread, terminating
> Traceback (most recent call last):                                            
>  
>   File "/usr/lib/python3.9/site-packages/s3ql/backends/common.py", line 279, 
> in perform_write
>     return fn(fh)                                                             
>  
>   File "/usr/lib/python3.9/site-packages/s3ql/block_cache.py", line 334, in 
> do_write
>     fh.write(buf)                                                             
>  
>   File "/usr/lib/python3.9/site-packages/s3ql/backends/b2/object_w.py", line 
> 36, in write
>     self.fh.write(buf)                                                        
>  
> OSError: [Errno 28] No space left on device
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):                                            
>  
>   File "/usr/lib/python3.9/site-packages/s3ql/mount.py", line 58, in 
> run_with_except_hook
>     run_old(*args, **kw)                                                      
>  
>   File "/usr/lib/python3.9/threading.py", line 892, in run
>     self._target(*self._args, **self._kwargs)
>   File "/usr/lib/python3.9/site-packages/s3ql/block_cache.py", line 319, in 
> _upload_loop
>     self._do_upload(*tmp)                                                     
>  
>   File "/usr/lib/python3.9/site-packages/s3ql/block_cache.py", line 376, in 
> _do_upload
>     obj_size = backend.perform_write(do_write, 's3ql_data_%d'
>   File "/usr/lib/python3.9/site-packages/s3ql/backends/common.py", line 108, 
> in wrapped
>     return method(*a, **kw)                                                   
>  
>   File "/usr/lib/python3.9/site-packages/s3ql/backends/common.py", line 279, 
> in perform_write
>     return fn(fh)                                                             
>  
>   File "/usr/lib/python3.9/site-packages/s3ql/backends/b2/object_w.py", line 
> 79, in __exit__
>     self.close()


I consider this a bug in the B2 backend (and other backends may have the
same problem). If the backend returns an exception from write(), this
should not result in a second exception from close(). Either write()
should update the checksum to reflect the partial data that was written
(thus eliminating the checksum error on upload), or perhaps it should
set a flag that this object should not be uploaded at all on close.

https://github.com/s3ql/s3ql/issues/228


>   File "/usr/lib/python3.9/site-packages/s3ql/backends/common.py", line 108, 
> in wrapped
>     return method(*a, **kw)                                                   
>  
>   File "/usr/lib/python3.9/site-packages/s3ql/backends/b2/object_w.py", line 
> 64, in close
>     response = self.backend._do_upload_request(self.headers, self.fh)
>   File "/usr/lib/python3.9/site-packages/s3ql/backends/b2/b2_backend.py", 
> line 291, in _do_upload_request
>     response, response_body = self._do_request(upload_url_info['connection'], 
> 'POST', upload_url_info['path'], headers, body)
>   File "/usr/lib/python3.9/site-packages/s3ql/backends/b2/b2_backend.py", 
> line 235, in _do_request
>     response = connection.read_response()
>   File "/usr/lib/python3.9/site-packages/dugong/__init__.py", line 790, in 
> read_response
>     return eval_coroutine(self.co_read_response(), self.timeout)
>   File "/usr/lib/python3.9/site-packages/dugong/__init__.py", line 1531, in 
> eval_coroutine
>     if not next(crt).poll(timeout=timeout):
>   File "/usr/lib/python3.9/site-packages/dugong/__init__.py", line 803, in 
> co_read_response
>     raise StateError('No pending requests')
> dugong.StateError: No pending requests                                        
>  
> zsh: terminated  mount.s3ql b2://<mybucket> /mnt/b2/files -o
> -- 8< --
>
> I guess this was causing the 400 as well, because we were sending the
> temporary file including the last partial write, but the locally
> computed hash did not include it.

Exactly.

> There is still a problem of getting the dugong.StateError instead of a
> 400. I don't know why it started giving me this, even after rolling
> back all the tentative patches. Though I'm sure it will be something
> trivial.

My guess is that dugong is expecting more data in the request (thus
the request is not considered complete and there is no response that
could be read). However, looking at the code, I do not immediately see
why this would be the case.


Best,
-Nikolaus

-- 
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

             »Time flies like an arrow, fruit flies like a Banana.«

-- 
You received this message because you are subscribed to the Google Groups 
"s3ql" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/s3ql/87o8idrlwl.fsf%40vostro.rath.org.

Reply via email to