Hello,

Ok, I found it! (Or at least, one of them.)
And a sure way to reproduce the crash.
All this time I was looking for an async/await bug... but it turned out to 
be another kind of concurrency issue :-)

The bug:
If you move a file from one folder to another, and the inode cache happens 
to be in a very specific state (file to be moved not cached, destination 
folder in cache and about to be evicted on the next inode cache miss, 
origin folder in cache and about to be evicted right after):
--> fs.py:654 greatly reduces the occurrence of this bug by ensuring the 
origin and destination folders are in the inode cache.
--> fs.py:658 self._lookup() loads the inode of the file to be moved, 
evicting the destination folder inode.
--> fs.py:719 self.inodes[id_p_old] loads the origin folder, which was 
still in cache.
--> fs.py:720 self.inodes[id_p_new] loads the destination folder, evicting 
the origin folder inode.
--> fs.py:721/723 Changes are made on the origin folder inode, which is no 
longer managed by the inode cache. There's the bug.
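For what it's worth, the sequence above can be sketched with a toy cache model. This is just an illustration with made-up names (ToyInodeCache, a plain dict standing in for the database); the real InodeCache manages _Inode objects backed by SQLite, but the FIFO eviction ordering is the same idea:

```python
# Toy model of the eviction-order bug (hypothetical names throughout; this
# is NOT the real s3ql InodeCache, which is a fixed-size ring of _Inode
# objects backed by a SQLite database).
class ToyInodeCache:
    CACHE_SIZE = 2  # the real cache holds 100 inodes

    def __init__(self):
        self.db = {}      # stands in for the SQLite database
        self.cached = {}  # inode_id -> attribute dict (insertion-ordered)

    def __getitem__(self, inode_id):
        if inode_id not in self.cached:
            if len(self.cached) >= self.CACHE_SIZE:
                # Cache miss on a full cache: flush the oldest entry to
                # the "database" and evict it, FIFO-style.
                victim = next(iter(self.cached))
                self.db[victim] = dict(self.cached.pop(victim))
            self.cached[inode_id] = dict(self.db.get(inode_id, {}))
        return self.cached[inode_id]

cache = ToyInodeCache()
cache['dst_dir']        # destination folder enters the cache
cache['src_dir']        # origin folder enters the cache
cache['file']           # lookup of the file: miss, evicts 'dst_dir'
src = cache['src_dir']  # origin folder is still cached, keep a reference
cache['dst_dir']        # miss again: flushes and evicts 'src_dir'
src['mtime'] = 1234     # modify the now-orphaned inode object...
assert 'mtime' not in cache.db['src_dir']  # ...the change is never flushed
```

The held reference `src` keeps the evicted object alive, so the modification silently goes nowhere once the cache no longer manages it.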

Fix here: https://github.com/s3ql/s3ql/pull/199

Thanks for the pointers!

Nicolas Deschildre

On Friday, September 11, 2020 at 8:50:24 AM UTC+2 Nicolas Deschildre wrote:

> Hello,
>
> Thanks. Ok, I now understand the code better.
> InodeCache holds 100 inodes in a ring. When an inode that is not in the 
> cache is requested, one cached inode is flushed and evicted, and the new 
> inode is stored in its place.
> The bug is a race condition: 2 inodes are requested from InodeCache at the 
> same time. Thread 1, requesting inode 1, flushes and removes inode 2 from 
> the cache. Thread 2 got a reference to inode 2 before Thread 1 removed it, 
> but makes changes after it was removed and flushed by Thread 1. Thread 2 
> ends, there are no longer any references to inode 2, Python 
> garbage-collects it, and that triggers the bug.
>
> I see 2 possible solutions: 
> 1/ https://github.com/s3ql/s3ql/pull/196 : The _Inode class keeps a 
> reference to the InodeCache. On __del__, if we encounter the above bug, we 
> flush ourselves. The problem (and I'm not familiar enough with Python): I 
> guess garbage collection could happen at shutdown, when the InodeCache SQL 
> connection is no longer valid. Do you see a way to make this approach work?
> 2/ (Not coded yet): The InodeCache is a ring, first in, first out. What if 
> we stored the access time in InodeCache.__getitem__ and evicted the least 
> recently accessed inode instead? This would greatly reduce (but not 
> eliminate) the race condition. What do you think?
>
> Finally: I tried to reproduce the bug locally, with an unencrypted, 
> uncompressed local backend and massively parallel attribute changes, but I 
> was not able to trigger it. 
>
> Thanks,
> Nicolas Deschildre
>
> On Wednesday, September 9, 2020 at 5:50:19 PM UTC+2 [email protected] 
> wrote:
>
>> Hello Nicolas,
>>  
>>
>> S3QL somehow manages to delete/garbage collects an _Inode object that is 
>>> dirty (i.e. has an attribute modification that is not persisted in the 
>>> S3QL database) 
>>>
>>
>> So, if I understand correctly, since it is a pending modification on a 
>> now-deleted inode, this is not really a problem, right? In other words, 
>> the filesystem is not corrupted? [...]
>>
>>
>> No, the inode/file does not have to be deleted. There is a pending 
>> metadata modification (access time, modification time, file size, file 
>> mode) that should have been persisted (written to the SQLite database) 
>> but never made it there.
>> File data is not corrupted, but some metadata of your files might not be 
>> in the correct state.
>>
>>
