Hello,

A quick follow-up on this: since I put the patch live, I haven't seen a single new BUG ALERT. Looks like that was the only one I was getting.
Cheers,
Nicolas

On Thursday, September 17, 2020 at 9:07:32 PM UTC+2 [email protected] wrote:
> Great work, thank you for tracking this down, Nicolas!
>
> Best,
> -Nikolaus
>
> --
> GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F
>
> »Time flies like an arrow, fruit flies like a Banana.«
>
> On Thu, 17 Sep 2020, at 17:26, Nicolas Deschildre wrote:
> Hello,
>
> Ok, I found it!! (Or at least, one of them.) And a sure way to reproduce
> the crash. All this time I was looking for an async/await bug... but it
> turned out to be another kind of concurrency issue :-)
>
> The bug: if you move a file from one folder to another, and the inode
> cache happens to be in a very specific state (file to be moved uncached,
> destination folder in cache and about to be evicted on the next inode
> cache miss, origin folder in cache and about to be evicted right after):
>
> --> fs.py:654 greatly reduces the occurrence of this bug by ensuring the
> origin and destination folders are in the inode cache.
> --> fs.py:658 self._lookup() loads the inode of the file to be moved,
> evicting the destination folder's inode.
> --> fs.py:719 self.inodes[id_p_old] loads the origin folder, which is
> still in the cache.
> --> fs.py:720 self.inodes[id_p_new] loads the destination folder and
> evicts the origin folder's inode.
> --> fs.py:721/723 changes are made on the origin folder's inode, which is
> no longer tracked by the inode cache. We have the bug.
>
> Fix here: https://github.com/s3ql/s3ql/pull/199
>
> Thanks for the pointers!
>
> Nicolas Deschildre
>
> On Friday, September 11, 2020 at 8:50:24 AM UTC+2 Nicolas Deschildre wrote:
> Hello,
>
> Thanks, OK, I now understand the code better. InodeCache holds 100 inodes
> in a ring. When an inode that is not in the cache is requested, an inode
> already in the cache is flushed and removed, and the new inode is stored
> in its place.
>
> The bug is a race condition: two inodes are requested from InodeCache at
> the same time. Thread 1, requesting inode 1, flushes and removes inode 2
> from the cache. Thread 2 got a reference to inode 2 before it was removed
> from the cache by Thread 1, but makes its changes after inode 2 was
> removed and flushed. Thread 2 ends, there are no longer any references to
> inode 2, Python garbage collects it, and that triggers the bug.
>
> I see 2 possible solutions:
>
> 1/ https://github.com/s3ql/s3ql/pull/196 : The _Inode class keeps a
> reference to the InodeCache. In __del__, if we encounter the above bug, we
> flush ourselves. The problem (and I'm not familiar enough with Python): I
> guess garbage collection could happen at shutdown, when the InodeCache SQL
> connection is no longer valid. Do you see a way to make this approach
> work?
>
> 2/ (Not coded yet): The InodeCache is a ring, first in, first out. What if
> we stored the access time in InodeCache.__getitem__ and evicted the least
> recently accessed inode instead? This should greatly reduce (but not
> eliminate) the race condition. What do you think?
>
> Finally: I tried to reproduce the bug locally, with an unencrypted,
> uncompressed local backend and massively parallel attribute changes, but
> I was not able to reproduce it.
>
> Thanks,
> Nicolas Deschildre
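In case it helps the discussion of solution 2 in the message quoted just above: a minimal, hypothetical sketch of what access-time based eviction could look like. None of this is the actual s3ql InodeCache or the code in either pull request; Inode, FakeDB, LRUInodeCache and their methods are invented purely for illustration.

import time


class Inode:
    """Toy stand-in for s3ql's _Inode: an id, one attribute, a dirty flag."""
    def __init__(self, id_, mtime=0):
        self.id = id_
        self.mtime = mtime
        self.dirty = False      # set by callers when they change an attribute


class FakeDB:
    """Toy stand-in for the SQLite table that backs the real cache."""
    def __init__(self):
        self.rows = {}          # inode id -> persisted mtime

    def load(self, id_):
        return Inode(id_, self.rows.get(id_, 0))

    def store(self, inode):
        self.rows[inode.id] = inode.mtime


class LRUInodeCache:
    CACHE_SIZE = 100            # same size as the ring described above

    def __init__(self, db):
        self.db = db
        self.entries = {}       # inode id -> Inode object
        self.last_access = {}   # inode id -> monotonic time of last access

    def __getitem__(self, id_):
        if id_ not in self.entries:
            if len(self.entries) >= self.CACHE_SIZE:
                self._evict_one()
            self.entries[id_] = self.db.load(id_)
        # Recording the access time is the core of the proposal: an inode
        # that was just handed out cannot be the next eviction victim, so a
        # caller still holding it is far less likely to see it flushed and
        # dropped underneath them.  The race is narrowed, not eliminated.
        self.last_access[id_] = time.monotonic()
        return self.entries[id_]

    def _evict_one(self):
        # Evict the least recently accessed inode instead of whatever
        # happens to sit in the next slot of the fixed ring.
        victim = min(self.last_access, key=self.last_access.get)
        inode = self.entries.pop(victim)
        del self.last_access[victim]
        if inode.dirty:
            self.db.store(inode)    # flush the pending attribute change

As noted above, this only narrows the window: a caller that holds an inode while 100 other inodes are touched can still see it evicted and flushed behind its back.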
> On Wednesday, September 9, 2020 at 5:50:19 PM UTC+2 [email protected] wrote:
> Hello Nicolas,
>
> > > S3QL somehow manages to delete/garbage collect an _Inode object that
> > > is dirty (i.e. has an attribute modification that is not persisted in
> > > the S3QL database).
> >
> > So, if I understand correctly, since it is a pending modification on a
> > now deleted inode, this is not really a problem, right? Said otherwise,
> > the filesystem is not corrupted? [...]
>
> No, the inode/file does not have to be deleted. There is a pending
> metadata modification (access time, modification time, file size, file
> mode) that should have been persisted (written to the SQLite database) but
> did not make it into the database. File data is not corrupted, but some
> metadata of your files might not be in the correct state.
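For anyone skimming the thread, here is a self-contained toy model of the lost-update pattern described above, loosely mirroring the fs.py:654/658/719/720 sequence from the September 17 message. It is not s3ql code: the two-entry FIFO cache, the dict-based "database" and all names are invented purely to make the eviction ordering visible.

from collections import OrderedDict

db = {}   # stands in for the SQLite table holding inode attributes


def load(inode_id):
    # "Load" an inode as a mutable dict; the real _Inode carries
    # atime, mtime, size, mode, ...
    return db.get(inode_id, {"id": inode_id, "mtime": 0}).copy()


class TinyFIFOCache:
    """A 2-entry FIFO cache; an evicted entry is flushed back to db."""
    SIZE = 2

    def __init__(self):
        self.entries = OrderedDict()

    def __getitem__(self, inode_id):
        if inode_id not in self.entries:
            if len(self.entries) >= self.SIZE:
                _, old = self.entries.popitem(last=False)  # oldest entry
                db[old["id"]] = dict(old)   # flush = snapshot to the "db"
            self.entries[inode_id] = load(inode_id)
        return self.entries[inode_id]


cache = TinyFIFOCache()
cache["dest"]                   # destination folder: next eviction victim
cache["origin"]                 # origin folder: evicted right after it

cache["file"]                   # like fs.py:658 -- miss evicts "dest"
origin = cache["origin"]        # like fs.py:719 -- still cached, live object
cache["dest"]                   # like fs.py:720 -- miss evicts "origin",
                                # flushing its current (unchanged) state

origin["mtime"] = 42            # like fs.py:721/723 -- modify the evicted
                                # object; the cache no longer knows about it

print(db["origin"]["mtime"])    # prints 0 -- the update is silently lost

Running it prints 0: the mtime change made after "origin" was evicted (and its state flushed) never reaches the database, which is exactly the dirty-but-unpersisted metadata described in the quoted reply just above.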

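Purely as an illustration of the shape of solution 1 (the __del__ self-flush behind https://github.com/s3ql/s3ql/pull/196) from the September 11 message quoted above: this is not the code in that pull request, and ToyInode, ToyCache and FakeDB are made-up names.

class FakeDB:
    """Toy stand-in for the SQLite table behind the cache."""
    def __init__(self):
        self.rows = {}          # inode id -> persisted mtime

    def store(self, inode):
        self.rows[inode.id] = inode.mtime


class ToyCache:
    def __init__(self, db):
        self.db = db
        self.entries = {}       # inode id -> ToyInode object

    def contains(self, inode):
        return self.entries.get(inode.id) is inode


class ToyInode:
    def __init__(self, id_, cache):
        self.id = id_
        self.mtime = 0
        self.dirty = False
        self.cache = cache      # back-reference proposed in solution 1

    def __del__(self):
        # If we are dirty but the cache no longer tracks us (we were
        # evicted and flushed before our last modification), persist
        # ourselves now instead of silently dropping the change.
        if not self.dirty or self.cache.contains(self):
            return
        try:
            self.cache.db.store(self)
        except Exception:
            # Open question from the message above: at interpreter
            # shutdown the database connection may already be gone, so a
            # late flush can fail here.  This sketch only swallows the
            # error; it does not answer that question.
            pass

Whether __del__ can be relied on at interpreter shutdown is exactly the question left open in the thread; the sketch does not resolve it.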