Re: Safe File Update (atomic)
On Thu, Jan 6, 2011 at 7:59 PM, Enrico Weigelt weig...@metux.de wrote:
> * Olaf van der Spek olafvds...@gmail.com schrieb:
>> A transaction to update multiple files in one atomic go?
>
> Yes. The application first starts a transaction, creates/writes/removes a bunch of files, and then sends a commit. The changes should become visible atomically, and the call returns when the commit() has completed (and been written out to disk). If there are conflicts, the transaction is aborted with a proper error code.
>
> So, in the case of a package manager, the update will run completely in one shot (from the userland view) or not at all. I could live with:
> a) relatively slow performance (a commit taking a second or so)
> b) abort as soon as a conflict arises
> c) files changed within the transaction actually being new ones (sane package managers will have to unlink text files instead of simply overwriting anyway)

That would be nice, but the single-file case appears to be difficult enough already. So we might want to focus on that first.

Olaf

--
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/aanlktimegmsjdukzghfpnikvipahnz6hb3bjpemtb...@mail.gmail.com
Re: Safe File Update (atomic)
On Thu, Jan 6, 2011 at 1:54 AM, Ted Ts'o ty...@mit.edu wrote:
>> I was thinking, doesn't ext have this kind of dependency tracking already? It has to write the inode after writing the data, otherwise the inode might point to garbage.
>
> No, it doesn't. We use journaling, and forced data writeouts, to ensure consistency.

Suppose I append one byte to an existing file and I don't use fsync. Will it commit the inode with the increased size before the data byte is written? In that case, garbage might show up in my file.

Olaf
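The ordering Olaf asks about is exactly what an application can force itself with fsync(), without relying on the filesystem's internal data/metadata ordering. A minimal sketch in Python using the same POSIX calls under discussion; `append_durably` is an illustrative helper name, not an existing API:

```python
import os, tempfile

def append_durably(path, data):
    """Append bytes and force both the data and the inode's new size to disk.

    Without the fsync(), a filesystem that does not guarantee
    data-before-metadata ordering may commit the larger file size before
    the appended bytes, so after a crash the tail could contain garbage.
    """
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # flushes the appended data *and* the updated inode
    finally:
        os.close(fd)

# demo: append to a log file in a scratch directory
log = os.path.join(tempfile.mkdtemp(), "app.log")
append_durably(log, b"one more byte\n")
```

This trades throughput for the guarantee: every append now pays the cost of a write barrier.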
Re: Safe File Update (atomic)
* Ted Ts'o ty...@mit.edu [110105 19:26]:
> So one of the questions is how much we should be penalizing programs that are doing things right (i.e., using fsync), versus programs which are doing things wrong (i.e., using rename and trusting to luck).

Please do not call it wrong. All those programs are doing is not requesting some specific protection. They are doing file system operations that are totally within the normal abstraction level of file system interfaces. While some programs might be expected to anticipate cases not within that interface (i.e. the case that, due to some external event, the filesystem is interrupted in what it does and cannot complete its work), that is definitely not the responsibility of the average program, especially if there is no interface for this specific problem (i.e. requesting a barrier to only do a rename after the new file is actually committed to disk).

So the question is: how much should the filesystem protect my data in case of sudden power loss? Should it only protect data where the program explicitly requested something, or should it also do what it reasonably can to protect all data?

Having some performance knobs so users can choose between performance and data safety is good. This way users can make decisions depending on what they want. But a filesystem losing data so easily, or with a default setting that loses data so easily, is definitely not something to give unsuspecting users.

Bernhard R. Link
Re: Safe File Update (atomic)
On Thu, Jan 6, 2011 at 5:01 AM, Ted Ts'o ty...@mit.edu wrote:
> On Thu, Jan 06, 2011 at 12:57:07AM +0000, Ian Jackson wrote:
>> Ted Ts'o writes (Re: Safe File Update (atomic)):
>>> Then I invite you to implement it, and start discovering all of the corner cases for yourself. :-) As I predicted, you're not going to believe me when I tell you it's too hard.
>>
>> How about you reimplement all of Unix userland first, so that it doesn't have what you apparently think is a bug!
>
> I think you are forgetting the open source way, which is you scratch your own itch.

Most of the time one writes software because it's useful for oneself and others, not because writing software itself is so much fun. It's about the result. So the focus should be on what those users need/want.

> The main programs I use where I'd care about this (e.g., emacs) got this right two decades ago; I even remember being around during the MIT Project Athena days, almost 25 years ago, when we needed to add error checking to the fsync() call because Transarc's AFS didn't actually try to send the file you were saving to the file server until the fsync() or the close() call, and so if you got an over-quota error, it was reflected back at fsync() time, and not at the write() system call, which was what emacs had been expecting and checking. (All of which is POSIX compliant, so the bug was clearly with emacs; it was fixed, and we moved on.)

Would you classify the emacs implementation of safe file write semantics as simple or complex? Why did they not get it right the first time? IMO it's because the API is hard to use and easy to misuse, while it should be the other way around. Hiding behind POSIX semantics is easy but doesn't solve the problem.

> Note that all of the modern file systems (and all of the historical ones too, with the exception of ext3) have always had the same property. If you care about the data, you use fsync(). If you don't, then you can take advantage of the fact that compiles are really, really fast. (After all, in the very unlikely case that you crash, you can always rebuild, and why should you optimize for an unlikely case? And if you have crappy proprietary drivers that cause you to crash all the time, then maybe you should rethink using said proprietary drivers.) That's the open source way --- you scratch your own itch. I'm perfectly satisfied with the open source tools that I use. Unless you think the programmers two decades ago were smarter, and people have gotten dumber since then (Are we not men? We are Devo!), it really isn't that hard to follow the rules.

I think the number of programmers today is much larger than it was two decades ago, and I also think the average experience of the programmer went down.

Olaf
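For reference, the "get it right" pattern the thread keeps circling around (what emacs-style editors do) is: write a temp file in the same directory, fsync it, then rename it over the target. A minimal sketch in Python; `safe_replace` is an illustrative name, not an existing API:

```python
import os, tempfile

def safe_replace(path, data):
    """Atomically replace `path` with `data` (temp file + fsync + rename).

    Readers see either the complete old file or the complete new one,
    never a partially written mix, even across a crash.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)  # same fs, so rename is atomic
    try:
        try:
            os.write(fd, data)
            os.fsync(fd)       # commit the data *before* the rename
        finally:
            os.close(fd)
        os.rename(tmp, path)   # atomic on POSIX filesystems
        dfd = os.open(dirname, os.O_RDONLY)
        try:
            os.fsync(dfd)      # make the rename itself durable
        finally:
            os.close(dfd)
    except BaseException:
        try:
            os.unlink(tmp)     # best-effort cleanup of the temp file
        except OSError:
            pass
        raise
```

Note the caveat the thread is actually arguing about: the rename installs a new inode, so the original file's owner, ACLs, and hard links are not preserved by this pattern.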
Re: Safe File Update (atomic)
On Thu, Jan 6, 2011 at 12:39 PM, Olaf van der Spek olafvds...@gmail.com wrote:
> On Thu, Jan 6, 2011 at 5:01 AM, Ted Ts'o ty...@mit.edu wrote:
>> I think you are forgetting the open source way, which is you scratch your own itch.
>
> Most of the time one writes software because it's useful for oneself and others, not because writing software itself is so much fun. It's about the result. So the focus should be on what those users need/want.
>
>> The main programs I use where I'd care about this (e.g., emacs) got this right two decades ago; I even remember being around during the MIT Project Athena days, almost 25 years ago, when we needed to add error checking to the fsync() call because Transarc's AFS didn't actually try to send the file you were saving to the file server until the fsync() or the close() call, and so if you got an over-quota error, it was reflected back at fsync() time, and not at the write() system call, which was what emacs had been expecting and checking. (All of which is POSIX compliant, so the bug was clearly with emacs; it was fixed, and we moved on.)

Could you point to the code snippet? It could be worth adding to gnulib, for instance.

Another point is to create some FUSE filesystem for testing error conditions on filesystems. For instance, a filesystem that raises EIO or ENOSPC errors randomly; it would improve the quality of our software. I have written such a filesystem; I will surely post it in a few days (hopefully).

Bastien
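Short of a full fault-injecting FUSE filesystem like the one Bastien describes, a crude way to exercise an application's error handling is to wrap the write path and inject failures directly. A sketch under those assumptions; `FaultyWriter` and its parameters are illustrative, not any real API:

```python
import errno, os, random, tempfile

class FaultyWriter:
    """Wraps a file descriptor and injects ENOSPC on a fraction of writes.

    Useful for checking that application code actually notices and handles
    write errors instead of silently losing data.
    """
    def __init__(self, fd, failure_rate=0.5, rng=random.random):
        self.fd = fd
        self.failure_rate = failure_rate
        self.rng = rng  # injectable for deterministic tests

    def write(self, data):
        # Simulate a full disk on a random fraction of writes.
        if self.rng() < self.failure_rate:
            raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))
        return os.write(self.fd, data)
```

A FUSE filesystem is still the stronger tool, since it also exercises fsync(), rename(), and close() failures that this wrapper cannot reach.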
Re: Safe File Update (atomic)
> Getting people to believe that you can't square a circle[1] is very hard,

Just allow an infinite number of steps and it's almost trivial ;-)

> It's like trying to teach a pig to sing.

Well, that works, it just sounds a bit like Vogon poetry ;-o

> If you give me a specific approach, I can tell you why it won't work, or why it won't be accepted by the kernel maintainers (for example, because it involves pouring far too much complexity into the kernel).

To come back to the original question, I'd like to know which concrete real-world problems should be solved by that. One place where a database-like transactional filesystem (w/ MVCC) would be nice is package managers: we still have the problem that within the update process there may be inconsistent states (yes, this has already bitten me!). If it were possible to make an update visible atomically, that would be a big win for critical 24/7 systems.

My approach to this would be a special unionfs with transactional semantics (I admit: no idea how complex implementing this would be).

cu
--
Enrico Weigelt, metux IT service -- http://www.metux.de/
phone: +49 36207 519931 email: weig...@metux.de
mobile: +49 151 27565287 icq: 210169427 skype: nekrad666
Embedded-Linux / Portierung / Opensource-QM / Verteilte Systeme
Re: Safe File Update (atomic)
On Thu, Jan 6, 2011 at 7:33 PM, Enrico Weigelt weig...@metux.de wrote:
> To come back to the original question, I'd like to know which concrete real-world problems should be solved by that. One place where a database-like transactional filesystem (w/ MVCC) would be nice is package managers: we still have the problem that within the update process there may be inconsistent states (yes, this has already bitten me!). If it were possible to make an update visible atomically, that would be a big win for critical 24/7 systems.
>
> My approach to this would be a special unionfs with transactional semantics (I admit: no idea how complex implementing this would be).

A transaction to update multiple files in one atomic go? Nah, this request is for just a single file, although a future extension to multiple files shouldn't be too hard.

Olaf
Re: Safe File Update (atomic)
* Olaf van der Spek olafvds...@gmail.com schrieb:
> A transaction to update multiple files in one atomic go?

Yes. The application first starts a transaction, creates/writes/removes a bunch of files, and then sends a commit. The changes should become visible atomically, and the call returns when the commit() has completed (and been written out to disk). If there are conflicts, the transaction is aborted with a proper error code.

So, in the case of a package manager, the update will run completely in one shot (from the userland view) or not at all. I could live with:
a) relatively slow performance (a commit taking a second or so)
b) abort as soon as a conflict arises
c) files changed within the transaction actually being new ones (sane package managers will have to unlink text files instead of simply overwriting anyway)

cu
Enrico Weigelt
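No kernel transaction API like the one Enrico describes exists; the closest userspace approximation, used by some deployment and package tools, is to build the whole new file tree off to the side and publish it with a single atomic symlink swap. A sketch under that assumption; `publish_tree` is an illustrative name:

```python
import os, tempfile

def publish_tree(link_path, build_dir):
    """Atomically point `link_path` at `build_dir`.

    The "commit" is one rename(2) of a symlink: readers that resolve
    `link_path` see either the old tree or the new one in its entirety,
    which approximates the multi-file transaction described above.
    """
    tmp_link = link_path + ".tmp-new"
    os.symlink(build_dir, tmp_link)
    os.rename(tmp_link, link_path)  # atomically replaces the old symlink
```

The gap versus a real transaction: there is no conflict detection, no rollback of half-built trees on crash, and a process holding open file descriptors into the old tree keeps seeing it until it re-resolves the symlink.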
Re: Safe File Update (atomic)
On Wed, Jan 5, 2011 at 1:25 AM, Ted Ts'o ty...@mit.edu wrote:
> On Wed, Jan 05, 2011 at 01:05:03AM +0100, Olaf van der Spek wrote:
>> Why is it that you ignore all my responses to technical questions you asked?
>
> In general, because they are either (a) not well-formed, or (b) you are asking me to prove a negative. Getting people to believe that you

Saying that instead of ignoring half of my response would be more constructive.

> If you give me a specific approach, I can tell you why it won't work, or why it won't be accepted by the kernel maintainers (for example, because it involves pouring far too much complexity into the kernel).

Let's consider the temp-file workaround, since a lot of existing apps use it. The request is to commit the source data before committing the rename. Seems quite simple.

> But for me to list all possible approaches and tell you why each one is not going to work? You'll have to pay me before I'm willing to invest that kind of time.

That's not what I asked.

Olaf
Re: Safe File Update (atomic)
On Wed, Jan 05, 2011 at 12:55:22PM +0100, Olaf van der Spek wrote:
>> If you give me a specific approach, I can tell you why it won't work, or why it won't be accepted by the kernel maintainers (for example, because it involves pouring far too much complexity into the kernel).
>
> Let's consider the temp-file workaround, since a lot of existing apps use it. The request is to commit the source data before committing the rename. Seems quite simple.

Currently ext4 is initiating writeback on the source file at the time of the rename. Given performance measurements others have cited (maybe it was you, I can't remember, and I don't feel like going through the literally hundreds of messages on this and related threads), it seems that btrfs is doing something similar. The problem with doing a full commit, which means surviving a power failure, is that you have to request a barrier operation to make sure the data goes all the way down to the disk platter --- and this is expensive (on the order of at least 20-30ms, more if you've written a lot to the disk).

We have had experience with forcing data writeback (what you call "commit the source data") before the rename --- ext3 did that. And it had some very nasty performance problems which showed up on very busy systems where people were doing a lot of different things at the same time: large background writes from bittorrents and/or DVD ripping, compiles, web browsing, etc. If you force a large amount of data out when you do a commit, everything else that tries to write to the file system at that point stops, and if you have stupid programs (i.e., firefox trying to do database updates on its UI loop), it can cause programs to apparently lock up, and users get really upset.

So one of the questions is how much we should be penalizing programs that are doing things right (i.e., using fsync), versus programs which are doing things wrong (i.e., using rename and trusting to luck). This is a policy question, for which you might have a different opinion than I might have on the subject.

We could also simply force a synchronous data writeback at rename time, instead of merely starting writeback at the point of the rename. In the case of a program which has already done an fsync(), the synchronous data writeback would be a no-op, so that's good in terms of not penalizing programs which do things right. But the problem there is that there could be some renames where forcing data writeback is not needed, and so we would be forcing the performance hit of the "commit the source data" even when it might not be needed (or wanted) by the user.

How often does it happen that someone does a rename on top of an already-existing file, where the fsync() isn't wanted? Well, I can think up scenarios, such as where an existing .iso image is corrupted or needs to be updated, and so the user creates a new one and then renames it on top of the old .iso image, but then gets surprised when the rename ends up taking minutes to complete. Is that a common occurrence? Probably not, but the case of the system crashing right after the rename() is somewhat unusual as well. Humans in general suck at reasoning about low-probability events; that's why we are allowing low-paid TSA workers to grope air travellers to avoid terrorists blowing up planes midflight, while not being up in arms over the number of deaths every year due to automobile accidents. For this reason, I'm cautious about going overboard at forcing commits on renames; doing this has real performance implications, and it is a computer science truism that optimizing for the uncommon/failure case is a bad thing to do.

OK, what about simply deferring the commit of the rename until the file writeback has naturally completed? The problem with that is entangled updates. Suppose there is another file which is written to the same directory block as the one affected by the rename, and *that* file is fsync()'ed? Keeping track of all of the data dependencies is **hard**. See: http://lwn.net/Articles/339337/

>> But for me to list all possible approaches and tell you why each one is not going to work? You'll have to pay me before I'm willing to invest that kind of time.
>
> That's not what I asked.

Actually, it is, although maybe you didn't realize it. Look above, and how I had to present multiple alternatives, and then shoot them all down, one at a time. There are hundreds of solutions, all of them wrong. Hence why *my* counter is --- submit patches. The mere act of actually trying to code an alternative will allow you to determine why your approach won't work, or failing that, others can take your patch, apply it, and then demonstrate use cases where your idea completely falls apart. But it means that you do most of the work, which is fair since you're the one demanding the feature. It doesn't scale for me to spend a huge amount of time composing e-mails like this, which is why it's rare that I do that. You've tricked me into
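Ted's 20-30ms figure for a barrier is easy to check empirically. A quick sketch that measures the per-write cost of fsync() on whatever disk backs the temp directory (the number you get depends heavily on the device: rotating disk, SSD, or battery-backed cache):

```python
import os, tempfile, time

def time_fsync(n=10, size=1 << 20):
    """Return the average seconds per write+fsync of `size` bytes.

    Each fsync() must push the data through the drive's write cache
    (a barrier), which is what makes forced commits at rename time
    expensive compared to a buffered write.
    """
    fd, path = tempfile.mkstemp()
    buf = b"x" * size
    t0 = time.monotonic()
    for _ in range(n):
        os.write(fd, buf)
        os.fsync(fd)
    elapsed = time.monotonic() - t0
    os.close(fd)
    os.unlink(path)
    return elapsed / n
```

Comparing the result against the same loop without the fsync() call makes the cost of forcing a commit on every rename concrete.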
Re: Safe File Update (atomic)
On Wed, Jan 5, 2011 at 7:26 PM, Ted Ts'o ty...@mit.edu wrote:
> On Wed, Jan 05, 2011 at 12:55:22PM +0100, Olaf van der Spek wrote:
>> Let's consider the temp-file workaround, since a lot of existing apps use it. The request is to commit the source data before committing the rename. Seems quite simple.
>
> Currently ext4 is initiating writeback on the source file at the time of the rename. Given performance measurements others have cited (maybe it was you, I can't remember, and I don't feel like going through the literally hundreds of messages on this and related threads), it seems that btrfs is doing something similar. The problem with doing a full commit, which means surviving a power failure, is that you have to request a barrier operation to make sure the data goes all the way down to the disk platter --- and this is expensive (on the order of at least 20-30ms, more if you've written a lot to the disk).
>
> We have had experience with forcing data writeback (what you call "commit the source data") before the rename --- ext3 did that. And it had some very nasty performance problems which showed up on very busy systems where people were doing a lot of different things at the same time: large background writes from bittorrents and/or DVD ripping, compiles, web browsing, etc. If you force a large amount of data out when you do a commit, everything else that tries to write to the file system at that point stops, and if you have stupid programs (i.e., firefox trying to do database updates on its UI loop), it can cause programs to apparently lock up, and users get really upset.

I'm not sure why other IO would be affected. Isn't this equivalent to fsync on the source file? It almost sounds like you lock the entire FS during the data writeback, which shouldn't be necessary.

> So one of the questions is how much we should be penalizing programs that are doing things right (i.e., using fsync), versus programs which are doing things wrong (i.e., using rename and trusting to luck). This is a policy question, for which you might have a different opinion than I might have on the subject.
>
> We could also simply force a synchronous data writeback at rename time, instead of merely starting writeback at the point of the rename. In the case of a program which has already done an fsync(), the synchronous data writeback would be a no-op, so that's good in terms of not penalizing programs which do things right. But the problem there is that there could be some renames where forcing data writeback is not needed, and so we would be forcing the performance hit of the "commit the source data" even when it might not be needed (or wanted) by the user.
>
> How often does it happen that someone does a rename on top of an already-existing file, where the fsync() isn't wanted? Well, I can think up scenarios, such as where an existing .iso image is corrupted or needs to be updated, and so the user creates a new one and then renames it on top of the old .iso image, but then gets surprised when the rename ends up taking minutes to complete. Is that a common

Would this be an example of an atomic non-durable use case? ;) I thought those didn't exist?

> occurrence? Probably not, but the case of the system crashing right after the rename() is somewhat unusual as well.

Given the reports of empty files, not that unusual. The delay in this unusual case seems like a small price to pay.

> For this reason, I'm cautious about going overboard at forcing commits on renames; doing this has real performance implications, and it is a computer science truism that optimizing for the uncommon/failure case is a bad thing to do.

Performance is important, I agree. But you're trading performance for safety here. And on rename, you have to guess the user's intention: just a rename, or an atomic file update?

> OK, what about simply deferring the commit of the rename until the file writeback has naturally completed? The problem with that is entangled updates. Suppose there is another file which is written to the same directory block as the one affected by the rename, and *that* file is fsync()'ed? Keeping track of all of the data dependencies is **hard**. See: http://lwn.net/Articles/339337/

Ah. So performance isn't the problem, it's just hard to implement. Would've been a lot faster if you'd said that earlier. Instead, you require apps to use fsync, even if they don't need/want it, which introduces a performance hit. Wasn't there a big problem with fsync in ext3 anyway?

BTW, with O_ATOMIC, you could avoid the updates to directory blocks and would only have to track other updates to the same inode.

> But for me to list all possible approaches and tell you why each one is not going to work? You'll have to pay me
Re: Safe File Update (atomic)
On Wed, Jan 05, 2011 at 09:38:30PM +0100, Olaf van der Spek wrote:
> Performance is important, I agree. But you're trading performance for safety here.

... but if the safety is not needed, then you're paying for no good reason. And if the safety is needed, then use fsync().

>> OK, what about simply deferring the commit of the rename until the file writeback has naturally completed? The problem with that is entangled updates. Suppose there is another file which is written to the same directory block as the one affected by the rename, and *that* file is fsync()'ed? Keeping track of all of the data dependencies is **hard**. See: http://lwn.net/Articles/339337/
>
> Ah. So performance isn't the problem, it's just hard to implement. Would've been a lot faster if you'd said that earlier.

"Too hard to implement" doesn't go far enough. It's also a matter of near impossibility to add new features later. BSD FFS didn't get ACLs, extended attributes, and many other features until ***years*** after Linux had them. Complexity is evil; it leads to bugs, makes things hard to maintain, and it makes it harder to add new features later.

But hey, if you're so smart, you go ahead and implement them yourself. You can demonstrate how you can do it better than everyone else. Otherwise you're just wasting everybody's time. Complex ideas are not valid ones; or at least they certainly aren't good ones.

- Ted
Re: Safe File Update (atomic)
On Wed, Jan 5, 2011 at 10:37 PM, Ted Ts'o ty...@mit.edu wrote:
>> Ah. So performance isn't the problem, it's just hard to implement. Would've been a lot faster if you'd said that earlier.
>
> "Too hard to implement" doesn't go far enough. It's also a matter of near impossibility to add new features later. BSD FFS didn't get ACLs, extended attributes, and many other features until ***years*** after Linux had them. Complexity is evil; it leads to bugs, makes things hard to maintain, and it makes it harder to add new features later.

That was about soft updates. I'm not sure this is just as complex. I was thinking, doesn't ext have this kind of dependency tracking already? It has to write the inode after writing the data, otherwise the inode might point to garbage.

> But hey, if you're so smart, you go ahead and implement them yourself. You can demonstrate how you can do it better than everyone else. Otherwise you're just wasting everybody's time. Complex ideas are not valid ones; or at least they certainly aren't good ones.

Nobody said FSs are simple.

Olaf
Re: Safe File Update (atomic)
On Wed, Jan 05, 2011 at 10:47:03PM +0100, Olaf van der Spek wrote:
> That was about soft updates. I'm not sure this is just as complex.

Then I invite you to implement it, and start discovering all of the corner cases for yourself. :-) As I predicted, you're not going to believe me when I tell you it's too hard.

> I was thinking, doesn't ext have this kind of dependency tracking already? It has to write the inode after writing the data, otherwise the inode might point to garbage.

No, it doesn't. We use journaling, and forced data writeouts, to ensure consistency.

- Ted
Re: Safe File Update (atomic)
Ted Ts'o writes (Re: Safe File Update (atomic)):
> Then I invite you to implement it, and start discovering all of the corner cases for yourself. :-) As I predicted, you're not going to believe me when I tell you it's too hard.

How about you reimplement all of Unix userland first, so that it doesn't have what you apparently think is a bug!

Ian.
Re: Safe File Update (atomic)
On Thu, Jan 06, 2011 at 12:57:07AM +0000, Ian Jackson wrote:
> Ted Ts'o writes (Re: Safe File Update (atomic)):
>> Then I invite you to implement it, and start discovering all of the corner cases for yourself. :-) As I predicted, you're not going to believe me when I tell you it's too hard.
>
> How about you reimplement all of Unix userland first, so that it doesn't have what you apparently think is a bug!

I think you are forgetting the open source way, which is you scratch your own itch. The main programs I use where I'd care about this (e.g., emacs) got this right two decades ago; I even remember being around during the MIT Project Athena days, almost 25 years ago, when we needed to add error checking to the fsync() call because Transarc's AFS didn't actually try to send the file you were saving to the file server until the fsync() or the close() call, and so if you got an over-quota error, it was reflected back at fsync() time, and not at the write() system call, which was what emacs had been expecting and checking. (All of which is POSIX compliant, so the bug was clearly with emacs; it was fixed, and we moved on.)

If there was a program that I used and where I'd care about it, I'd scratch my own itch and fix it. Olaf seems to be really concerned about this theoretical use case, and if he cares so much, he can either stick with ext3, which has the property he wants purely by accident, but which has terrible performance problems under some circumstances as a result, or he can fix it in the programs that he cares about --- or he can try to create his own file system (and he can either impress us if he actually can solve it without disastrous performance problems, or he can be depressed when no one uses it because it is dog slow).

Note that all of the modern file systems (and all of the historical ones too, with the exception of ext3) have always had the same property. If you care about the data, you use fsync(). If you don't, then you can take advantage of the fact that compiles are really, really fast. (After all, in the very unlikely case that you crash, you can always rebuild, and why should you optimize for an unlikely case? And if you have crappy proprietary drivers that cause you to crash all the time, then maybe you should rethink using said proprietary drivers.)

That's the open source way --- you scratch your own itch. I'm perfectly satisfied with the open source tools that I use. Unless you think the programmers two decades ago were smarter, and people have gotten dumber since then (Are we not men? We are Devo!), it really isn't that hard to follow the rules.

- Ted
Re: Safe File Update (atomic)
On Mon, Jan 3, 2011 at 3:43 PM, Ted Ts'o ty...@mit.edu wrote:
> On Mon, Jan 03, 2011 at 12:26:29PM +0100, Olaf van der Spek wrote:
>> Given that the issue has come up before so often, I expected there to be a FAQ about it.
>
> Your asking the question over (and over... and over...) doesn't make it an FAQ. :-)

Hi Ted,

Why is it that you ignore all my responses to technical questions you asked?

Olaf
Re: Safe File Update (atomic)
On Wed, Jan 05, 2011 at 01:05:03AM +0100, Olaf van der Spek wrote:
> Why is it that you ignore all my responses to technical questions you asked?

In general, because they are either (a) not well-formed, or (b) you are asking me to prove a negative. Getting people to believe that you can't square a circle[1] is very hard, and when I was one of the postmasters at MIT, we'd get kooks every so often saying that they had a proof that they could square the circle, but everyone was being unfair and ignoring them, and could we please forward this to the head of MIT's math department with their amazing discovery. We learned a long time ago that it's not worth trying to argue with kooks like that. It's like trying to teach a pig to sing: it frustrates you, and it annoys the pig.

[1] http://en.wikipedia.org/wiki/Squaring_the_circle

If you give me a specific approach, I can tell you why it won't work, or why it won't be accepted by the kernel maintainers (for example, because it involves pouring far too much complexity into the kernel). But for me to list all possible approaches and tell you why each one is not going to work? You'll have to pay me before I'm willing to invest that kind of time.

Best regards,

- Ted
Re: Safe File Update (atomic)
On 02/01/11 17:37, Olaf van der Spek wrote:
> A userspace lib is fine with me. In fact, I've been asking for it
> multiple times. Result: no response.

Excuse me? You (well, Henrique, but you were CCed) said "how about a user space lib?" I said I'm working on one, will be ready about this weekend. I even gave a URL to watch (https://github.com/Shachar/safewrite). If you check it out right now, you will find there a fully implemented and fairly debugged user space solution, even including a build tool and man page.

BTW - feedback welcome.

Shachar
--
Shachar Shemesh
Lingnu Open Source Consulting Ltd.
http://www.lingnu.com
--
Archive: http://lists.debian.org/4d21a67b.50...@debian.org
Re: Safe File Update (atomic)
On Sun, Jan 2, 2011 at 6:14 PM, Henrique de Moraes Holschuh wrote:
>> Maybe I wasn't clear, in that case I'm sorry. To me, O_ATOMIC is
> Whether this should map to O_ATOMIC in glibc or be something new, I
> don't care. But if it is a flag, I'd highly suggest naming it
> O_CREATEUNLINKED or something else that won't give people wrong ideas,
> as _nothing_ but the final inode linking is atomic.

How does this solve the meta-data issue?

Olaf
--
Archive: http://lists.debian.org/aanlkti=vvjs5yh84dfq-oeumym7btjng+yu0pkk-r...@mail.gmail.com
Re: Safe File Update (atomic)
On Sun, Jan 2, 2011 at 7:55 PM, Adam Borowski kilob...@angband.pl wrote:
> Note that on the other side of the fence there's something called TxF

Not GA AFAIK.

> And what if you're changing one byte inside a 50 GB file? I see an
> easy implementation on btrfs/ocfs2 (assuming no other writers), but on
> ext3/4, that'd be tricky.

My proposal is explicitly only for complete file data updates.

>>> what should an application do as a fallback? And given that it is
>> Fallback could be implemented in the kernel or in userland. Using
>> rename as a fallback sounds reasonable. Implementations could switch
>> to O_ATOMIC when available. For large files, using reflink (currently
>> implemented as fs-specific ioctls) can ensure performance.
> It can give you anything but the abuse for preserving owner (ie, the
> topic of this thread). To get that, you'd need in-kernel support, but
> for example http://lwn.net/Articles/331808/ proposes an API which is
> just a thin wrapper over existing functionality in multiple
> filesystems. It basically duplicates an inode, preserving all current
> attributes but making any new writes CoWed. If you make the old one
> immutable, you get the TxF semantics (mandatory write lock); if you
> don't, you'll get the "one of the updates will win" data loss
> mentioned above.

Data loss? If you overwrite a file, losing the old contents isn't data loss.

>>> And what are the use cases where this really makes sense? Will people
>> Lots of files are written in one go. They could all use this
>> interface.
> I don't see how O_ATOMIC helps there. TxF transactions would work (all
> writes either succeed together or none does), but O_ATOMIC can't do
> more than one file.

I mean that each app that writes a file in one go could use the O_ATOMIC API. Extending O_ATOMIC to support multiple files seems simple too, by using a vector variant of close.

> Uhm, but you didn't answer the question. These two use cases Ted Tso
> mentioned are certainly not worth the complexity of in-kernel support,
> O_ATOMIC doesn't bring other goodies, and the rest can be done by a
> userspace library, which is indeed a good idea. Someone is working on
> such a lib; let's see the code complexity and exceptions it has.
>> Not true. I've asked (you) for just such a lib, but I'm still waiting
>> for an answer.
> Shachar Shemesh is already working on it; when he finishes, Ted Tso
> will point out what's wrong in it (if something is). What else do you
> need?

Don't know yet. Let's wait for that lib.

>> Why would anyone work on an implementation if there's no agreement
>> about it?
> Because one implementation after research is better than many naive
> and possibly wrong ones.

True, but it'd still be nice to have some agreement before doing all that hard work.

Olaf
--
Archive: http://lists.debian.org/aanlktinpqeqrybqtx4avr1aa=t2t04yo5++r0_ynd...@mail.gmail.com
Re: Safe File Update (atomic)
On Mon, Jan 3, 2011 at 4:25 AM, Ted Ts'o ty...@mit.edu wrote:
> On Sun, Jan 02, 2011 at 04:14:15PM +0100, Olaf van der Spek wrote:
>> Last time you ignored my response, but let's try again. The
>> implementation would be comparable to using a temp file, so there's
>> no need to keep 2 G in memory. Write the 2 G to disk, wait one day,
>> append the 1 K, fsync, update the inode.
> Write the 2G to disk *where*? Some random assigned blocks? And using

A random allocation strategy would work, but better options are available. ;)

> *what* to keep track of where to find all of the metadata blocks?

Implementation detail, but a new temp inode might work.

> That information is normally stored in the inode, but you don't want
> to touch it. So we need to store it someplace, and you haven't
> specified where. Some alternate universe? Another inode, which is only
> tied to that file descriptor? That's *possible*, but it's (a) not at
> all trivial, and (b) won't work for all file systems. It definitely
> won't work for FAT based file systems, so your blithe "oh, just
> emulate it in the kernel" is rather laughable.

You'd have to decide what you want to do in that case. One option is to fall back to the non-atomic variant. Another is to fall back to a temp file with a name. But then I assume the kernel is still able to preserve meta-data and to ensure atomic operation.

> If you think it's so easy, *you* go implement it.
>>> How exactly do the semantics for O_ATOMIC work? And given that at
>>> the moment ***zero*** file systems implement O_ATOMIC, what should
>>> an application do as a fallback? And given that it is
>> Fallback could be implemented in the kernel or in userland. Using
>> rename as a fallback sounds reasonable. Implementations could switch
>> to O_ATOMIC when available.
> Using rename as a fallback means exposing random temp file names into
> the directory. Which could conflict with files that the userspace
> might want to create.

They don't need to be in the same dir.

> It could be done, but again, it's an awful lot of complexity to shove
> into the kernel.

That's unfortunate, but I think the only option.

>>> highly unlikely this could ever be implemented for various file
>>> systems including NFS, I'll observe this won't really reduce
>>> application complexity, since you'll always need to have a fallback
>>> for file systems and kernels that don't support O_ATOMIC.
>> I don't see a reason why this couldn't be implemented by NFS.
> Try it; it should become obvious fairly quickly. Or just go read the
> NFS protocol specifications.

In that case: update the NFS protocol (yes, a long-term solution).

>> As you've said yourself, a lot of apps don't get this right. Why not?
>> Because the safe way is much more complex than the unsafe way. APIs
>> should be easy to use right and hard to misuse. With O_ATOMIC, I feel
>> this is the case. Without, it's the opposite and the consequences are
>> obvious. There shouldn't be a tradeoff between safety and potential
>> problems.
> Application programmers have in the past been unwilling to change
> their applications. Why not? If they are willing to change their
> applications, they can just as easily use a userspace library, or use
> fsync() and rename() properly. If they aren't willing to change their
> programs

Fsync, rename (and preserving meta-data) is a lot more complex than their current code, which for me would be a disadvantage. O_ATOMIC is a single flag that doesn't increase their code complexity. A new lib dependency is also a disadvantage.

> and recompile (and the last time we've been around this block, they
> weren't; they just blamed the file system), asking them to use
> O_ATOMIC probably won't work, given the portability issues.

If they're happy to blame the FS, they're probably also happy to #define O_ATOMIC 0 if O_ATOMIC isn't available.

>> Not true. I've asked (you) for just such a lib, but I'm still waiting
>> for an answer.
> Pay someone enough money, and they'll write you the library. Whining
> about it petulantly and expecting someone else to write it is probably
> not going to work.

Given that the issue has come up before so often, I expected there to be a FAQ about it. I didn't say you had to write such a lib; just saying you weren't aware of any existing lib would've been enough. But given that you have so much experience on this issue, pointing to a few apps that got this right shouldn't be so hard. Unless getting it right is currently impossible...

> Quite frankly, if you're competent enough to use it, you should be
> able to write such a library yourself. If you aren't going to be using
> it yourself, then why are you wasting everyone's time on this?

Because this is still a real-world problem that needs to be solved. Stopping this conversation isn't going to solve the problem.

Olaf
Re: Safe File Update (atomic)
Ted,

Thanks for the reply and detailed analysis.

> Which gets me back to the question of use cases. When are we going to
> be using this monster? For many use cases, where the original reason

Where implicit rollbacks are desirable, I suppose. It is incompatible with edit-in-place, anyway. Which asks for all the fsyncs on the link thing. Anyone that wants something different is welcome to do it the old way, IMHO. The first part (get an unlinked fd) is useful without fsyncs or any guarantees, for temp files.

> by the kernel. But if you make the system call synchronous, now
> there's no performance advantage over simply doing the fsync() and
> rename() in userspace. And if we do this using O_ATOMIC, or your

I understand this is far more about ease of use (read: more difficult to misuse) than much higher performance.

> 1) You care about data loss in the case of power failure, but not in
> the case of hard drive or storage failure, *AND* you are writing tons
> and tons of tiny 3-4 byte files and so you are worried about
> performance because you're doing something insane with large numbers
> of small files. That usage pattern cannot be made both safe and fast
> outside of a full-blown ACID database, so let's skip it.
>
> 2) You are specifically worried about the case where you are replacing
> the contents of a file that is owned by a different uid than the user
> doing the data file update, in a safe way where you don't want a
> partially written file to replace the old, complete file, *AND* you
> care about the file's ownership after the data update.

I am not sure about the file ownership, but this is the useful use case IMO.

> 3) You care about the temp file used by the userspace library, or
> application which is doing the "write temp file, fsync(), rename()"
> scheme, being automatically deleted in case of a system crash or a
> process getting sent an uncatchable signal and getting terminated.

This is always useful, as well.

> Is it worth it? I'd say no; and suggest that someone who really cares
> should create a userspace application helper library first, since
> you'll need it as a fallback for the cases listed above where this
> scheme won't work. (Even if you do the fallback in the kernel, you'll
> still need a userspace fallback for non-Linux systems, and for when
> the application is run on an older Linux kernel that doesn't have all
> of this O_ATOMIC or link/unlink magic.)

That's what I suggested, as well.

> The reality is we've lived without this capability in Unix and Linux
> systems for something like three decades. I suspect we can live

But not very well. And the usage patterns of *nix systems have changed in the last decade.

--
"One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
--
Archive: http://lists.debian.org/20110103114940.ga9...@khazad-dum.debian.net
Re: Safe File Update (atomic)
On Mon, Jan 03, 2011 at 09:49:40AM -0200, Henrique de Moraes Holschuh wrote:
>> 1) You care about data loss in the case of power failure, but not in
>> the case of hard drive or storage failure, *AND* you are writing tons
>> and tons of tiny 3-4 byte files and so you are worried about
>> performance because you're doing something insane with large numbers
>> of small files. That usage pattern cannot be made both safe and fast
>> outside of a full-blown ACID database, so let's skip it.
> Agreed.
>> 2) You are specifically worried about the case where you are
>> replacing the contents of a file that is owned by a different uid
>> than the user doing the data file update, in a safe way where you
>> don't want a partially written file to replace the old, complete
>> file, *AND* you care about the file's ownership after the data
>> update.
> I am not sure about the file ownership, but this is the useful use
> case IMO.

But if you don't care about file ownership, then you can do the "write a temp file, fsync, and rename" trick. If it's about ease of use, as you suggest, a userspace library solves that problem. It's *only* if you care about the file ownership remaining the same that (2) comes into play.

>> 3) You care about the temp file used by the userspace library, or
>> application which is doing the "write temp file, fsync(), rename()"
>> scheme, being automatically deleted in case of a system crash or a
>> process getting sent an uncatchable signal and getting terminated.
> This is always useful, as well.

...and (3) is the recovery after a power failure/crash scenario. If you don't care about the file ownership issue, then recovering after a power failure/crash is the last remaining case --- and you could solve this by creating a file with an mktemp-style name in a mode 1777 directory, where the contents of the file contain the temp file name to be deleted by an init.d script. This could be done in the userspace library, and if you crash after the rename, but before you have a chance to delete the file containing the temp-filename-to-be-deleted, that's not a problem, since the init.d script will find no file with that name to be deleted, and then continue.

Hence, all of these problems can be solved in userspace, with a userspace library, with the exception of the file ownership issue, which you've admitted may not be all that critical.

>> Is it worth it? I'd say no; and suggest that someone who really cares
>> should create a userspace application helper library first, since
>> you'll need it as a fallback for the cases listed above where this
>> scheme won't work.
> That's what I suggested, as well.

Then we're in agreement. :-)

- Ted
--
Archive: http://lists.debian.org/2011010319.gg11...@thunk.org
Re: Safe File Update (atomic)
On Mon, Jan 3, 2011 at 6:28 AM, Enrico Weigelt weig...@metux.de wrote:
> * Ted Ts'o ty...@mit.edu schrieb:
>> This is possible. It would be specific only to file systems that
>> support inodes (i.e., ix-nay for NFS, FAT, etc.).
> FAT supports inodes ?

ix-nay: no/except

Olaf
--
Archive: http://lists.debian.org/aanlktimc5atwf5j8mcrbkpw+kp4chxp2ey=xurplj...@mail.gmail.com
Re: Safe File Update (atomic)
On Mon, Jan 3, 2011 at 11:35 AM, Shachar Shemesh shac...@debian.org wrote:
> On 02/01/11 17:37, Olaf van der Spek wrote:
>> A userspace lib is fine with me. In fact, I've been asking for it
>> multiple times. Result: no response.
> Excuse me? You (well, Henrique, but you were CCed) said "how about a
> user space lib?" I

Yes, sorry, you are. It was aimed at the people at the Linux lists.

> said I'm working on one, will be ready about this weekend. I even gave
> a URL to watch (https://github.com/Shachar/safewrite). If you check it
> out right now, you will find there a fully implemented and fairly
> debugged user space solution, even including a build tool and man
> page.

I did look into it when I read the original post. Will look again and provide feedback.

Olaf
--
Archive: http://lists.debian.org/aanlktinqrmx3gqx0sg-9pqbg6snittab_4cehx5qc...@mail.gmail.com
Re: Safe File Update (atomic)
On Mon, Jan 03, 2011 at 12:26:29PM +0100, Olaf van der Spek wrote:
> Given that the issue has come up before so often, I expected there to
> be a FAQ about it.

Your asking the question over (and over... and over...) doesn't make it an FAQ. :-) Aside from your asking over and over, it hasn't come up that often, actually.

The right answer has been known for decades, and it is very simple: write a temp file, copy over the xattrs and ACLs if you care (in many cases, such as an application's private state files, it won't care, so it can skip this step --- it's only the more generic file editors that would need to worry about such things --- but when's the last time anyone has really worried about xattrs on a .c file?), fsync(), and rename(). This is *not* hard. People who get it wrong are just being lazy.

In the special case of dpkg, where they are writing a moderate number of large files, and they care about syncing the files without causing journal commits, the use of sync_file_range() on the files followed by a series of fdatasync() calls has solved their issues as far as I know.

- Ted
--
Archive: http://lists.debian.org/20110103144335.gd6...@thunk.org
Re: Safe File Update (atomic)
On Mon, Jan 3, 2011 at 3:43 PM, Ted Ts'o ty...@mit.edu wrote:
> On Mon, Jan 03, 2011 at 12:26:29PM +0100, Olaf van der Spek wrote:
>> Given that the issue has come up before so often, I expected there to
>> be a FAQ about it.
> Your asking the question over (and over... and over...) doesn't make
> it an FAQ. :-)

Haha, right. But file loss issues have come up before. Let's wait for the userspace lib.

> Aside from your asking over and over, it hasn't come up that often,
> actually. The right answer has been known for decades, and it is very
> simple: write a temp file, copy over the xattrs and ACLs if you care,
> fsync(), and rename(). This is *not* hard. People who get it wrong are
> just being lazy.

True, that's why right/safe-by-default would be nice to have.

> In the special case of dpkg, where they are writing a moderate number
> of large files, and they care about syncing the files without causing
> journal commits, the use of sync_file_range() on the files followed by
> a series of fdatasync() calls has solved their issues as far as I
> know.

Olaf
--
Archive: http://lists.debian.org/aanlktikfyvny9oaq9ren5wxztjrodmogfjuj7opz+...@mail.gmail.com
Re: Safe File Update (atomic)
Ted Ts'o tytso at mit.edu writes:
> actually. The right answer has been known for decades, and it is very
> simple: write a temp file, copy over the xattrs and ACLs if you care
> (in many cases, such as an application's private state files, it won't
> care, so it can skip this step --- it's only the more generic file
> editors that would need to worry about such things --- but when's the
> last time anyone has really worried about xattrs on a .c file?),
> fsync(), and rename(). This is *not* hard. People who get it wrong are
> just being lazy.

IMO calling a recipe containing fsync() "the right answer" is wrong. For the clear majority of programs, waiting for a disk-level write is not the correct semantics, and using fsync does cause real problems, the recent dpkg issues here being just one example. IMO telling people to use fsync does more harm than good; rather, we should be telling them to not use fsync unless they really know what they're doing.

In another post in this thread you also talk about how we've managed to live with the current functionality for three decades. We've managed to live, but what exactly is the practice we've lived with? I'd say an essential part of it has been the recipe of write temp file + rename, _without_ doing an fsync. Yes, it may not have been theoretically crash safe on all filesystems; but in practice, the practice that has allowed things to work for decades, the filesystems have either been safe or the machines stable enough for it to not become an issue. If this is no longer true, then that is a reason why things are now different from previous decades and why it's now necessary to add new functionality.
--
Archive: http://lists.debian.org/loom.20110103t195257-...@post.gmane.org
Re: Safe File Update (atomic)
On Sun, 02 Jan 2011, Ted Ts'o wrote:
> And of course, Olaf isn't actually offering to implement this
> hypothetical O_ATOMIC. Oh, no! He's just petulantly demanding it, even
> though he can't give us any concrete use cases where this would
> actually be a huge win over a userspace safe-write library that
> properly uses fsync() and rename().

Olaf, O_ATOMIC is difficult in the kernel sense and in the long run. It is an API that is too hard to implement in a sane way, with too many boundary conditions. OTOH, you don't need O_ATOMIC. You need a way for easy application access to a saner/simpler way to deal with files that require atomic replacement. Time to switch to a plan B that can achieve it. Do not lose track of your final goal, and stop wasting time with O_ATOMIC (and aggravating fs developers, which can only hurt your goal in the end).

Maybe there are ways to actually let the kernel detect usage patterns and do the right thing, but nobody found any that is complete (and the incomplete ones are implemented in ext3 and ext4, AFAIK).

If a userspace library is built to do all the dances required using only POSIX APIs (you can use extensions where they are available to enhance performance), you will have an EXACT list of boundary conditions and choke points. With that exact list of requirements in hand, and something that can be easily regression-tested, it gets a LOT easier to talk to any fs developer and to the glibc developers, and come up with the kernel and glibc enhancements needed to accelerate it (or remove boundary conditions) that are acceptable to both sides.

In the end, because POSIX _is_ crap in many ways, you will have some boundary conditions that cannot be removed or worked around. It is likely that they will not be such serious flaws that they make the whole idea unusable. Maybe they will apply only to some filesystems (just like right now there are some things you simply don't use NFSv3 for).

If you have other ideas that have no weird side-effects or troublesome semantics, I am sure you'd have a better chance of them happening. They're probably not going to take the form of open() flags, for the same reason O_ATOMIC has problems, but who knows. If I had a good idea about how to solve this problem, I'd have already written a paper about it or something.

Well, that's it. I have nothing else to contribute to this thread.

Henrique Holschuh
--
Archive: http://lists.debian.org/20110102125258.ga6...@khazad-dum.debian.net
Re: Safe File Update (atomic)
On Sun, Jan 2, 2011 at 8:09 AM, Ted Ts'o ty...@mit.edu wrote:
>> You could ask for a new (non-POSIX?) API that does not ask of a
>> POSIX-like filesystem something it cannot provide (i.e. don't ask for
>> something that requires inode-path reverse mappings). You could ask
>> for syscalls to copy inodes, etc. You could ask for whatever is
>> needed to do a (open+write+close) that is atomic if the target
>> already exists. Maybe one of those has a better chance than O_ATOMIC.
> The O_ATOMIC open flag is highly problematic, and it's not fully
> specified. What if the system is under a huge amount of memory
> pressure, and the badly behaved application program does:
>
>     fd = open(file, O_ATOMIC | O_TRUNC);
>     write(fd, buf, 2*1024*1024*1024); /* write 2 gigs, heh, heh heh */
>     ... sleep for one day ...
>     write(fd, buf2, 1024);
>     close(fd);

Last time you ignored my response, but let's try again. The implementation would be comparable to using a temp file, so there's no need to keep 2 G in memory. Write the 2 G to disk, wait one day, append the 1 K, fsync, update the inode.

> What happens if another program opens "file" for reading during the
> one day sleep period? Does it get the old contents of "file"?

Of course, according to the definition of atomic.

> The partially written, incomplete new version of "file"? What happens
> if the file is currently mmap'ed, as Henrique has asked?

Didn't I respond to that too? Again, the old file.

> What if another program opens the file O_ATOMIC during the one day
> sleep period, so the file is in the middle of getting updated by two
> different processes using O_ATOMIC?

Again equivalent to using the rename trick. One of the updates will win, and since they don't depend on the old contents there are no troubles.

> How exactly do the semantics for O_ATOMIC work? And given that at the
> moment ***zero*** file systems implement O_ATOMIC, what should an
> application do as a fallback? And given that it is

Fallback could be implemented in the kernel or in userland. Using rename as a fallback sounds reasonable. Implementations could switch to O_ATOMIC when available.

> highly unlikely this could ever be implemented for various file
> systems including NFS, I'll observe this won't really reduce
> application complexity, since you'll always need to have a fallback
> for file systems and kernels that don't support O_ATOMIC.

I don't see a reason why this couldn't be implemented by NFS.

> And what are the use cases where this really makes sense? Will people

Lots of files are written in one go. They could all use this interface.

> really code to this interface, knowing that it only works on Linux
> (there are other operating systems out there, like FreeBSD and

FreeBSD, Solaris and AIX probably also care about file consistency. Discussing this proposal with them would be a good idea.

> Solaris and AIX, you know, and some application programmers _do_ care
> about portability), and the only benefits are (a) a marginal
> performance boost for insane people who like to write vast numbers of
> 2-4 byte files without any need for atomic updates across a large
> number of these small files, and (b) the ability to keep the file
> owner unchanged when someone other than the owner updates said file
> (how important is this _really_; what is the use case where this
> really matters?).

As you've said yourself, a lot of apps don't get this right. Why not? Because the safe way is much more complex than the unsafe way. APIs should be easy to use right and hard to misuse. With O_ATOMIC, I feel this is the case. Without, it's the opposite, and the consequences are obvious. There shouldn't be a tradeoff between safety and potential problems. O_ATOMIC is merely a proposed way to solve this problem. I've asked (you) for a concrete code example to do it without O_ATOMIC support, but nobody has been able to provide one yet.

> And of course, Olaf isn't actually offering to implement this
> hypothetical O_ATOMIC. Oh, no! He's just petulantly demanding it, even
> though he can't give us any concrete use cases where this would
> actually be a huge win over a userspace safe-write library that
> properly uses fsync() and rename().

Not true. I've asked (you) for just such a lib, but I'm still waiting for an answer. Why would anyone work on an implementation if there's no agreement about it?

Olaf
--
Archive: http://lists.debian.org/aanlktiml0-7go=rfyt+wtonjeinqh9zqo5rpnf3c8...@mail.gmail.com
Re: Safe File Update (atomic)
On Sun, Jan 2, 2011 at 1:52 PM, Henrique de Moraes Holschuh h...@debian.org wrote:
> Olaf, O_ATOMIC is difficult in the kernel sense and in the long run.
> It is an API that is too hard to implement in a sane way, with too
> many boundary conditions. OTOH, you don't need O_ATOMIC. You need a
> way for easy application access to a saner/simpler way to deal with
> files that require atomic replacement. Time to switch to a plan B that
> can achieve it. Do not lose track of your final goal, and stop wasting
> time with O_ATOMIC (and aggravating fs developers, which can only hurt
> your goal in the end).

Maybe I wasn't clear, in that case I'm sorry. To me, O_ATOMIC is mostly about the userspace API. The implementation isn't (that) important, so you're right.

> Maybe there are ways to actually let the kernel detect usage patterns
> and do the right thing, but nobody found any that is complete (and the
> incomplete ones are implemented in ext3 and ext4, AFAIK). If a
> userspace library is built to do all the dances required using only
> POSIX APIs (you can use extensions where they are available to enhance
> performance) you will have an EXACT list of boundary conditions and
> choke points.

A userspace lib is fine with me. In fact, I've been asking for it multiple times. Result: no response.

Olaf
--
Archive: http://lists.debian.org/aanlktikmkh=bxzwybhqdn_geog=qkkbs-efkxgex3...@mail.gmail.com
Re: Safe File Update (atomic)
On Sun, 02 Jan 2011, Olaf van der Spek wrote: On Sun, Jan 2, 2011 at 1:52 PM, Henrique de Moraes Holschuh h...@debian.org wrote: Olaf, O_ATOMIC is difficult in the kernel sense and in the long run. It is an API that is too hard to implement in a sane way, with too many boundary conditions. OTOH, you don't need O_ATOMIC. You need a way for easy application access to a saner/simpler way to deal with files that require atomic replacement. Time to switch to a plan B that can achieve it. Do not lose track of your final goal, and stop wasting time with O_ATOMIC (and aggravating fs developers, which can only hurt your goal in the end). Maybe I wasn't clear, in that case I'm sorry. To me, O_ATOMIC is mostly about the userspace API. The implementation isn't (that) important, so you're right. Ok. Here is one meta-API that could be useful (and yes, it is likely mostly exactly what you call O_ATOMIC. Whatever, my body is at 38.4°C right now and the ferver is still climbing, so I don't even claim perfect sanity at the moment. Ted, if I could impose on you a single question, please either reply with a short no, already explained why the idea below is bogus elsewhere, no, new idea but wouldn't work because of a,b,c, no, but I don't care to explain why right now, and yes, could work depending on the details. I won't pester you about it. 1. Create unlinked file fd (benefits from kernel support, but doesn't require it). If a filesystem cannot support this or the boundary conditions are unaceptable, fail. Needs to know the destination name to do the unliked create on the right fs and directory (otherwise attempts to link the file later would have to fail if the fs is different). 2. fd works as any normal fd to an unlinked regular file. 3. create a link() that can do unlink+link atomically. Maybe this already exists, otherwise needs kernel support. The behaviour of (3) should allow synchrous wait of a fsync() and a sync of the metadata of the parent dir. 
It doesn't matter much if it does everything, or just calling fsync(), or creating a fclose() variant that does it. Whether this should map to O_ATOMIC in glibc or be something new, I don't care. But if it is a flag, I'd highly suggest naming it O_CREATEUNLINKED or something else that won't give people wrong ideas, as _nothing_ but the final inode linking is atomic. This will work for other uses, too. It is a safe and easy way to create temporary files for IPC, etc. Or not, maybe it is completely broken and I should not write while in a fever. A userspace lib is fine with me. In fact, I've been asking for it multiple times. Result: no response. You will need to actually find someone who wants to write such a lib, or pay someone to, or fire up a public funds campaign and contract it from someone the community would trust to actually be able to complete the job, etc. -- One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie. -- The Silicon Valley Tarot Henrique Holschuh -- To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org Archive: http://lists.debian.org/20110102171441.ga6...@khazad-dum.debian.net
Re: Safe File Update (atomic)
On Sun, Jan 02, 2011 at 04:14:15PM +0100, Olaf van der Spek wrote: On Sun, Jan 2, 2011 at 8:09 AM, Ted Ts'o ty...@mit.edu wrote: The O_ATOMIC open flag is highly problematic, and it's not fully specified. Note that on the other side of the fence there's something called TxF (Transactional NTFS). I don't know how fast or reliable it is, but browsing the docs shows some interesting things. In particular, it is not limited to a single file but can handle any number of changes to the filesystem. What if the system is under a huge amount of memory pressure, and the badly behaved application program does:
fd = open(file, O_ATOMIC | O_TRUNC);
write(fd, buf, 2*1024*1024*1024); // write 2 gigs, heh, heh heh
sleep for one day
write(fd, buf2, 1024);
close(fd);
Last time you ignored my response, but let's try again. The implementation would be comparable to using a temp file, so there's no need to keep 2 GB in memory. Write the 2 GB to disk, wait one day, append the 1 KB, fsync, update the inode. And what if you're changing one byte inside a 50 GB file? I see an easy implementation on btrfs/ocfs2 (assuming no other writers), but on ext3/4, that'd be tricky. What if another program opens the file O_ATOMIC during the one day sleep period, so the file is in the middle of getting updated by two different processes using O_ATOMIC? Again equivalent to using the rename trick. One of the updates will win, and since they don't depend on the old contents there are no troubles. On NTFS, an attempt to open a file for writing twice fails if at least one of the two writers uses TxF. This goes contrary to the usual Unix semantics (where you can always open the file for writing) but it is how SQL databases work. NTFS has bad lock granularity (the whole file rather than a row, page or a byte range), but is straightforward. How exactly do the semantics for O_ATOMIC work? And given that at the moment ***zero*** file systems implement O_ATOMIC, I'd count TxF as an implementation.
what should an application do as a fallback? And given that it is Fallback could be implemented in the kernel or in userland. Using rename as a fallback sounds reasonable. Implementations could switch to O_ATOMIC when available. For large files, using reflink (currently implemented as fs-specific ioctls) can preserve performance. It can give you anything but the abuse for preserving owner (i.e., the topic of this thread). To get that, you'd need in-kernel support, but for example http://lwn.net/Articles/331808/ proposes an API which is just a thin wrapper over existing functionality in multiple filesystems. It basically duplicates an inode, preserving all current attributes but making any new writes CoWed. If you make the old one immutable, you get the TxF semantics (mandatory write lock); if you don't, you'll get the "one of the updates will win" data loss mentioned above. highly unlikely this could ever be implemented for various file systems including NFS, I'll observe this won't really reduce application complexity, since you'll always need to have a fallback for file systems and kernels that don't support O_ATOMIC. I don't see a reason why this couldn't be implemented by NFS. Not sure how extensible NFS is, but it's just a matter of passing these calls over the network to the underlying filesystem. I.e., the problem can be divided into doing this locally (see above) and extending NFS. And what are the use cases where this really makes sense? Will people Lots of files are written in one go. They could all use this interface. I don't see how O_ATOMIC helps there. TxF transactions would work (all writes either succeed together or none does), but O_ATOMIC can't do more than one file.
the only benefits are (a) a marginal performance boost for insane people who like to write vast numbers of 2-4 byte files without any need for atomic updates across a large number of these small files, and (b) the ability to keep the file owner unchanged when someone other than the owner updates said file (how important is this _really_; what is the use case where this really matters?). As you've said yourself, a lot of apps don't get this right. Why not? Because the safe way is much more complex than the unsafe way. APIs should be easy to use right and hard to misuse. With O_ATOMIC, I feel this is the case. Without, it's the opposite and the consequences are obvious. There shouldn't be a tradeoff between safety and potential problems. Uhm, but you didn't answer the question. These two use cases Ted Tso mentioned are certainly not worth the complexity of in-kernel support, O_ATOMIC doesn't bring other goodies, and the rest can be done by a userspace library, which is indeed a good idea. O_ATOMIC is merely a proposed way to solve this problem. I've asked (you) for a concrete code example to do it without O_ATOMIC support, but nobody has been able to
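The copy-on-write update Olaf sketches for the 50 GB case can be tried today through the reflink ioctl (FICLONE, the generic descendant of the btrfs-specific clone ioctl of that era). A hedged sketch, with a plain byte-copy fallback for filesystems without reflink support; the function name and buffer size are choices of this sketch:

```python
import fcntl
import os
import tempfile

FICLONE = 0x40049409  # _IOW(0x94, 9, int); the ioctl behind reflink

def patch_file_cow(path, offset, patch):
    """Reflink the file to a temp name, patch the clone, rename it back."""
    d = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=d)
    try:
        with open(path, "rb") as src:
            try:
                # O(1) on btrfs/XFS-with-reflink: the clone shares extents.
                fcntl.ioctl(fd, FICLONE, src.fileno())
            except OSError:
                # EOPNOTSUPP and friends (ext3/4, tmpfs, ...): full copy.
                while True:
                    buf = src.read(1 << 20)
                    if not buf:
                        break
                    os.write(fd, buf)
        os.pwrite(fd, patch, offset)  # modify only the bytes that changed
        os.fsync(fd)
        os.rename(tmp, path)          # atomic swap; old readers keep old data
    finally:
        os.close(fd)
```

On a reflink-capable filesystem only the patched blocks are new on disk, which is the point of the CoW approach for huge files; everywhere else this degrades to the (slow) temp-copy-and-rename scheme.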
Re: Safe File Update (atomic)
On Sun, Jan 02, 2011 at 04:14:15PM +0100, Olaf van der Spek wrote: Last time you ignored my response, but let's try again. The implementation would be comparable to using a temp file, so there's no need to keep 2 GB in memory. Write the 2 GB to disk, wait one day, append the 1 KB, fsync, update the inode. Write the 2 GB to disk *where*? Some random assigned blocks? And using *what* to keep track of where to find all of the metadata blocks? That information is normally stored in the inode, but you don't want to touch it. So we need to store it someplace, and you haven't specified where. Some alternate universe? Another inode, which is only tied to that file descriptor? That's *possible*, but it's (a) not at all trivial, and (b) won't work for all file systems. It definitely won't work for FAT-based file systems, so your blithe "oh, just emulate it in the kernel" is rather laughable. If you think it's so easy, *you* go implement it. How exactly do the semantics for O_ATOMIC work? And given that at the moment ***zero*** file systems implement O_ATOMIC, what should an application do as a fallback? And given that it is Fallback could be implemented in the kernel or in userland. Using rename as a fallback sounds reasonable. Implementations could switch to O_ATOMIC when available. Using rename as a fallback means exposing random temp file names into the directory. Which could conflict with files that the userspace might want to create. It could be done, but again, it's an awful lot of complexity to shove into the kernel. highly unlikely this could ever be implemented for various file systems including NFS, I'll observe this won't really reduce application complexity, since you'll always need to have a fallback for file systems and kernels that don't support O_ATOMIC. I don't see a reason why this couldn't be implemented by NFS. Try it; it should become obvious fairly quickly. Or just go read the NFS protocol specifications. As you've said yourself, a lot of apps don't get this right.
Why not? Because the safe way is much more complex than the unsafe way. APIs should be easy to use right and hard to misuse. With O_ATOMIC, I feel this is the case. Without, it's the opposite and the consequences are obvious. There shouldn't be a tradeoff between safety and potential problems. Application programmers have in the past been unwilling to change their applications. If they are willing to change their applications, they can just as easily use a userspace library, or use fsync() and rename() properly. If they aren't willing to change their programs and recompile (and the last time we've been around this block, they weren't; they just blamed the file system), asking them to use O_ATOMIC probably won't work, given the portability issues. And of course, Olaf isn't actually offering to implement this hypothetical O_ATOMIC. Oh, no! He's just petulantly demanding it, even though he can't give us any concrete use cases where this would actually be a huge win over a userspace safe-write library that properly uses fsync() and rename(). Not true. I've asked (you) for just such a lib, but I'm still waiting for an answer. Pay someone enough money, and they'll write you the library. Whining about it petulantly and expecting someone else to write it is probably not going to work. Quite frankly, if you're competent enough to use it, you should be able to write such a library yourself. If you aren't going to be using it yourself, then why are you wasting everyone's time on this? - Ted Archive: http://lists.debian.org/20110103032549.gc11...@thunk.org
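The "properly uses fsync() and rename()" recipe Ted keeps pointing at, as a sketch. The directory fsync at the end is what makes the rename itself durable; note this is exactly the variant that creates a new inode, i.e. owner and xattrs are not preserved, which is the regression the thread is arguing about:

```python
import os
import tempfile

def safe_write(path, data):
    """Userspace safe-write: temp file, fsync, rename, fsync the directory."""
    d = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=d)  # same fs as target, so rename is atomic
    try:
        os.write(fd, data)
        os.fsync(fd)                   # 1. file contents are durable
    finally:
        os.close(fd)
    os.rename(tmp, path)               # 2. readers see old or new, never a mix
    dfd = os.open(d, os.O_DIRECTORY)
    try:
        os.fsync(dfd)                  # 3. the new directory entry is durable
    finally:
        os.close(dfd)
```

This is more or less what the hypothetical userspace library would wrap; a production version would also handle partial writes, error cleanup of the temp file, and the attribute copying discussed later in the thread.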
Re: Safe File Update (atomic)
On Sun, Jan 02, 2011 at 03:14:41PM -0200, Henrique de Moraes Holschuh wrote: 1. Create unlinked file fd (benefits from kernel support, but doesn't require it). If a filesystem cannot support this or the boundary conditions are unacceptable, fail. Needs to know the destination name to do the unlinked create on the right fs and directory (otherwise attempts to link the file later would have to fail if the fs is different). This is possible. It would be specific only to file systems that support inodes (i.e., ix-nay for NFS, FAT, etc.). Some file systems would want to know a likely directory where the file would be linked so their inode and block allocation policies can optimize the inode and block placement. 2. fd works as any normal fd to an unlinked regular file. 3. create a link() that can do unlink+link atomically. Maybe this already exists, otherwise needs kernel support. The behaviour of (3) should allow synchronous wait on an fsync() and a sync of the metadata of the parent dir. It doesn't matter much if it does everything, or just calling fsync(), or creating a fclose() variant that does it. OK, so this is where things get tricky. The first is that you are asking for the ability to take a file descriptor and link it into some directory. The inode associated with the fd might or might not be already linked to some other directory, and it might or might not be owned by the user trying to do the link. The latter could get problematical if quota is enabled, since it does open up a new potential security exposure. A user might pass a file descriptor to another process in a different security domain, and that process could create a link to some directory which the original user doesn't have access to. The user would no longer be able to delete the file and drop quota, and the process would retain permanent access to the file, which it might not otherwise have if the inode was protected by a parent directory's permissions.
It's for the same reason that we can't just implement open-by-inode-number; even if you use the inode's permissions and ACLs to do an access check, this allows someone to bypass security controls based on the containing directory's permissions. It might not be a security exposure, but for some scenarios (i.e., a mode 600 ~/Private directory that contains world-readable files), it changes accessibility of some files. We could control for this by only allowing the link to happen if the user executing this new system call owns the inode being linked, so this particular problem is addressable. The larger problem is this doesn't give you any performance benefits over simply creating a temporary file, fsync'ing it, and then doing the rename. And it doesn't solve the problem that userspace is responsible for copying over the extended attributes and ACL information. So in exchange for doing something non-portable which is Linux-specific, and won't work on FAT, NFS, and other non-inode-based file systems at all, and which requires special file-system modifications for inode-based file systems --- the only real benefit you get is that the temp file gets cleaned up automatically if you crash before the new magical link/unlink system call is completed. Is it worth it? I'm not at all convinced. Can this be fixed? Well, I suppose we could have this magical link/unlink system call also magically copy over the xattrs and ACLs. And if you don't care about when things happen, you could have the kernel fork off a kernel thread, which does the fsync, followed by the magic ACL and xattr copying, and once all of this completes, it could do the magic link/unlink. So we could bundle all of this into a system call. *Theoretically*. But then someone else will say that they want to know when this magic link/unlink system call actually completes.
Others might say that they don't care about the fsync happening right away, but would rather wait some arbitrary time, and let the system writeback algorithms write back the file *whenever*, and only when the file is finally written back should the rest of the magical link/unlink happen. So now we have an explosion of complexity, with all sorts of different variants. And there's also the problem where if you don't make the system call synchronous (where it does an fsync() and waits for it to complete), you'll lose the ability to report errors back to userspace. Which gets me back to the question of use cases. When are we going to be using this monster? For many use cases (the ones where we originally said people were doing it wrong), the risk was losing data. But if you don't do things synchronously, with an fsync(), you'll also end up risking losing data because you won't know about write failures --- specifically, your program may have long exited by the time the write failure is noticed by the kernel. But if you make the system call synchronous, now there's no
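The xattr and ACL copying that Ted says lands on userspace can be sketched with the Linux xattr syscalls (POSIX ACLs travel as the system.posix_acl_access / system.posix_acl_default attributes). Copying system.* names generally needs privilege, so this sketch (the skip-and-report behaviour is its own choice, not a standard API) copies what it can and returns what it skipped:

```python
import os

def copy_xattrs(src, dst):
    """Best-effort copy of extended attributes from src to dst paths."""
    skipped = []
    try:
        names = os.listxattr(src)
    except OSError:              # filesystem without xattr support at all
        return skipped
    for name in names:
        try:
            os.setxattr(dst, name, os.getxattr(src, name))
        except OSError:          # e.g. EPERM on system.*, or ENOTSUP
            skipped.append(name)
    return skipped
```

This is the kind of chore that makes the write-temp-file-and-rename scheme more involved than it first looks, and part of why the thread keeps circling back to a shared userspace library.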
Re: Safe File Update (atomic)
* Ted Ts'o ty...@mit.edu schrieb: This is possible. It would be specific only to file systems that support inodes (i.e., ix-nay for NFS, FAT, etc.). FAT supports inodes? IIRC it puts all file information (including attributes and first data block) directly into the dirent ... Some file systems would want to know a likely directory where the file would be linked so their inode and block allocation policies can optimize the inode and block placement. Interesting. Do you know of some which do that, and maybe some studies on whether that's worth it? A user might pass a file descriptor to another process in a different security domain, and that process could create a link to some directory which the original user doesn't have access to. The user would no longer be able to delete the file and drop quota, and the process would retain permanent access to the file, which it might not otherwise have if the inode was protected by a parent directory's permissions. Just curious: does the fd passing duplicate the fd or pass it as-is? (so multiple processes have access to the same fd instance instead of just the same inode?) 1) You care about data loss in the case of power failure, but not in the case of hard drive or storage failure, *AND* you are writing tons and tons of tiny 3-4 byte files and so you are worried about performance because you're doing something insane with a large number of small files. I'd be careful w/ declaring use of tons of small files insane. Sure, this might call for a database, but hierarchical filesystems also might be a good interface for hierarchical key-value lists. (for this, a read-at-once/write-at-once syscall would be nice ;-)). 3) You care about the temp file used by the userspace library, or application which is doing the write temp file, fsync(), rename() scheme, being automatically deleted in case of a system crash or a process getting sent an uncatchable signal and getting terminated.
Indeed, an automatic garbage collection for temp files would be nice. But that also could be done by a new flag which tells the kernel to automatically remove those files when the holding process terminates. But this information would also have to be permanently recorded somewhere, so the gc can still clean them up after a hard reboot. cu -- -- Enrico Weigelt, metux IT service -- http://www.metux.de/ phone: +49 36207 519931 email: weig...@metux.de mobile: +49 151 27565287 icq: 210169427 skype: nekrad666 -- Embedded-Linux / Portierung / Opensource-QM / Verteilte Systeme -- Archive: http://lists.debian.org/20110103052826.ga14...@nibiru.local
Re: Safe File Update (atomic)
On Fri, Dec 31, 2010 at 5:08 PM, Enrico Weigelt weig...@metux.de wrote: Not true. Renaming a running executable works just fine, for example. Well, it has been quite a while since I last used Windows, but IIRC renaming a running executable was denied. Maybe on FAT. However, that's OT. Why not design a new (overlaying) filesystem for that? Increased complexity, lower performance, little benefit. Why that? Currently applications (try to) implement that all on their own, which needs great efforts for multiprocess synchronization. Having that in a little fileserver eases this synchronization and moves the complexity to a single point. I mean compared to implementing it properly in the kernel. Doing it in the kernel would be fine (maybe DLM could be used here), What's DLM? but would be a nonportable solution for quite a long time ;-o Since it's the only proper solution I don't think that's a problem. Olaf Archive: http://lists.debian.org/aanlktinzgo=u85r4mjaxuslkzdha7_yhbfz2ylfu7...@mail.gmail.com
Re: Safe File Update (atomic)
Olaf van der Spek olafvds...@gmail.com (01/01/2011): Doing it in the kernel would be fine (maybe DLM could be used here), What's DLM? CONFIG_DLM. KiBi.
Re: Safe File Update (atomic)
On Sat, Jan 1, 2011 at 7:13 PM, Cyril Brulebois k...@debian.org wrote: Olaf van der Spek olafvds...@gmail.com (01/01/2011): Doing it in the kernel would be fine (maybe DLM could be used here), What's DLM? CONFIG_DLM. DLM seems independent of atomic updates. Olaf Archive: http://lists.debian.org/aanlktim5=w67itcnx7fkrmsuddstt=05zggbuwq32...@mail.gmail.com
Re: Safe File Update (atomic)
* Olaf van der Spek olafvds...@gmail.com schrieb: Doing it in the kernel would be fine (maybe DLM could be used here), What's DLM? Distributed lock manager. but would be a nonportable solution for quite a long time ;-o Since it's the only proper solution I don't think that's a problem. I doubt that's the only proper solution. As said, a (userland) filesystem could also do fine. cu Archive: http://lists.debian.org/20110101230343.gd10...@nibiru.local
Re: Safe File Update (atomic)
On Sun, Jan 2, 2011 at 12:03 AM, Enrico Weigelt weig...@metux.de wrote: I doubt that's the only proper solution. As said, a (userland) filesystem could also do fine. Do you think distros like Debian would install such a setup by default? Olaf Archive: http://lists.debian.org/aanlktinsexfkrcdzf1d1ia44z=logx0cic46z39vf...@mail.gmail.com
Re: Safe File Update (atomic)
On Fri, Dec 31, 2010 at 09:51:50AM -0200, Henrique de Moraes Holschuh wrote: On Fri, 31 Dec 2010, Olaf van der Spek wrote: Ah, hehe. BTW, care to respond to the mail I sent to you? There is nothing more I can add to this thread. You want O_ATOMIC. It cannot be implemented for all use cases of the POSIX API, so it will not be implemented by the kernel. That's all there is to it, AFAIK. You could ask for a new (non-POSIX?) API that does not ask of a POSIX-like filesystem something it cannot provide (i.e. don't ask for something that requires inode-path reverse mappings). You could ask for syscalls to copy inodes, etc. You could ask for whatever is needed to do a (open+write+close) that is atomic if the target already exists. Maybe one of those has a better chance than O_ATOMIC. The O_ATOMIC open flag is highly problematic, and it's not fully specified. What if the system is under a huge amount of memory pressure, and the badly behaved application program does:
fd = open(file, O_ATOMIC | O_TRUNC);
write(fd, buf, 2*1024*1024*1024); // write 2 gigs, heh, heh heh
sleep for one day
write(fd, buf2, 1024);
close(fd);
What happens if another program opens the file for reading during the one day sleep period? Does it get the old contents of the file? The partially written, incomplete new version of the file? What happens if the file is currently mmap'ed, as Henrique has asked? What if another program opens the file O_ATOMIC during the one day sleep period, so the file is in the middle of getting updated by two different processes using O_ATOMIC? How exactly do the semantics for O_ATOMIC work? And given that at the moment ***zero*** file systems implement O_ATOMIC, what should an application do as a fallback? And given that it is highly unlikely this could ever be implemented for various file systems including NFS, I'll observe this won't really reduce application complexity, since you'll always need to have a fallback for file systems and kernels that don't support O_ATOMIC.
And what are the use cases where this really makes sense? Will people really code to this interface, knowing that it only works on Linux (there are other operating systems out there, like FreeBSD and Solaris and AIX, you know, and some application programmers _do_ care about portability), and the only benefits are (a) a marginal performance boost for insane people who like to write vast numbers of 2-4 byte files without any need for atomic updates across a large number of these small files, and (b) the ability to keep the file owner unchanged when someone other than the owner updates said file (how important is this _really_; what is the use case where this really matters?). And of course, Olaf isn't actually offering to implement this hypothetical O_ATOMIC. Oh, no! He's just petulantly demanding it, even though he can't give us any concrete use cases where this would actually be a huge win over a userspace safe-write library that properly uses fsync() and rename(). If someone were to pay me a huge amount of money, and told me what was the file size range where such a thing would be used, and what sort of application would need it, and what kind of update frequency it should be optimized for, and other semantic details about parallel O_ATOMIC updates, what happens to users who are in the middle of reading the file, what are the implications for quota, etc., it's certainly something I can entertain. But at the moment, it's a vague specification (not even a solution) looking for a problem. - Ted Archive: http://lists.debian.org/20110102070922.ga6...@thunk.org
Re: Safe File Update (atomic)
On Fri, Dec 31, 2010 at 3:17 AM, Henrique de Moraes Holschuh h...@debian.org wrote: On Thu, 30 Dec 2010, Henrique de Moraes Holschuh wrote: BTW: safely removing a file is also tricky. AFAIK, one must open it RW, in exclusive mode. stat it by fd and check whether it is what one expects (regular file, ownership). unlink it by fd. close the fd. Eh, as it was pointed out to me by private mail, this is obviously a load of crap :p There is no unlink by fd. Sorry about that. The attacks here are races by messing with intermediate path components, which are either not worth bothering with, or have to be avoided in a much more convoluted manner. Ah, hehe. BTW, care to respond to the mail I sent to you? Archive: http://lists.debian.org/aanlktinnyxtf2czhkfrmkw_gpp39h5uqu2j8oz1cs...@mail.gmail.com
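For the record, the closest one can get without the nonexistent unlink-by-fd is the check-and-recheck dance below: open without following symlinks, verify what was opened, then confirm the name still points at the same inode before unlinking. A sketch that narrows (but, as Henrique notes, cannot fully close) the race; the helper name is invented here:

```python
import os
import stat

def unlink_checked(path):
    """Remove a regular file, refusing symlinks and detecting path swaps."""
    fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)  # ELOOP if it's a symlink
    try:
        st = os.fstat(fd)
        if not stat.S_ISREG(st.st_mode):
            raise OSError(f"{path} is not a regular file")
        now = os.lstat(path)
        if (now.st_dev, now.st_ino) != (st.st_dev, st.st_ino):
            raise OSError(f"{path} was replaced underneath us")
        os.unlink(path)      # still by path: this is the unavoidable race window
    finally:
        os.close(fd)
```

Pinning a directory fd and unlinking relative to it (os.unlink(name, dir_fd=...)) additionally protects against the intermediate-path-component games mentioned above, at the cost of more plumbing.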
Re: Safe File Update (atomic)
On Fri, 31 Dec 2010, Olaf van der Spek wrote: Ah, hehe. BTW, care to respond to the mail I sent to you? There is nothing more I can add to this thread. You want O_ATOMIC. It cannot be implemented for all use cases of the POSIX API, so it will not be implemented by the kernel. That's all there is to it, AFAIK. You could ask for a new (non-POSIX?) API that does not ask of a POSIX-like filesystem something it cannot provide (i.e. don't ask for something that requires inode-path reverse mappings). You could ask for syscalls to copy inodes, etc. You could ask for whatever is needed to do a (open+write+close) that is atomic if the target already exists. Maybe one of those has a better chance than O_ATOMIC. It is up to you and the fs developers to find some common ground. Archive: http://lists.debian.org/20101231115150.gb31...@khazad-dum.debian.net
Re: Safe File Update (atomic)
On Fri, Dec 31, 2010 at 12:51 PM, Henrique de Moraes Holschuh h...@debian.org wrote: On Fri, 31 Dec 2010, Olaf van der Spek wrote: Ah, hehe. BTW, care to respond to the mail I sent to you? There is nothing more I can add to this thread. You want O_ATOMIC. It That's a shame. I thought I provided pretty concrete answers. cannot be implemented for all use cases of the POSIX API, so it will not be implemented by the kernel. That's all there is to it, AFAIK. You could ask for a new (non-POSIX?) API that does not ask of a POSIX-like filesystem something it cannot provide (i.e. don't ask for What's the definition of a POSIX-like FS? something that requires inode-path reverse mappings). You could ask for syscalls to copy inodes, etc. You could ask for whatever is needed To me, inodes are an implementation detail that shouldn't be exposed. to do a (open+write+close) that is atomic if the target already exists. Maybe one of those has a better chance than O_ATOMIC. It is up to you and the fs developers to find some common ground. The FS devs are happy with all the regressions of the workaround, so they're unlikely to do anything. Olaf Archive: http://lists.debian.org/aanlktikw9372od-eufevczv8dtxorbagslq3mc...@mail.gmail.com
Re: Safe File Update (atomic)
* Olaf van der Spek olafvds...@gmail.com schrieb: something that requires inode-path reverse mappings). You could ask for syscalls to copy inodes, etc. You could ask for whatever is needed To me, inodes are an implementation detail that shouldn't be exposed. Well, they're a fundamental concept which sometimes *IS* significant to the applications. It's very different from systems where each file has exactly one name (eg. DOS/Windows) or where there're just filenames that point to opaque stream objects that can be virtually anything (eg. Plan9). to do a (open+write+close) that is atomic if the target already exists. Maybe one of those has a better chance than O_ATOMIC. It is up to you and the fs developers to find some common ground. The FS devs are happy with all the regressions of the workaround, so they're unlikely to do anything. Why not design a new (overlaying) filesystem for that? cu Archive: http://lists.debian.org/20101231135711.gb10...@nibiru.local
Re: Safe File Update (atomic)
On Fri, Dec 31, 2010 at 2:57 PM, Enrico Weigelt weig...@metux.de wrote: To me, inodes are an implementation detail that shouldn't be exposed. Well, they're a fundamental concept which sometimes *IS* significant to the applications. It's very different from systems where each file has exactly one name (eg. DOS/Windows) or where there're just filenames that point to opaque stream objects that can be virtually anything (eg. Plan9). Sometimes, indeed. This number of times should be as low as possible. to do a (open+write+close) that is atomic if the target already exists. Maybe one of those has a better chance than O_ATOMIC. It is up to you and the fs developers to find some common ground. The FS devs are happy with all the regressions of the workaround, so they're unlikely to do anything. Why not design a new (overlaying) filesystem for that? Increased complexity, lower performance, little benefit. Olaf Archive: http://lists.debian.org/aanlktinq1aucfw2fkjiqwz=y2k4hoor87zbhfq8nb...@mail.gmail.com
Re: Safe File Update (atomic)
* Olaf van der Spek olafvds...@gmail.com schrieb: Well, they're a fundamental concept which sometimes *IS* significant to the applications. It's very different from systems where each file has exactly one name (eg. DOS/Windows) or where there're just filenames that point to opaque stream objects that can be virtually anything (eg. Plan9). Sometimes, indeed. This number of times should be as low as possible. These cases aren't that rare. Windows, for example, tends to deny renames on open files, as they're also identified by the filename. (yes, there're other solutions for this problem, eg. having some internal-only inode numbering, etc). It's important to understand that on *nix, filenames do not represent the files directly, but are just pointers to them (somewhat comparable to DNS entries), where other platforms directly use the filename as primary identification (sometimes even as primary key). This has great implications for the semantics of the filesystem. Why not design a new (overlaying) filesystem for that? Increased complexity, lower performance, little benefit. Why that? Currently applications (try to) implement that all on their own, which needs great efforts for multiprocess synchronization. Having that in a little fileserver eases this synchronization and moves the complexity to a single point. cu Archive: http://lists.debian.org/20101231144455.ga29...@nibiru.local
Re: Safe File Update (atomic)
On Fri, Dec 31, 2010 at 03:44:56PM +0100, Enrico Weigelt wrote: * Olaf van der Spek olafvds...@gmail.com schrieb: Well, they're a fundamental concept which sometimes *IS* significant to applications. It's very different from systems where each file has exactly one name (e.g. DOS/Windows) or where there are just filenames that point to opaque stream objects that can be virtually anything (e.g. Plan 9). Sometimes, indeed. This number of times should be as low as possible. These cases aren't that rare. Windows, for example, tends to deny renames on open files, as they're also identified by the filename. (Yes, there are other solutions for this problem, e.g. having some internal-only inode numbering, etc.) I would like to point out that this specific issue is why Windows needs to be rebooted so often compared to Unix systems. This is one situation where inodes really shine. -- brian m. carlson / brian with sandals: Houston, Texas, US +1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187
Re: Safe File Update (atomic)
On Fri, Dec 31, 2010 at 3:44 PM, Enrico Weigelt weig...@metux.de wrote: * Olaf van der Spek olafvds...@gmail.com schrieb: Well, they're a fundamental concept which sometimes *IS* significant to applications. It's very different from systems where each file has exactly one name (e.g. DOS/Windows) or where there are just filenames that point to opaque stream objects that can be virtually anything (e.g. Plan 9). Sometimes, indeed. This number of times should be as low as possible. These cases aren't that rare. Windows, for example, tends to deny I mean that apps shouldn't have to know about inodes. renames on open files, as they're also identified by the filename. Not true. Renaming a running executable works just fine, for example. (Yes, there are other solutions for this problem, e.g. having some internal-only inode numbering, etc.) It's important to understand that on *nix, filenames do not represent the files directly, but are just pointers to them (somewhat comparable to DNS entries), whereas other platforms use the filename directly as the primary identification (sometimes even as a primary key). This has great implications for the semantics of the filesystem. Why not design a new (overlaying) filesystem for that? Increased complexity, lower performance, little benefit. Why that? Currently applications (try to) implement all that on their own, which requires great effort for multiprocess synchronization. Having it in a little fileserver eases this synchronization and moves the complexity to a single point. I mean compared to implementing it properly in the kernel. Olaf Archive: http://lists.debian.org/aanlktimzsvy_g8+r2zooz=skb0tza86kot2qb-eh8...@mail.gmail.com
Re: Safe File Update (atomic)
On Fri, Dec 31, 2010 at 3:58 PM, brian m. carlson sand...@crustytoothpaste.net wrote: These cases aren't that rare. Windows, for example, tends to deny renames on open files, as they're also identified by the filename. (Yes, there are other solutions for this problem, e.g. having some internal-only inode numbering, etc.) I would like to point out that this specific issue is why Windows needs to be rebooted so often compared to Unix systems. This is one situation where inodes really shine. I didn't say inodes are bad. I said apps shouldn't have to know about them. Olaf Archive: http://lists.debian.org/aanlkti=upmdxkkfmx5ly8nfxndmobg55f3yrpuygy...@mail.gmail.com
Re: Safe File Update (atomic)
* Olaf van der Spek olafvds...@gmail.com schrieb: renames on open files, as they're also identified by the filename. Not true. Renaming a running executable works just fine, for example. Well, it has been quite a while since I last used Windows, but IIRC renaming a running executable was denied. Why not design a new (overlaying) filesystem for that? Increased complexity, lower performance, little benefit. Why that? Currently applications (try to) implement all that on their own, which requires great effort for multiprocess synchronization. Having it in a little fileserver eases this synchronization and moves the complexity to a single point. I mean compared to implementing it properly in the kernel. Doing it in the kernel would be fine (maybe DLM could be used here), but would be a nonportable solution for quite a long time ;-o cu -- Enrico Weigelt, metux IT service -- http://www.metux.de/ Archive: http://lists.debian.org/20101231160803.gc10...@nibiru.local
Re: Safe File Update (atomic)
On Wed, 29 Dec 2010, Olaf van der Spek wrote: Writing a temp file, fsync, rename is often proposed. However, the It is: write temp file (in same directory as file to be replaced), fsync temp file[1], rename (atomic), fsync directory[2]. [1] Makes sure file data has been committed to the backend device before the metadata update. [2] Makes sure the metadata has been committed to permanent storage. Can often be ignored when you don't really care to know you will get the new contents (as opposed to the old contents) in case of a crash. MTAs and spools, for example, MUST do it. Which steps you can skip is filesystem-options/filesystem/kernel-version/kernel dependent. When the rename acts as a barrier, [1] can be skipped, for example. Tracking this is a losing proposition. If we could use some syscall to make [1] into a simple barrier request (guaranteed to degrade to fsync if barriers are not operating), it would be better performance-wise. This is what one should request of libc and the kernels with a non-zero chance of getting it implemented (in fact, it might even already exist). I've brought this up on linux-fsdevel and linux-ext4 but they (Ted) claim those exceptions aren't really a problem. Indeed they are not. Code has been dealing with them for years. You name the temp file properly, and teach your program to clean old ones up *safely* (see vim swap file handling for an example) when it starts. vim is a good example: nobody gets surprised by vim swap files left over when vim/the computer crashes. And vim will do something smart with them if it finds them in the current directory when it is started. BTW: safely removing a file is also tricky. AFAIK, one must open it RW, in exclusive mode, stat it by fd and check whether it is what one expects (regular file, ownership), unlink it by fd, close the fd. Is there a code snippet or lib function that handles this properly? I don't know.
I'd be interested in the answer, though :-) -- One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie. -- The Silicon Valley Tarot Henrique Holschuh Archive: http://lists.debian.org/20101230114655.ga19...@khazad-dum.debian.net
Re: Safe File Update (atomic)
On Thu, Dec 30, 2010 at 12:46 PM, Henrique de Moraes Holschuh h...@debian.org wrote: write temp file (in same directory as file to be replaced), fsync temp What if the target name is actually a symlink? To a different volume? What if you're not allowed to create a file in that dir? If we could use some syscall to make [1] into a simple barrier request (guaranteed to degrade to fsync if barriers are not operating), it would be better performance-wise. This is what one should request of libc and the kernels with a non-zero chance of getting it implemented (in fact, it might even already exist). My proposal was O_ATOMIC: // begin transaction open(fname, O_ATOMIC | O_TRUNC); write; // 0+ times close; Seems like the ideal API from the app's point of view. I've brought this up on linux-fsdevel and linux-ext4 but they (Ted) claim those exceptions aren't really a problem. Indeed they are not. Code has been dealing with them for years. You Code has been wrong for years too, based on the recent reports about file corruption with ext4. name the temp file properly, and teach your program to clean old ones up *safely* (see vim swap file handling for an example) when it starts. What about restoring meta-data? File-owner? vim is a good example: nobody gets surprised by vim swap files left over when vim/the computer crashes. And vim will do something smart with them if it finds them in the current directory when it is started. I'm sure the vim code is far from trivial. I think this complexity is part of the reason most apps don't bother. BTW: safely removing a file is also tricky. AFAIK, one must open it RW, in exclusive mode. stat it by fd and check whether it is what one Exclusive mode? Linux doesn't know about mandatory locking (AFAIK). expects (regular file, ownership). unlink it by fd. close the fd. Is there a code snippet or lib function that handles this properly? I don't know. I'd be interested in the answer, though :-) I'll ask glibc.
Olaf Archive: http://lists.debian.org/aanlktikm+dacfnq7lort9vo7p-m-gvn0dgqxup5au...@mail.gmail.com
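The O_ATOMIC proposal above would, in effect, make the open-to-close window a transaction. A hypothetical sketch of the proposed semantics (O_ATOMIC does not exist in any shipped kernel; this is the idea being argued for, not a working API):

```
/* Hypothetical: O_ATOMIC is the flag proposed in this thread, not a real one. */
int fd = open("config.txt", O_WRONLY | O_ATOMIC | O_TRUNC | O_CREAT, 0644);
write(fd, buf, len);   /* any number of writes; none visible to readers yet */
close(fd);             /* commit: readers see old contents or new, never a mix */
/* On a crash before close(), readers would keep seeing the old contents. */
```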
Re: Safe File Update (atomic)
On 30/12/10 13:46, Henrique de Moraes Holschuh wrote: Is there a code snippet or lib function that handles this properly? I don't know. I'd be interested in the answer, though :-) I'm working on one under the MIT license. Will probably release it by the end of this week. Will also handle copying the permissions over and following symlinks. Shachar -- Shachar Shemesh Lingnu Open Source Consulting Ltd. http://www.lingnu.com Archive: http://lists.debian.org/4d1c9d3b.6060...@debian.org
Re: Safe File Update (atomic)
On Thu, Dec 30, 2010 at 3:51 PM, Shachar Shemesh shac...@shemesh.biz wrote: I'm working on one under the MIT license. Will probably release it by the end of this week. Will also handle copying the permissions over and following symlinks. Sounds great! Got a project page already? What about file owner? Meta-data (ACL)? Olaf Archive: http://lists.debian.org/aanlktik-o2mu47dfdvm8kedobjfhw7swkxcwy9fwh...@mail.gmail.com
Re: Safe File Update (atomic)
On Thu, Dec 30, 2010 at 03:30:29PM +0100, Olaf van der Spek wrote: name the temp file properly, and teach your program to clean old ones up *safely* (see vim swap file handling for an example) when it starts. What about restoring meta-data? File-owner? Owner, permissions, ACLs, xattrs, and whatever other future stuff can be stored about files, which then all applications should be made aware of? Yay for simplicity. Mike Archive: http://lists.debian.org/20101230151011.ga12...@glandium.org
Re: Safe File Update (atomic)
On 30/12/10 17:02, Olaf van der Spek wrote: On Thu, Dec 30, 2010 at 3:51 PM, Shachar Shemesh shac...@shemesh.biz wrote: I'm working on one under the MIT license. Will probably release it by the end of this week. Will also handle copying the permissions over and following symlinks. Sounds great! Got a project page already? No. I was doing it as code to accompany an article on my company's site about how it should be done. I was originally out to write the article, and then decided to add code. A good thing, too, as recursively resolving symbolic links is not trivial. There is an extremely simple way to do it on Linux, but it will not work on all platforms (the *BSD platforms, including Mac, do not have /proc by default). What about file owner? Meta-data (ACL)? Olaf The current code (I'm still working on it, or I would have released it already, but it's about 80% done) does copy owner data over (but ignores failures), but does not handle ACLs. I decided to postpone this particular hot potato until I can get a chance to see how to do it (i.e. I never had a chance on Linux) AND how to do it in a cross-platform way (the code is designed to work on any POSIX system). Pointers/patches once released are, of course, welcome :-) Shachar Archive: http://lists.debian.org/4d1ca143.9020...@debian.org
Re: Safe File Update (atomic)
On Thu, Dec 30, 2010 at 4:12 PM, Shachar Shemesh shac...@debian.org wrote: No. I was doing it as code to accompany an article on my company's site about how it should be done. I was originally out to write the article, and then decided to add code. A good thing, too, as recursively resolving symbolic links is not trivial. There is an extremely simple way to do it on Linux, but it will not work on all platforms (the *BSD platforms, including Mac, do not have /proc by default). Depending on /proc is probably not reasonable. Are you sure it will be atomic? ;) What about file owner? Meta-data (ACL)? Olaf The current code (I'm still working on it, or I would have released it already, but it's about 80% done) does copy owner data over (but ignores failures), but does not handle ACLs. I decided to postpone this particular How do you preserve owner (as non-root)? hot potato until I can get a chance to see how to do it (i.e. I never had a chance on Linux) AND how to do it in a cross-platform way (the code is designed to work on any POSIX system). Pointers/patches once released are, of course, welcome :-) The reason I asked for a kernelland solution is because it's hard if not impossible to do properly in userland. But some kernel devs (Ted and others) don't agree. They reason that the desire to preserve all meta-data isn't reasonable by itself. Olaf Archive: http://lists.debian.org/aanlktik93zn1yjf5xyq_+rhaonrj1bszcafpnmkrt...@mail.gmail.com
Re: Safe File Update (atomic)
On Thu, 30 Dec 2010, Olaf van der Spek wrote: On Thu, Dec 30, 2010 at 12:46 PM, Henrique de Moraes Holschuh h...@debian.org wrote: write temp file (in same directory as file to be replaced), fsync temp What if the target name is actually a symlink? To a different volume? Indeed. You have to check that first, of course :-( This is about safe handling of such functions; symlinks always have to be dereferenced and their target checked. After that, you operate on the target; if the symlink changes, your operations will not. What if you're not allowed to create a file in that dir? You fail the write. Or the user has to request the unsafe handling (truncate + write). Or you have to detect it will happen and switch modes if you're allowed to. If we could use some syscall to make [1] into a simple barrier request (guaranteed to degrade to fsync if barriers are not operating), it would be better performance-wise. This is what one should request of libc and the kernels with a non-zero chance of getting it implemented (in fact, it might even already exist). My proposal was O_ATOMIC: // begin transaction open(fname, O_ATOMIC | O_TRUNC); write; // 0+ times close; Seems like the ideal API from the app's point of view. POSIX filesystems do not support it, so glibc would need to do everything your application would otherwise have to do to get that atomicity. I.e. it should go in a separate lib, anyway, and you will have to code for it in the app :( It is not transparent. It cannot be. What about mmap()? What about read+write patterns? At most you could have an open+write+close function that encapsulates most of the crap, with a few options to tell it what to do if it finds a symlink or mismatched owner, what to do if it cannot do it in an atomic way, etc. I suppose one could actually ask for a non-POSIX interface to do all those three operations in one syscall, but I don't think the kernel people will want to implement it.
It would make sense only if object stores become commonplace (where this thing is likely an object store primitive, anyway). I've brought this up on linux-fsdevel and linux-ext4 but they (Ted) claim those exceptions aren't really a problem. Indeed they are not. Code has been dealing with them for years. You Code has been wrong for years too, based on the recent reports about file corruption with ext4. Code written to *deal with files safely* by people who wanted to get it right and actually checked what needs to be done, has been right for years. And has piss-poor performance. Code written by random Joe who has no clue about the braindamages of POSIX and Unix, well... this thread shows how much crap is really needed. One can, obviously, have most filesystems be super-safe, and create a new fadvise or something to say this is crap, be unsafe if you can. Performance will be poor, everything will be safe, and the extra fsync()s will not hurt much because the fs would do it anyway. name the temp file properly, and teach your program to clean old ones up *safely* (see vim swap file handling for an example) when it starts. What about restoring meta-data? File-owner? Hmm, yes, more steps if you want to do something like that, as you must do it with the target open in exclusive mode. Close the target only after the rename went OK. But if the file owner is not yourself, you really should change it, not to mention you might not want to complete the operation in the first place. A lib for this is a really good idea :p vim is a good example: nobody gets surprised by vim swap files left over when vim/the computer crashes. And vim will do something smart with them if it finds them in the current directory when it is started. I'm sure the vim code is far from trivial. I think this complexity is part of the reason most apps don't bother. That I agree with completely. BTW: safely removing a file is also tricky. AFAIK, one must open it RW, in exclusive mode.
stat it by fd and check whether it is what one Exclusive mode? Linux doesn't know about mandatory locking (AFAIK). Yeah... races everywhere... expects (regular file, ownership). unlink it by fd. close the fd. Is there a code snippet or lib function that handles this properly? I don't know. I'd be interested in the answer, though :-) I'll ask glibc. This really should be in a separate lib. You want it to be usable outside of glibc systems, and you CAN implement it (slow as it will be) on anything POSIX. You need only some help from the kernel to speed it up, and that has to be detected at compile time (support) and runtime (availability of the feature) anyway.
Re: Safe File Update (atomic)
On Thu, 30 Dec 2010, Olaf van der Spek wrote: The reason I asked for a kernelland solution is because it's hard if not impossible to do properly in userland. But some kernel devs (Ted and others) don't agree. They reason that the desire to preserve all meta-data isn't reasonable by itself. It isn't. And you can do it anyway:
1. open target, keep it open.
2. do the safe open+write dance on the temp target.
3. get metadata from target by fd.
4. apply metadata to temp target by fd.
5. atomic rename.
6. close both fds.
7. sync parent dir.
Archive: http://lists.debian.org/20101230152401.gb4...@khazad-dum.debian.net
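The seven steps can be sketched in C as below. The fchown in step 4 generally fails for non-root callers (hence the thread's "doesn't work for file-owner" objection), and the parent-directory fsync of step 7 is omitted for brevity; the function name is invented for this example and error handling is abbreviated.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Illustrative helper: replace `path` while copying over the metadata
   that can be read and applied by fd (owner best-effort, mode). */
int replace_preserving_metadata(const char *path, const char *data, size_t len)
{
    int old = open(path, O_RDONLY);              /* 1. keep target open */
    if (old < 0) return -1;

    char tmp[4096];
    snprintf(tmp, sizeof tmp, "%s.new", path);
    int fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600);  /* 2. temp file */
    if (fd < 0) { close(old); return -1; }

    struct stat st;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0 ||
        fstat(old, &st) != 0)                    /* 3. metadata by fd */
        goto fail;

    (void)fchown(fd, st.st_uid, st.st_gid);      /* 4. best effort as non-root */
    if (fchmod(fd, st.st_mode & 07777) != 0)
        goto fail;

    if (rename(tmp, path) != 0)                  /* 5. atomic rename */
        goto fail;
    close(fd); close(old);                       /* 6. close both fds */
    return 0;                                    /* 7. dir fsync omitted here */
fail:
    close(fd); close(old); unlink(tmp);
    return -1;
}
```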
Re: Safe File Update (atomic)
On Thu, Dec 30, 2010 at 4:20 PM, Henrique de Moraes Holschuh h...@debian.org wrote: What if the target name is actually a symlink? To a different volume? Indeed. You have to check that first, of course :-( This is about safe handling of such functions; symlinks always have to be dereferenced and their target checked. After that, you operate on the target; if the symlink changes, your operations will not. That's not really atomic. What if you're not allowed to create a file in that dir? You fail the write. That's a regression from the non-atomic case. Or the user has to request the unsafe handling (truncate + write). Or you have to detect it will happen and switch modes if you're allowed to. If we could use some syscall to make [1] into a simple barrier request (guaranteed to degrade to fsync if barriers are not operating), it would be better performance-wise. This is what one should request of libc and the kernels with a non-zero chance of getting it implemented (in fact, it might even already exist). My proposal was O_ATOMIC: // begin transaction open(fname, O_ATOMIC | O_TRUNC); write; // 0+ times close; Seems like the ideal API from the app's point of view. POSIX filesystems do not support it, so you'd need glibc to do everything Not yet, but I assume it'll be added when there's enough demand. your application would have to get that atomicity. I.e. it should go in a separate lib, anyway, and you will have to code for it in the app :( Why would it have to go in a separate lib? It is not transparent. It cannot be. What about mmap()? What about read+write patterns? They either happen before or after this atomic transaction. Comparable to the rename workaround. At most you could have an open+write+close function that encapsulates most of the crap, with a few options to tell it what to do if it finds a symlink or mismatched owner, what to do if it cannot do it in an atomic way, etc.
I suppose one could actually ask for a non-POSIX interface to do all those three operations in one syscall, but I don't think the kernel people will There's no need for a single syscall. want to implement it. It would make sense only if object stores become commonplace (where this thing is likely an object store primitive, anyway). Nah. Tons of files are written in one go. All could use this atomic flag. I've brought this up on linux-fsdevel and linux-ext4 but they (Ted) claim those exceptions aren't really a problem. Indeed they are not. Code has been dealing with them for years. You Code has been wrong for years too, based on the recent reports about file corruption with ext4. Code written to *deal with files safely* by people who wanted to get it right and actually checked what needs to be done, has been right for years. And has piss-poor performance. Isn't fixing / improving that a good thing? Code written by random Joe who has no clue about the braindamages of POSIX and Unix, well... this thread shows how much crap is really needed. So you agree that this should be improved? One can, obviously, have most filesystems be super-safe, and create a new fadvise or something to say this is crap, be unsafe if you can. Performance will be poor, everything will be safe, and the extra fsync()s will not hurt much because the fs would do it anyway. I actually think this can be done with better performance than the rename workaround. name the temp file properly, and teach your program to clean old ones up *safely* (see vim swap file handling for an example) when it starts. What about restoring meta-data? File-owner? Hmm, yes, more steps if you want to do something like that, as you must do it with the target open in exclusive mode. Close the target only after the rename went OK. But if the file owner is not yourself, you really should change it, not to mention you might not want to complete the operation in the first place. Why? Of course write access to the file is required.
I'll ask glibc. This really should be in a separate lib. You want it to be usable outside of glibc systems, and you CAN implement it (slow as it will be) on anything POSIX. You need only some help from the kernel to speed it up, and that has to be detected at compile time (support) and runtime (availability of the feature) anyway. Olaf Archive: http://lists.debian.org/aanlktinhoftnychhjsd6og04jrvyube8ul55szyyl...@mail.gmail.com
Re: Safe File Update (atomic)
On Thu, Dec 30, 2010 at 4:24 PM, Henrique de Moraes Holschuh h...@debian.org wrote: On Thu, 30 Dec 2010, Olaf van der Spek wrote: The reason I asked for a kernelland solution is because it's hard if not impossible to do properly in userland. But some kernel devs (Ted and others) don't agree. They reason that the desire to preserve all meta-data isn't reasonable by itself. It isn't. Why not? And you can do it anyway: 1. open target, keep it open. 2. do the safe open+write dance on the temp target. 3. get metadata from target by fd 4. apply metadata to temp target by fd 5. atomic rename 6. close both fds 7. sync parent dir. Doesn't work for file-owner. How does it handle meta-data you don't know about yet? Olaf Archive: http://lists.debian.org/aanlktimgqaavbzgwndr6bf87=1bvb1au++qje29d+...@mail.gmail.com
Re: Safe File Update (atomic)
On Thu, 30 Dec 2010, Olaf van der Spek wrote: On Thu, Dec 30, 2010 at 4:24 PM, Henrique de Moraes Holschuh h...@debian.org wrote: On Thu, 30 Dec 2010, Olaf van der Spek wrote: The reason I asked for a kernelland solution is because it's hard if not impossible to do properly in userland. But some kernel devs (Ted and others) don't agree. They reason that the desire to preserve all meta-data isn't reasonable by itself. It isn't. Why not? You touched it; it is not the same file/inode anymore. How does it handle meta-data you don't know about yet? It doesn't. You need a "copy inode without the file data" filesystem interface to be able to do that in the first place. It might exist, but I never heard of it. Archive: http://lists.debian.org/20101230174822.ga20...@khazad-dum.debian.net
Re: Safe File Update (atomic)
On Thu, Dec 30, 2010 at 6:48 PM, Henrique de Moraes Holschuh h...@debian.org wrote: Why not? You touched it; it is not the same file/inode anymore. That's again a regression from the non-atomic case. How does it handle meta-data you don't know about yet? It doesn't. You need a "copy inode without the file data" filesystem interface to be able to do that in the first place. It might exist, but I never heard of it. You wouldn't need that with O_ATOMIC. Olaf Archive: http://lists.debian.org/aanlktimyynyyyw2osbkbg8wxgv2ybrdotzymzlu83...@mail.gmail.com
Re: Safe File Update (atomic)
On 30/12/10 19:48, Henrique de Moraes Holschuh wrote: It doesn't. You need a "copy inode without the file data" filesystem interface to be able to do that in the first place. It might exist, but I never heard of it. If my (extremely leaky) memory serves me right, Windows has it. It's called delete and then rename. It is not atomic (since when does Windows care about not breaking stuff), but it does exactly that. If you delete a file and quickly (yes, this feature is time based) rename a different file to the same name, the new file will receive all metadata information the old file had (including owner, permissions, etc.) Just thought I'd share this little nugget to show you how much worse non-POSIX has it. Shachar Archive: http://lists.debian.org/4d1ccc38.6000...@debian.org
Re: Safe File Update (atomic)
On 30/12/10 17:17, Olaf van der Spek wrote: On Thu, Dec 30, 2010 at 4:12 PM, Shachar Shemesh shac...@debian.org wrote: No. I was doing it as code to accompany an article on my company's site about how it should be done. I was originally out to write the article, and then decided to add code. A good thing, too, as recursively resolving symbolic links is not trivial. There is an extremely simple way to do it on Linux, but it will not work on all platforms (the *BSD platforms, including Mac, do not have /proc by default). Depending on /proc is probably not reasonable. Are you sure it will be atomic? ;) Open the old file, get an fd (we'll assume it's 5). Do readlink on /proc/self/fd/5, and get the file's real path. Do everything in said path. It's atomic in the sense that the determining point in time is the point at which you opened the old file. How do you preserve owner (as non-root)? I thought I answered that. Best effort. You perform the chown, but do not bother with the return code. If it succeeded, great. If not, well, you did your best. The reason I asked for a kernelland solution is because it's hard if not impossible to do properly in userland. But some kernel devs (Ted and others) don't agree. They reason that the desire to preserve all meta-data isn't reasonable by itself. I'm with Henrique on that one. I am more concerned with the amount of non-POSIX code that needs to go into this than preserving all attributes. Shachar Archive: http://lists.debian.org/4d1ccd81.3010...@debian.org
Re: Safe File Update (atomic)
On Thu, Dec 30, 2010 at 7:15 PM, Shachar Shemesh shac...@debian.org wrote: If my (extremely leaky) memory serves me right, Windows has it. It's called delete and then rename. It is not atomic (since when does Windows care about not breaking stuff), but it does exactly that. If you delete a file and quickly (yes, this feature is time based) rename a different file to the same name, the new file will receive all metadata information the old file had (including owner, permissions, etc.) Just thought I'd share this little nugget to show you how much worse non-POSIX has it. You're kidding me. Got any source to back this up? Olaf Archive: http://lists.debian.org/aanlktik8ywzth67auoukrxmt2w1urmpgahnbg4k9s...@mail.gmail.com
Re: Safe File Update (atomic)
On Thu, Dec 30, 2010 at 7:20 PM, Shachar Shemesh <shac...@debian.org> wrote:
>> Depending on /proc is probably not reasonable. Are you sure it will be
>> atomic? ;)
>
> open old file, get fd (we'll assume it's 5). Do readlink on
> /proc/self/fd/5, and get file's real path. Do everything in said path.
> It's atomic, in the sense that the determining point in time is the point
> in which you opened the old file.
>
>> How do you preserve owner (as non-root)?
>
> I thought I answered that. Best effort. You perform the chown, but do not
> bother with the return code. If it succeeded, great. If not, well, you did
> your best.

Ah. Another regression.

>> The reason I asked for a kernelland solution is because it's hard if not
>> impossible to do properly in userland. But some kernel devs (Ted and
>> others) don't agree. They reason that the desire to preserve all meta-data
>> isn't reasonable by itself.
>
> I'm with Henrique on that one. I am more concerned with the amount of
> non-Posix code that needs to go into this than preserving all attributes.

With kernel support you would only need a single non-POSIX flag. Please be
sure to document all assumptions / limitations of your variant.

Olaf

Archive: http://lists.debian.org/aanlktimzgfzwpj8phtevdycbxwwd5s7pp+enlcpi+...@mail.gmail.com
Re: Safe File Update (atomic)
On Thu, 2010-12-30 at 19:29 +0100, Olaf van der Spek wrote:
> On Thu, Dec 30, 2010 at 7:15 PM, Shachar Shemesh <shac...@debian.org> wrote:
>> If my (extremely leaky) memory serves me right, Windows has it. It's
>> called delete and then rename. It is not atomic (since when do Windows
>> care about not breaking stuff), but it does exactly that. If you delete a
>> file and quickly (yes, this feature is time based) rename a different file
>> to the same name, the new file will receive all metadata information the
>> old file had (including owner, permissions etc.). Just thought I'd share
>> this little nugget to show you how much worse non-posix has it.
>
> You're kidding me. Got any source to back this up?

http://support.microsoft.com/?kbid=172190

Ben.

--
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.
Re: Safe File Update (atomic)
On Thu, Dec 30, 2010 at 7:46 PM, Ben Hutchings <b...@decadent.org.uk> wrote:
>> You're kidding me. Got any source to back this up?
>
> http://support.microsoft.com/?kbid=172190

Interesting. Although no longer available on Vista / 7.

Olaf

Archive: http://lists.debian.org/aanlktinuqjcgdg0aazqkiomfthqyorfzc89y7xquu...@mail.gmail.com
Re: Safe File Update (atomic)
On Thu, 30 Dec 2010, Henrique de Moraes Holschuh wrote:
> BTW: safely removing a file is also tricky. AFAIK, one must:
> open it RW, in exclusive mode.
> stat it by fd and check whether it is what one expects (regular file,
> ownership).
> unlink it by fd.
> close the fd.

Eh, as it was pointed out to me by private mail, this is obviously a load of
crap :p There is no unlink by fd. Sorry about that.

The attacks here are races by messing with intermediate path components,
which are either not worth bothering with, or have to be avoided in a much
more convoluted manner.

--
One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie. -- The Silicon Valley Tarot
Henrique Holschuh

Archive: http://lists.debian.org/20101231021723.ga9...@khazad-dum.debian.net
Re: Safe File Update (atomic)
On 30/12/10 17:02, Olaf van der Spek wrote:
>> Got a project page already?
>
> Watch this space. Actual code coming soon(tm).

https://github.com/Shachar/safewrite

Shachar

--
Shachar Shemesh
Lingnu Open Source Consulting Ltd.
http://www.lingnu.com

Archive: http://lists.debian.org/4d1d743b.8080...@shemesh.biz
Safe File Update (atomic)
Since the introduction of ext4, some apps/users have had issues with file
corruption after a system crash. It's not a bug in the FS AFAIK, and it's
not exclusive to ext4.

Writing a temp file, fsync, rename is often proposed. However, the
durability aspect of fsync isn't always required, and this approach has
other issues: it resets the file owner, may lose meta-data, requires
permission to create the temp file, and leaves the temp file visible
(shortly, or permanently after a crash).

I've brought this up on linux-fsdevel and linux-ext4, but they (Ted) claim
those exceptions aren't really a problem.

Is there a code snippet or lib function that handles this properly? What do
you think about the exceptions?

Olaf

Archive: http://lists.debian.org/aanlktimz6ui+l76h=f1frtefb=-daghoeacvnjsp7...@mail.gmail.com
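[Editor's note] For reference, the write-temp / fsync / rename pattern discussed in this thread can be sketched as below. This is a minimal sketch, not code from the thread: the `.tmp` suffix, the best-effort `fchown` (Shachar's suggestion), and the error handling are this sketch's own assumptions, and it illustrates the limitations Olaf lists (it needs permission to create the temp file, and the temp file is briefly visible).

```c
/* Sketch of atomic file replacement: write a temp file, fsync it,
 * then rename over the target. A full solution would also fsync the
 * parent directory and loop on short writes. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int safe_replace(const char *path, const void *data, size_t len)
{
    char tmp[4096];
    struct stat st;
    snprintf(tmp, sizeof tmp, "%s.tmp", path);

    /* capture the old file's mode/owner so we can try to preserve them */
    int have_st = (stat(path, &st) == 0);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL,
                  have_st ? (st.st_mode & 07777) : 0666);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (have_st)
        (void)fchown(fd, st.st_uid, st.st_gid); /* best effort: ignore failure */
    if (close(fd) != 0 || rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

The rename is what makes the update atomic from a reader's point of view: any open of `path` sees either the complete old contents or the complete new contents, never a partial write.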