On Fri, May 29, 2015 at 10:00 AM, Richard Yao <[email protected]> wrote:
> It seems that no one is particularly interested in a strict point in time
> version,
>

To be clear, I'm not saying that I'm uninterested in that, just that it
seems hard/complicated to achieve in combination with other requirements
(specifically, not needing to have the entire state in memory at once).
(And I don't know that anyone else has weighed in.)

> so I'll drop that design constraint, document the issue and hope userland
> programmers relying on the output will actually read the documentation
> carefully so that they do not make the naive assumption that I imagine that
> they will make.
>

FWIW, userland programmers already need to be aware of these issues when
iterating over directory entries (e.g. if a file/directory is concurrently
renamed, "find" can list it 0, 1, or 2 times). Concurrent renames are
pretty difficult to deal with.

--matt

> On 29 May 2015 at 12:24, Richard Yao <[email protected]> wrote:
>
>>
>> On 29 May 2015 at 12:22, Richard Yao <[email protected]> wrote:
>>
>>> It seems that I neglected to include the CC list in my last reply. My
>>> apologies for that.
>>>
>>> On 29 May 2015 at 11:42, Matthew Ahrens <[email protected]> wrote:
>>>
>>>>
>>>> On Fri, May 29, 2015 at 8:34 AM, Richard Yao <[email protected]> wrote:
>>>>
>>>>>
>>>>> On 29 May 2015 at 11:25, Matthew Ahrens <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>> On Fri, May 29, 2015 at 8:09 AM, Richard Yao <[email protected]> wrote:
>>>>>>
>>>>>>> I should add that the purpose of the pipe is to avoid situations
>>>>>>> where iteration takes arbitrarily long due to needing to allocate
>>>>>>> enough memory in userland and having the kernel/userspace race with
>>>>>>> increases in memory requirements as it iterates (or having things
>>>>>>> hang from too many/large pools to list).
>>>>>>>
>>>>>>
>>>>>> How much memory are we talking about? (few MB? few GB?)
>>>>>>
>>>>>> I don't see what would be racy about the situation you described.
>>>>>>
>>>>>
>>>>> Imagine having a trillion datasets on a system with only 1 gigabyte of
>>>>> memory. If we need some constant amount of memory per dataset, that should
>>>>> be required in userland and if userland can avoid it, that is fine.
>>>>>
>>>>
>>>> You have a pool with a trillion datasets? Impressive! How long did it
>>>> take to create that? The most I've seen is on the order of 100,000.
>>>>
>>>
>>> I have not yet made one, but one thing that makes sense for a stable API
>>> is to ensure that its ability to be used does not depend on the relative
>>> size of the system memory and pools imported. I had considered relaxing
>>> that idea for the initial implementation to avoid introducing a potential
>>> CVE on systems with user delegation support, but I have come to realize
>>> that my first attempt to avoid introducing a CVE by relaxing that failed to
>>> export a point in time snapshot of the state to userspace.
>>>
>>>>>>>
>>>>>>> On 29 May 2015 at 11:06, Richard Yao <[email protected]> wrote:
>>>>>>>
>>>>>>>> Dear Matt,
>>>>>>>>
>>>>>>>> The problem I am trying to solve is the absence of a sane way to get
>>>>>>>> the state in a manner similar to `zfs list` from libzfs_core.
>>>>>>>>
>>>>>>>
>>>>>> If you're OK with the semantics of "zfs list" (everything it tells
>>>>>> you was true at some point during the list operation), you could do the
>>>>>> same thing with libzfs_core.
>>>>>>
>>>>>
>>>>> This has the same race that rsync has, which sending a snapshot fixes,
>>>>> except with `zfs list` rather than filesystem subtrees. It has the
>>>>> potential to cause nasty bugs where userland makes assumptions about
>>>>> output reflecting something that is actually there.
>>>>>
>>>>
>>>> If you need the output to reflect "something that is actually there",
>>>> then you need to prevent changes across not just the "zfs list", but also
>>>> the userland code that examines the output and takes action based on it.
>>>> Otherwise, as soon as we return to userland the output could be wrong
>>>> (because something was created/deleted/renamed after the list completed
>>>> but before the user process runs).
>>>>
>>>
>>> Having to worry about the output not being consistent with the in-core
>>> state seems harder than having to worry about the output not being a point
>>> in time snapshot. The difference is that you can have things like a dataset
>>> appear twice or not at all due to a rename. There are other possibilities.
>>> One that dawns on me is what happens when userland has dependencies on the
>>> order in which properties are updated. These could be arbitrary user
>>> properties or something already there such as the mountpoint property
>>> and/or those having to do with security.
>>>
>>> It is non-obvious that these edge cases are even possible to a potential
>>> userland programmer and it would seem wise to try to avoid them.
>>>
>>>>>>>> The current API is non-atomic and I was concerned about memory
>>>>>>>> utilization on large/many pools, so I wrote a lzc_list that operated
>>>>>>>> using a pipe while holding locks. When porting that to Illumos, I
>>>>>>>> realized that these locks could be held arbitrarily long by userland,
>>>>>>>> so I came up with a second approach that did holds. Unfortunately,
>>>>>>>> this led the code to use memory linear in the number of datasets and
>>>>>>>> hurt atomicity, so I came up with the idea of pinning the last txg
>>>>>>>> and using that as output.
>>>>>>>>
>>>>>>>
>>>>>> I don't really know what you mean by "pinning the last txg". (Beyond
>>>>>> not allowing its blocks to be overwritten -- see the questions in my
>>>>>> previous email.)
>>>>>>
>>>>>
>>>>> What I mean is that we would not allow the blocks to be overwritten
>>>>> until the list operation finishes.
>>>>>
>>>>
>>>> Can you explain how that helps? Are you doing the list from an
>>>> independent implementation of ZFS (e.g. like zdb)? Because the kernel's
>>>> view of the state (e.g. contents of dsl_dataset_phys_t, results of
>>>> zap_lookup) is going to continue changing even if you don't overwrite any
>>>> blocks on disk.
>>>>
>>>
>>> I was thinking about doing a hidden read-only second import of the pool at
>>> the last txg (with the current import pinning those blocks), output the
>>> data and undo the operation to ensure that we get a point in time view of
>>> the state without kernel memory requirements increasing with the output.
>>>
>>> That said, I have considered 4 main variations of listing output with
>>> that being the fourth. It is similar to the idea of "eventual consistency"
>>> used in distributed systems and could be considered to be analogous to the
>>> snapshots that we do now. The first variation was to do what we do now at
>>> the kernel side of the pipe. The second was locking things to prevent
>>> modifying operations from occurring until output finishes. The third was to
>>> do holds, which unfortunately pins down memory and isn't a proper atomic
>>> view.
>>>
>>> My latest idea is to do a mix of #2 and #1, with an "atomic" flag
>>> switching before them and making it into a privileged operation. #4 could
>>> be added if the atomic flag were passed as a string saying "lazy", but I
>>> suspect that just #1 and #2 would be sufficient to make consumers happy.
>>> The existence of the atomic flag would alert userland programmers about the
>>> races in listing output and those that need assistance in avoiding them can
>>> do so as long as they do things in a manner that is privileged.
>>>
>>
>> To be clear, this should be "between", not "before". Also, the atomic
>> flag selecting #2 would be the privileged version.
>>
>>>
>>>>>>>> There seems to be no way to list datasets in a way that is
>>>>>>>> simultaneously atomic, non-blocking, memory efficient and consistent
>>>>>>>> (where we get the latest state).
>>>>>>>>
>>>>>>>
>>>>>> That is indeed nontrivial. For example, there's no way to do that
>>>>>> for listing a directory hierarchy. Can you describe what each of those
>>>>>> requirements is exactly (e.g. what is the difference between "atomic" and
>>>>>> "consistent"?) I can guess at the others but it would be best to lay out
>>>>>> your requirements explicitly.
>>>>>>
>>>>>
>>>>> Here, I am using the term atomicity to mean that the output was true
>>>>> at some point in time and the term consistent to refer to the latest
>>>>> state.
>>>>>
>>>>
>>>> So it would be impossible for the output to be "consistent" (refer to
>>>> latest state) without locking that crosses system calls (see above).
>>>>
>>>>>>>> My latest thoughts are to implement lzc_list with an "atomic" toggle
>>>>>>>> that does the first way that I implemented things
>>>>>>>>
>>>>>>>
>>>>>> Meaning it grabs locks to prevent create/destroy/rename operations
>>>>>> from taking place while the list is in progress?
>>>>>>
>>>>>
>>>>> Yes.
>>>>>
>>>>>>>> as a privileged operation that by default requires root in the global
>>>>>>>> zone on Illumos and pushes the non-atomic code we have now into the
>>>>>>>> kernel when it is not.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>> Yours truly,
>>>>>>>> Richard Yao
>>>>>>>>
>>>>>>>>
>>>>>>>> On 27 May 2015 at 17:08, Matthew Ahrens <[email protected]> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, May 27, 2015 at 7:23 PM, Richard Yao <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Dear Everyone,
>>>>>>>>>>
>>>>>>>>>> As some people know, I have been working on libzfs_core
>>>>>>>>>> extensions and I currently have a prototype lzc_list command that
>>>>>>>>>> provides a large subset of the functionality of `zfs list`.
>>>>>>>>>>
>>>>>>>>>> After discussing it with others, I suspect that implementing an
>>>>>>>>>> in-core pool metadata snapshot facility for lzc_list would be the
>>>>>>>>>> most natural way of implementing lzc_list. The in-core pool snapshot
>>>>>>>>>> would atomically pin the metadata state of a pool on disk for
>>>>>>>>>> `zfs list` without holding locks while it is traversed (my first
>>>>>>>>>> implementation) or requiring that we pin memory via holds on
>>>>>>>>>> dsl_dataset_t objects until the operation is finished (my second
>>>>>>>>>> implementation). While the in-core pool metadata snapshot is in
>>>>>>>>>> effect, the blocks containing the pool metadata would not be marked
>>>>>>>>>> free in the in-core free space_map, but they would be marked free in
>>>>>>>>>> the on-disk space map when they would normally be marked as such.
>>>>>>>>>> That has the downside that disk space would not be freed right away,
>>>>>>>>>> but we make no guarantees of immediately freeing disk space anyway,
>>>>>>>>>> so I suspect that is okay.
>>>>>>>>>>
>>>>>>>>>> Would this be something entirely new, or is there already a way to
>>>>>>>>>> snapshot the pool's metadata state in-core, either one of which I am
>>>>>>>>>> unaware or one in a branch somewhere?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Let's say for the sake of argument that we don't overwrite
>>>>>>>>> anything on disk that you care about. What else do you need to do?
>>>>>>>>> I'm imagining that you will have a separate idea of the metadata
>>>>>>>>> state (what datasets exist, their properties and interrelations,
>>>>>>>>> etc), which is out of date from what's really on disk. How do you
>>>>>>>>> maintain that? It seems nontrivial.
>>>>>>>>>
>>>>>>>>> Maybe you could start by describing the problem that you are
>>>>>>>>> solving? It sounds like you want an atomic view of the pool metadata
>>>>>>>>> (that's used by "zfs list")?
>>>>>>>>>
>>>>>>>>> --matt
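
The point made in this thread that a non-atomic listing can report a concurrently renamed entry 0, 1, or 2 times can be illustrated with a small simulation. This is only a sketch under stated assumptions, not how `find`, `zfs list`, or lzc_list are implemented: it assumes the listing walks names in sorted order, resumes each batch from the last name returned, and takes no lock between batches. All names here (`list_in_batches`, `rename`, `times_seen`) are hypothetical.

```python
from typing import Callable, Dict, Iterator, Optional, Tuple

Namespace = Dict[str, int]  # name -> object id (hypothetical model of a namespace)


def list_in_batches(
    namespace: Namespace,
    batch_size: int,
    between_batches: Optional[Callable[[Namespace], None]] = None,
) -> Iterator[Tuple[str, int]]:
    """Yield (name, object id) pairs in name order, one batch at a time.

    Each batch re-reads the live namespace starting after the last name
    returned, so nothing protects the listing from changes between batches.
    """
    cursor = ""  # last name handed back to the caller
    while True:
        batch = sorted(name for name in namespace if name > cursor)[:batch_size]
        if not batch:
            return
        for name in batch:
            yield name, namespace[name]
        cursor = batch[-1]
        if between_batches is not None:
            # Simulate one concurrent modification landing between two batches.
            between_batches(namespace)
            between_batches = None


def rename(old: str, new: str) -> Callable[[Namespace], None]:
    """Return a callback that renames `old` to `new` in the namespace."""
    def apply(namespace: Namespace) -> None:
        namespace[new] = namespace.pop(old)
    return apply


def times_seen(
    initial: Namespace,
    change: Optional[Callable[[Namespace], None]],
    object_id: int,
    batch_size: int = 2,
) -> int:
    """Count how often one object shows up in a single listing pass."""
    namespace = dict(initial)  # work on a copy so scenarios stay independent
    listing = list_in_batches(namespace, batch_size, change)
    return sum(1 for _, oid in listing if oid == object_id)


initial = {"b": 1, "c": 2, "m": 3}

seen_without_rename = times_seen(initial, None, 3)
seen_after_forward_rename = times_seen(initial, rename("b", "z"), 1)
seen_after_backward_rename = times_seen(initial, rename("m", "a"), 3)
```

In this model, renaming an already-listed entry to a name past the cursor makes it show up twice, while renaming a not-yet-listed entry to a name before the cursor makes it disappear from the listing entirely. Holding a lock across the whole loop would rule out both outcomes, at the cost of blocking renames for as long as the consumer takes to read.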
