On 29 May 2015 at 12:22, Richard Yao <[email protected]> wrote:
> It seems that I neglected to include the CC list in my last reply. My
> apologies for that.
>
> On 29 May 2015 at 11:42, Matthew Ahrens <[email protected]> wrote:
>
>> On Fri, May 29, 2015 at 8:34 AM, Richard Yao <[email protected]> wrote:
>>
>>> On 29 May 2015 at 11:25, Matthew Ahrens <[email protected]> wrote:
>>>
>>>> On Fri, May 29, 2015 at 8:09 AM, Richard Yao <[email protected]> wrote:
>>>>
>>>>> I should add that the purpose of the pipe is to avoid situations
>>>>> where iteration takes arbitrarily long due to needing to allocate
>>>>> enough memory in userland, with the kernel and userspace racing
>>>>> against increases in memory requirements as it iterates (or things
>>>>> hanging when there are too many or too-large pools to list).
>>>>
>>>> How much memory are we talking about? (A few MB? A few GB?)
>>>>
>>>> I don't see what would be racy about the situation you described.
>>>
>>> Imagine having a trillion datasets on a system with only 1 gigabyte
>>> of memory. If we need some constant amount of memory per dataset,
>>> that should be required in userland, and if userland can avoid it,
>>> that is fine.
>>
>> You have a pool with a trillion datasets? Impressive! How long did it
>> take to create that? The most I've seen is on the order of 100,000.
>
> I have not yet made one, but one thing that makes sense for a stable
> API is to ensure that its usability does not depend on the relative
> sizes of system memory and the imported pools. I had considered
> relaxing that idea for the initial implementation to avoid introducing
> a potential CVE on systems with user delegation support, but I have
> come to realize that my first attempt to avoid introducing a CVE by
> relaxing it failed to export a point-in-time snapshot of the state to
> userspace.
>>>>> On 29 May 2015 at 11:06, Richard Yao <[email protected]> wrote:
>>>>>
>>>>>> Dear Matt,
>>>>>>
>>>>>> The problem I am trying to solve is the absence of a sane way to
>>>>>> get the state in a manner similar to `zfs list` from libzfs_core.
>>>>
>>>> If you're OK with the semantics of "zfs list" (everything it tells
>>>> you was true at some point during the list operation), you could do
>>>> the same thing with libzfs_core.
>>>
>>> This has the same race that rsync has, which sending a snapshot
>>> fixes, except with `zfs list` rather than filesystem subtrees. It has
>>> the potential to cause nasty bugs where userland makes assumptions
>>> about the output reflecting something that is actually there.
>>
>> If you need the output to reflect "something that is actually there",
>> then you need to prevent changes across not just the "zfs list", but
>> also the userland code that examines the output and takes action based
>> on it. Otherwise, as soon as we return to userland the output could be
>> wrong (because something was created/deleted/renamed after the list
>> completed but before the user process runs).
>
> Having to worry about the output not being consistent with the in-core
> state seems harder than having to worry about the output not being a
> point-in-time snapshot. The difference is that you can have things like
> a dataset appearing twice or not at all due to a rename. There are
> other possibilities. One that dawns on me is what happens when userland
> has dependencies on the order in which properties are updated. These
> could be arbitrary user properties or something already there, such as
> the mountpoint property and/or those having to do with security.
>
> It is non-obvious to a potential userland programmer that these edge
> cases are even possible, and it would seem wise to try to avoid them.
>>>>>> The current API is non-atomic, and I was concerned about memory
>>>>>> utilization on large/many pools, so I wrote an lzc_list that
>>>>>> operated using a pipe while holding locks. When porting that to
>>>>>> Illumos, I realized that these locks could be held arbitrarily long
>>>>>> by userland, so I came up with a second approach that did holds.
>>>>>> Unfortunately, this led the code to use memory linear in the number
>>>>>> of datasets and hurt atomicity, so I came up with the idea of
>>>>>> pinning the last txg and using that as output.
>>>>
>>>> I don't really know what you mean by "pinning the last txg". (Beyond
>>>> not allowing its blocks to be overwritten -- see the questions in my
>>>> previous email.)
>>>
>>> What I mean is that we would not allow the blocks to be overwritten
>>> until the list operation finishes.
>>
>> Can you explain how that helps? Are you doing the list from an
>> independent implementation of ZFS (e.g. like zdb)? Because the
>> kernel's view of the state (e.g. contents of dsl_dataset_phys_t,
>> results of zap_lookup) is going to continue changing even if you don't
>> overwrite any blocks on disk.
>
> I was thinking about doing a hidden read-only second import of the pool
> at the last txg (with the current import pinning those blocks),
> outputting the data and undoing the operation, to ensure that we get a
> point-in-time view of the state without kernel memory requirements
> increasing with the output.
>
> That said, I have considered 4 main variations of listing output, with
> that being the fourth. It is similar to the idea of "eventual
> consistency" used in distributed systems and could be considered
> analogous to the snapshots that we do now. The first variation was to
> do what we do now at the kernel side of the pipe. The second was
> locking things to prevent modifying operations from occurring until
> output finishes.
> The third was to do holds, which unfortunately pins down memory and
> isn't a proper atomic view.
>
> My latest idea is to do a mix of #2 and #1, with an "atomic" flag
> switching before them and making it into a privileged operation. #4
> could be added if the atomic flag were passed as a string saying
> "lazy", but I suspect that just #1 and #2 would be sufficient to make
> consumers happy. The existence of the atomic flag would alert userland
> programmers to the races in listing output, and those that need
> assistance in avoiding them can do so as long as they do things in a
> manner that is privileged.

To be clear, this should be "between", not "before". Also, the atomic
flag selecting #2 would be the privileged version.

>>>>>> There seems to be no way to list datasets in a way that is
>>>>>> simultaneously atomic, non-blocking, memory efficient and
>>>>>> consistent (where we get the latest state).
>>>>
>>>> That is indeed nontrivial. For example, there's no way to do that
>>>> for listing a directory hierarchy. Can you describe what each of
>>>> those requirements is exactly (e.g. what is the difference between
>>>> "atomic" and "consistent")? I can guess at the others, but it would
>>>> be best to lay out your requirements explicitly.
>>>
>>> Here, I am using the term "atomic" to mean that the output was true
>>> at some point in time and the term "consistent" to refer to the
>>> latest state.
>>
>> So it would be impossible for the output to be "consistent" (reflect
>> the latest state) without locking that crosses system calls (see
>> above).
>>
>>>>>> My latest thoughts are to implement lzc_list with an "atomic"
>>>>>> toggle that does things the first way that I implemented them
>>>>
>>>> Meaning it grabs locks to prevent create/destroy/rename operations
>>>> from taking place while the list is in progress?
>>>
>>> Yes.
>>>>>> as a privileged operation that by default requires root in the
>>>>>> global zone on Illumos, and pushes the non-atomic code we have now
>>>>>> into the kernel when it is not.
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> Yours truly,
>>>>>> Richard Yao
>>>>>>
>>>>>> On 27 May 2015 at 17:08, Matthew Ahrens <[email protected]> wrote:
>>>>>>
>>>>>>> On Wed, May 27, 2015 at 7:23 PM, Richard Yao <[email protected]> wrote:
>>>>>>>
>>>>>>>> Dear Everyone,
>>>>>>>>
>>>>>>>> As some people know, I have been working on libzfs_core
>>>>>>>> extensions, and I currently have a prototype lzc_list command
>>>>>>>> that is a large subset of the functionality of `zfs list`.
>>>>>>>>
>>>>>>>> After discussing it with others, I suspect that implementing an
>>>>>>>> in-core pool metadata snapshot facility would be the most natural
>>>>>>>> way of implementing lzc_list. The in-core pool snapshot would
>>>>>>>> atomically pin the metadata state of a pool on disk for
>>>>>>>> `zfs list` without holding locks while it is traversed (my first
>>>>>>>> implementation) or requiring that we pin memory via holds on
>>>>>>>> dsl_dataset_t objects until the operation is finished (my second
>>>>>>>> implementation). While the in-core pool metadata snapshot is in
>>>>>>>> effect, the blocks containing the pool metadata would not be
>>>>>>>> marked free in the in-core space map, but they would be marked
>>>>>>>> free in the on-disk space map when they would normally be marked
>>>>>>>> as such. That has the downside that disk space would not be freed
>>>>>>>> right away, but we make no guarantees of immediately freeing disk
>>>>>>>> space anyway, so I suspect that is okay.
>>>>>>>> Would this be something entirely new, or is there already a way
>>>>>>>> to snapshot the pool's metadata state in-core that I am unaware
>>>>>>>> of, perhaps in a branch somewhere?
>>>>>>>
>>>>>>> Let's say for the sake of argument that we don't overwrite
>>>>>>> anything on disk that you care about. What else do you need to
>>>>>>> do? I'm imagining that you will have a separate idea of the
>>>>>>> metadata state (what datasets exist, their properties and
>>>>>>> interrelations, etc.), which is out of date from what's really on
>>>>>>> disk. How do you maintain that? It seems nontrivial.
>>>>>>>
>>>>>>> Maybe you could start by describing the problem that you are
>>>>>>> solving? It sounds like you want an atomic view of the pool
>>>>>>> metadata (that's used by "zfs list")?
>>>>>>>
>>>>>>> --matt
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer
