It seems that I neglected to include the CC list in my last reply. My
apologies for that.

On 29 May 2015 at 11:42, Matthew Ahrens <[email protected]> wrote:

>
>
> On Fri, May 29, 2015 at 8:34 AM, Richard Yao <[email protected]>
> wrote:
>
>>
>>
>> On 29 May 2015 at 11:25, Matthew Ahrens <[email protected]> wrote:
>>
>>>
>>>
>>> On Fri, May 29, 2015 at 8:09 AM, Richard Yao <[email protected]>
>>> wrote:
>>>
>>>> I should add that the purpose of the pipe is to avoid situations where
>>>> iteration takes arbitrarily long due to needing to allocate enough memory
>>>> in userland and having the kernel/userspace race with increases in memory
>>>> requirements as it iterates (or having things hang from too many/large
>>>> pools to list).
>>>>
>>>
>>> How much memory are we talking about?  (few MB?  few GB?)
>>>
>>> I don't see what would be racy about the situation you described.
>>>
>>
>> Imagine having a trillion datasets on a system with only 1 gigabyte of
>> memory. If we need some constant amount of memory per dataset, it should
>> be required in userland rather than in the kernel, and if userland can
>> avoid it, that is fine.
>>
>
> You have a pool with a trillion datasets?  Impressive!  How long did it
> take to create that?  The most I've seen is on the order of 100,000.
>

I have not yet made one, but a stable API should remain usable regardless
of the relative sizes of system memory and the imported pools. I had
considered relaxing that requirement for the initial implementation to
avoid introducing a potential CVE on systems with user delegation support,
but I have come to realize that my first attempt at relaxing it failed to
export a point-in-time snapshot of the state to userspace.


>
>>
>>>
>>>
>>>>
>>>> On 29 May 2015 at 11:06, Richard Yao <[email protected]> wrote:
>>>>
>>>>> Dear Matt,
>>>>>
>>>>> The problem I am trying to solve is the absence of a sane way to get
>>>>> the state in a manner similar to `zfs list` from libzfs_core.
>>>>>
>>>>
>>> If you're OK with the semantics of "zfs list" (everything it tells you
>>> was true at some point during the list operation), you could do the same
>>> thing with libzfs_core.
>>>
>>
>> This has the same race that rsync has (and that sending a snapshot
>> fixes), except with `zfs list` output rather than filesystem subtrees.
>> It has the potential to cause nasty bugs where userland assumes the
>> output reflects something that is actually there.
>>
>
> If you need the output to reflect "something that is actually there", then
> you need to prevent changes across not just the "zfs list", but also the
> userland code that examines the output and takes action based on it.
> Otherwise, as soon as we return to userland the output could be wrong
> (because something was created/deleted/renamed after the list completed
> but before the user process runs).
>

Having to worry about output that is inconsistent with the in-core state
seems harder than having to worry about output that is not a point-in-time
snapshot. The difference is that a dataset can appear twice, or not at all,
because of a concurrent rename. There are other possibilities. One that
occurs to me is userland depending on the order in which properties are
updated. These could be arbitrary user properties or existing ones such as
the mountpoint property and/or those related to security.

It is not obvious to a userland programmer that these edge cases are even
possible, and it seems wise to try to avoid them.


>
>>
>>
>>>
>>>> The current API is non-atomic and I was concerned about memory
>>>>> utilization on large/many pools, so I wrote an lzc_list that operated
>>>>> using a pipe while holding locks. When porting that to Illumos, I
>>>>> realized that these locks could be held arbitrarily long by userland,
>>>>> so I came up with a second approach that did holds. Unfortunately,
>>>>> this led the code to use memory linear in the number of datasets and
>>>>> hurt atomicity, so I came up with the idea of pinning the last txg
>>>>> and using that as output.
>>>>>
>>>>
>>> I don't really know what you mean by "pinning the last txg".  (Beyond
>>> not allowing its blocks to be overwritten -- see the questions in my
>>> previous email.)
>>>
>>
>> What I mean is that we would not allow the blocks to be overwritten
>> until the list operation finishes.
>>
>
> Can you explain how that helps?  Are you doing the list from an
> independent implementation of ZFS (e.g. like zdb)?  Because the kernel's
> view of the state (e.g. contents of dsl_dataset_phys_t, results of
> zap_lookup) is going to continue changing even if you don't overwrite any
> blocks on disk.
>

I was thinking about doing a hidden read-only second import of the pool at
the last txg (with the current import pinning those blocks), outputting the
data, and then undoing the operation. That would give us a point-in-time
view of the state without kernel memory requirements growing with the
output.

That said, I have considered four main variations of listing output, with
that being the fourth. It is similar to the idea of "eventual consistency"
used in distributed systems and could be considered analogous to the
snapshots that we do now. The first variation was to do what we do now on
the kernel side of the pipe. The second was locking things to prevent
modifying operations from occurring until output finishes. The third was to
do holds, which unfortunately pins down memory and is not a proper atomic
view.

My latest idea is to do a mix of #2 and #1, with an "atomic" flag switching
between them and making the atomic variant a privileged operation. #4 could
be added later by passing the atomic flag as the string "lazy", but I
suspect that #1 and #2 alone would be sufficient to make consumers happy.
The existence of the atomic flag would alert userland programmers to the
races in listing output, and those who need help avoiding them could get it
as long as they operate with the required privilege.

>
>
>>
>>
>>>
>>>
>>>>
>>>>> There seems to be no way to list datasets in a way that is
>>>>> simultaneously atomic, non-blocking, memory efficient and consistent 
>>>>> (where
>>>>> we get the latest state).
>>>>>
>>>>
>>> That is indeed nontrivial.  For example, there's no way to do that for
>>> listing a directory hierarchy.  Can you describe what each of those
>>> requirements is exactly (e.g. what is the difference between "atomic" and
>>> "consistent"?)  I can guess at the others but it would be best to lay out
>>> your requirements explicitly.
>>>
>>
>> Here, I am using the term atomicity to mean that the output was true at
>> some point in time and the term consistent to refer to the latest state.
>>
>
> So it would be impossible for the output to be "consistent" (refer to
> latest state) without locking that crosses system calls (see above).
>
>
>>
>>
>>>
>>> My latest thoughts are to implement lzc_list with an "atomic" toggle
>>>>> that does the first way that I implemented things
>>>>>
>>>>
>>> Meaning it grabs locks to prevent create/destroy/rename operations from
>>> taking place while the list is in progress?
>>>
>>
>> Yes.
>>
>>
>>>
>>> as a privileged operation that by default requires root in the global
>>>>> zone on Illumos and pushes the non-atomic code we have now into the kernel
>>>>> when it is not.
>>>>>
>>>>> What do you think?
>>>>>
>>>>> Yours truly,
>>>>> Richard Yao
>>>>>
>>>>>
>>>>> On 27 May 2015 at 17:08, Matthew Ahrens <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, May 27, 2015 at 7:23 PM, Richard Yao <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Dear Everyone,
>>>>>>>
>>>>>>> As some people know, I have been working on libzfs_core extensions
>>>>>>> and I currently have a prototype lzc_list command that is a large 
>>>>>>> subset of
>>>>>>> the functionality of `zfs list`.
>>>>>>>
>>>>>>> After discussing it with others, I suspect that implementing an
>>>>>>> in-core pool metadata snapshot facility for lzc_list would be the most
>>>>>>> natural way of implementing lzc_list. The in-core pool snapshot would
>>>>>>> atomically pin the metadata state of a pool on disk for `zfs list` 
>>>>>>> without
>>>>>>> holding locks while it is traversed (my first implementation) or 
>>>>>>> requiring
>>>>>>> that we pin memory via holds on dsl_dataset_t objects until the 
>>>>>>> operation
>>>>>>> is finished (my second implementation). While the in-core pool metadata
>>>>>>> snapshot is in effect, the blocks containing the pool metadata would
>>>>>>> not be marked free in the in-core space map, but they would be marked
>>>>>>> free in the on-disk space map when they would normally be marked as
>>>>>>> such. That has
>>>>>>> the downside that disk space would not be freed right away, but we make 
>>>>>>> no
>>>>>>> guarantees of immediately freeing disk space anyway, so I suspect that 
>>>>>>> is
>>>>>>> okay.
>>>>>>>
>>>>>>> Would this be something entirely new, or is there already a way to
>>>>>>> snapshot the pool's metadata state in-core of which I am unaware,
>>>>>>> or in a branch somewhere?
>>>>>>>
>>>>>>>
>>>>>> Let's say for the sake of argument that we don't overwrite anything
>>>>>> on disk that you care about.  What else do you need to do?  I'm imagining
>>>>>> that you will have a separate idea of the metadata state (what datasets
>>>>>> exist, their properties and interrelations, etc), which is out of date 
>>>>>> from
>>>>>> what's really on disk.  How do you maintain that?  It seems nontrivial.
>>>>>>
>>>>>> Maybe you could start by describing the problem that you are
>>>>>> solving?  It sounds like you want an atomic view of the pool metadata
>>>>>> (that's used by "zfs list")?
>>>>>>
>>>>>> --matt
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer
