On 29 May 2015 at 12:22, Richard Yao <[email protected]> wrote:

> It seems that I neglected to include the CC list in my last reply. My
> apologies for that.
>
> On 29 May 2015 at 11:42, Matthew Ahrens <[email protected]> wrote:
>
>>
>>
>> On Fri, May 29, 2015 at 8:34 AM, Richard Yao <[email protected]>
>> wrote:
>>
>>>
>>>
>>> On 29 May 2015 at 11:25, Matthew Ahrens <[email protected]> wrote:
>>>
>>>>
>>>>
>>>> On Fri, May 29, 2015 at 8:09 AM, Richard Yao <[email protected]
>>>> > wrote:
>>>>
>>>>> I should add that the purpose of the pipe is to avoid situations where
>>>>> iteration takes arbitrarily long because userland must allocate enough
>>>>> memory to hold the results while racing against the kernel as memory
>>>>> requirements grow during iteration (or where things hang outright
>>>>> because there are too many or too-large pools to list).
>>>>>
>>>>
>>>> How much memory are we talking about?  (few MB?  few GB?)
>>>>
>>>> I don't see what would be racy about the situation you described.
>>>>
>>>
>>> Imagine having a trillion datasets on a system with only 1 gigabyte of
>>> memory. If some constant amount of memory per dataset is needed, it
>>> should be allocated in userland, and if userland can avoid needing it,
>>> that is fine.
>>>
>>
>> You have a pool with a trillion datasets?  Impressive!  How long did it
>> take to create that?  The most I've seen is on the order of 100,000.
>>
>
> I have not yet made one, but one property that makes sense for a stable
> API is that its usability does not depend on the size of the imported
> pools relative to system memory. I had considered relaxing that
> requirement in the initial implementation to avoid introducing a potential
> CVE on systems with user delegation support, but I have come to realize
> that my first attempt to avoid a CVE by relaxing it failed to export a
> point-in-time snapshot of the state to userspace.
>
>
>>
>>>
>>>>
>>>>
>>>>>
>>>>> On 29 May 2015 at 11:06, Richard Yao <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Dear Matt,
>>>>>>
>>>>>> The problem I am trying to solve is the absence of a sane way to get
>>>>>> the state in a manner similar to `zfs list` from libzfs_core.
>>>>>>
>>>>>
>>>> If you're OK with the semantics of "zfs list" (everything it tells you
>>>> was true at some point during the list operation), you could do the same
>>>> thing with libzfs_core.
>>>>
>>>
>>> This has the same race that rsync has (and that sending a snapshot
>>> fixes), except that it applies to `zfs list` rather than filesystem
>>> subtrees. It has the potential to cause nasty bugs where userland
>>> assumes the output reflects something that is actually there.
>>>
>>
>> If you need the output to reflect "something that is actually there",
>> then you need to prevent changes across not just the "zfs list", but also
>> the userland code that examines the output and takes action based on it.
>> Otherwise, as soon as we return to userland the output could be wrong
>> (because something was created/deleted/renamed after the list completed
>> but before the user process runs).
>>
>
> Having to worry about the output not being consistent with the in-core
> state seems harder than having to worry about the output not being a
> point-in-time snapshot. The difference is that things like a dataset
> appearing twice, or not at all, can happen due to a rename. There are
> other possibilities. One that dawns on me is what happens when userland
> depends on the order in which properties are updated. These could be
> arbitrary user properties or existing ones such as the mountpoint property
> and/or those having to do with security.
>
> It is non-obvious to a potential userland programmer that these edge cases
> are even possible, and it would seem wise to try to avoid them.
>
>
>>
>>>
>>>
>>>>
>>>>> The current API is non-atomic and I was concerned about memory
>>>>>> utilization on large/many pools, so I wrote an lzc_list that operated
>>>>>> using a pipe while holding locks. When porting that to Illumos, I
>>>>>> realized that these locks could be held arbitrarily long by userland,
>>>>>> so I came up with a second approach that did holds. Unfortunately,
>>>>>> this led the code to use memory linear in the number of datasets and
>>>>>> hurt atomicity, so I came up with the idea of pinning the last txg
>>>>>> and using that as output.
>>>>>>
>>>>>
>>>> I don't really know what you mean by "pinning the last txg".  (Beyond
>>>> not allowing its blocks to be overwritten -- see the questions in my
>>>> previous email.)
>>>>
>>>
>>> What I mean is that we would not allow the blocks to be overwritten
>>> until the list operation finishes.
>>>
>>
>> Can you explain how that helps?  Are you doing the list from an
>> independent implementation of ZFS (e.g. like zdb)?  Because the kernel's
>> view of the state (e.g. contents of dsl_dataset_phys_t, results of
>> zap_lookup) is going to continue changing even if you don't overwrite any
>> blocks on disk.
>>
>
> I was talking about doing a hidden read-only second import of the pool at
> the last txg (with the current import pinning those blocks), outputting
> the data and then undoing the operation, to ensure that we get a
> point-in-time view of the state without kernel memory requirements
> increasing with the output.
>
> That said, I have considered 4 main variations of listing output with that
> being the fourth. It is similar to the idea of "eventual consistency" used
> in distributed systems and could be considered to be analogous to the
> snapshots that we do now. The first variation was to do what we do now at
> the kernel side of the pipe. The second was locking things to prevent
> modifying operations from occurring until output finishes. The third was to
> do holds, which unfortunately pins down memory and isn't a proper atomic
> view.
>
> My latest idea is to do a mix of #2 and #1, with an "atomic" flag
> switching before them and making it into a privileged operation. #4 could
> be added if the atomic flag is passed as a string saying "lazy", but I
> suspect that just #1 and #2 would be sufficient to make consumers happy.
> The existence of the atomic flag would alert userland programmers to the
> races in listing output, and those who need help avoiding them could do
> so as long as they operate in a privileged manner.
>

To be clear, this should be "between", not "before". Also, the atomic flag
selecting #2 would be the privileged version.


>
>>
>>>
>>>
>>>>
>>>>
>>>>>
>>>>>> There seems to be no way to list datasets in a way that is
>>>>>> simultaneously atomic, non-blocking, memory efficient and consistent 
>>>>>> (where
>>>>>> we get the latest state).
>>>>>>
>>>>>
>>>> That is indeed nontrivial.  For example, there's no way to do that for
>>>> listing a directory hierarchy.  Can you describe what each of those
>>>> requirements is exactly (e.g. what is the difference between "atomic" and
>>>> "consistent"?)  I can guess at the others but it would be best to lay out
>>>> your requirements explicitly.
>>>>
>>>
>>> Here, I am using the term atomicity to mean that the output was true at
>>> some point in time and the term consistent to refer to the latest state.
>>>
>>
>> So it would be impossible for the output to be "consistent" (refer to
>> latest state) without locking that crosses system calls (see above).
>>
>>
>>>
>>>
>>>>
>>>> My latest thoughts are to implement lzc_list with an "atomic" toggle
>>>>>> that does the first way that I implemented things
>>>>>>
>>>>>
>>>> Meaning it grabs locks to prevent create/destroy/rename operations from
>>>> taking place while the list is in progress?
>>>>
>>>
>>> Yes.
>>>
>>>
>>>>
>>>> as a privileged operation that by default requires root in the global
>>>>>> zone on Illumos and pushes the non-atomic code we have now into the 
>>>>>> kernel
>>>>>> when it is not.
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> Yours truly,
>>>>>> Richard Yao
>>>>>>
>>>>>>
>>>>>> On 27 May 2015 at 17:08, Matthew Ahrens <[email protected]> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, May 27, 2015 at 7:23 PM, Richard Yao <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Dear Everyone,
>>>>>>>>
>>>>>>>> As some people know, I have been working on libzfs_core extensions
>>>>>>>> and I currently have a prototype lzc_list command that is a large 
>>>>>>>> subset of
>>>>>>>> the functionality of `zfs list`.
>>>>>>>>
>>>>>>>> After discussing it with others, I suspect that implementing an
>>>>>>>> in-core pool metadata snapshot facility would be the most natural
>>>>>>>> way of implementing lzc_list. The in-core pool snapshot would
>>>>>>>> atomically pin the metadata state of a pool on disk for `zfs list`
>>>>>>>> without holding locks while it is traversed (my first
>>>>>>>> implementation) or requiring that we pin memory via holds on
>>>>>>>> dsl_dataset_t objects until the operation is finished (my second
>>>>>>>> implementation). While the in-core pool metadata snapshot is in
>>>>>>>> effect, the blocks containing the pool metadata would not be marked
>>>>>>>> free in the in-core space map, but they would be marked free in the
>>>>>>>> on-disk space map when they would normally be marked as such. That
>>>>>>>> has the downside that disk space would not be freed right away, but
>>>>>>>> we make no guarantees of immediately freeing disk space anyway, so
>>>>>>>> I suspect that is okay.
>>>>>>>>
>>>>>>>> Would this be something entirely new, or is there already a way to
>>>>>>>> snapshot the pool's metadata state in-core of which I am unaware,
>>>>>>>> perhaps in a branch somewhere?
>>>>>>>>
>>>>>>>>
>>>>>>> Let's say for the sake of argument that we don't overwrite anything
>>>>>>> on disk that you care about.  What else do you need to do?  I'm 
>>>>>>> imagining
>>>>>>> that you will have a separate idea of the metadata state (what datasets
>>>>>>> exist, their properties and interrelations, etc), which is out of date 
>>>>>>> from
>>>>>>> what's really on disk.  How do you maintain that?  It seems nontrivial.
>>>>>>>
>>>>>>> Maybe you could start by describing the problem that you are
>>>>>>> solving?  It sounds like you want an atomic view of the pool metadata
>>>>>>> (that's used by "zfs list")?
>>>>>>>
>>>>>>> --matt
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer
