On Fri, May 29, 2015 at 10:00 AM, Richard Yao <[email protected]>
wrote:

> It seems that no one is particularly interested in a strict point in time
> version,
>

To be clear, I'm not saying that I'm uninterested in that, just that it
seems hard/complicated to achieve in combination with other requirements
(specifically, not needing to have the entire state in memory at once).
 (And I don't know that anyone else has weighed in.)
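For concreteness, here is a toy sketch of what "not needing the entire state in memory at once" could look like on the consumer side; the record format and names are invented for illustration and are not the actual lzc_list interface:

```python
import json
import os
import threading

def list_names_streaming(produce):
    """Drain newline-delimited JSON records from a pipe one at a time,
    so the consumer never holds the entire listing in memory at once."""
    rfd, wfd = os.pipe()
    writer = threading.Thread(target=produce, args=(wfd,))
    writer.start()
    names = []
    with os.fdopen(rfd) as stream:
        for line in stream:            # one record per line
            names.append(json.loads(line)["name"])
    writer.join()
    return names

def fake_kernel_side(wfd):
    # Stand-in for the kernel end of the pipe: it emits records as it
    # iterates instead of building the whole list up front.
    with os.fdopen(wfd, "w") as out:
        for i in range(3):
            out.write(json.dumps({"name": "tank/ds%d" % i}) + "\n")

print(list_names_streaming(fake_kernel_side))
# → ['tank/ds0', 'tank/ds1', 'tank/ds2']
```

The point is only that peak consumer memory is one record, not the whole listing; whatever atomicity the records have is decided entirely by the producer.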


> so I'll drop that design constraint, document the issue and hope userland
> programmers relying on the output will actually read the documentation
> carefully so that they do not make the naive assumption that I imagine that
> they will make.
>

FWIW, userland programmers already need to be aware of these issues when
iterating over directory entries (e.g. if a file/directory is concurrently
renamed, "find" can list it 0, 1, or 2 times).  Concurrent renames are
pretty difficult to deal with.
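To make the 0/1/2 outcomes concrete, here is a toy simulation (not real filesystem code) of a cursor walking fixed directory slots while a concurrent rename moves an entry to a different offset:

```python
def sightings(name, src, dst, rename_step):
    """Walk 4 stable directory slots left to right; just before reading
    slot `rename_step`, a concurrent rename moves `name` from wherever
    it currently is to slot `dst`.  Returns how many times the walker
    observes the entry."""
    slots = [None] * 4
    slots[src] = name
    seen = 0
    for i in range(len(slots)):
        if i == rename_step:
            slots[slots.index(name)] = None   # unlink the old entry
            slots[dst] = name                 # relink at the new offset
        if slots[i] == name:
            seen += 1
    return seen

print(sightings("f", 0, 3, 2))  # moved ahead of the cursor: seen twice  → 2
print(sightings("f", 3, 0, 2))  # moved behind the cursor: never seen    → 0
print(sightings("f", 3, 0, 0))  # moved before the walk reaches it       → 1
```

Userland tools like "find" live with exactly this ambiguity; a lister built on the same iteration model inherits it.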

--matt


> On 29 May 2015 at 12:24, Richard Yao <[email protected]> wrote:
>
>>
>>
>> On 29 May 2015 at 12:22, Richard Yao <[email protected]> wrote:
>>
>>> It seems that I neglected to include the CC list in my last reply. My
>>> apologies for that.
>>>
>>> On 29 May 2015 at 11:42, Matthew Ahrens <[email protected]> wrote:
>>>
>>>>
>>>>
>>>> On Fri, May 29, 2015 at 8:34 AM, Richard Yao <[email protected]>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On 29 May 2015 at 11:25, Matthew Ahrens <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, May 29, 2015 at 8:09 AM, Richard Yao <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I should add that the purpose of the pipe is to avoid situations
>>>>>>> where iteration takes arbitrarily long because userland must
>>>>>>> allocate enough memory up front while the kernel and userspace
>>>>>>> race with increases in memory requirements as it iterates (or
>>>>>>> having things hang because there are too many/too-large pools to
>>>>>>> list).
>>>>>>>
>>>>>>
>>>>>> How much memory are we talking about?  (few MB?  few GB?)
>>>>>>
>>>>>> I don't see what would be racy about the situation you described.
>>>>>>
>>>>>
>>>>> Imagine having a trillion datasets on a system with only 1 gigabyte
>>>>> of memory. If we need some constant amount of memory per dataset,
>>>>> that requirement should fall on userland, and if userland can avoid
>>>>> it, that is fine.
>>>>>
>>>>
>>>> You have a pool with a trillion datasets?  Impressive!  How long did it
>>>> take to create that?  The most I've seen is on the order of 100,000.
>>>>
>>>
>>> I have not yet made one, but one thing that makes sense for a stable
>>> API is to ensure that its usability does not depend on the relative
>>> sizes of system memory and the imported pools. I had considered
>>> relaxing that idea for the initial implementation to avoid introducing
>>> a potential CVE on systems with user delegation support, but I have
>>> come to realize that my first attempt at relaxing it failed to export
>>> a point in time snapshot of the state to userspace.
>>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> On 29 May 2015 at 11:06, Richard Yao <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Dear Matt,
>>>>>>>>
>>>>>>>> The problem I am trying to solve is the absence of a sane way to
>>>>>>>> get the state in a manner similar to `zfs list` from libzfs_core.
>>>>>>>>
>>>>>>>
>>>>>> If you're OK with the semantics of "zfs list" (everything it tells
>>>>>> you was true at some point during the list operation), you could do the
>>>>>> same thing with libzfs_core.
>>>>>>
>>>>>
>>>>> This has the same race that rsync has (which sending a snapshot
>>>>> fixes), except with `zfs list` rather than filesystem subtrees. It
>>>>> has the potential to cause nasty bugs where userland assumes the
>>>>> output reflects something that is actually there.
>>>>>
>>>>
>>>> If you need the output to reflect "something that is actually there",
>>>> then you need to prevent changes across not just the "zfs list", but
>>>> also the userland code that examines the output and takes action
>>>> based on it.  Otherwise, as soon as we return to userland the output
>>>> could be wrong (because something was created/deleted/renamed after
>>>> the list completed but before the user process runs).
>>>>
>>>
>>> Having to worry about the output not being consistent with the in-core
>>> state seems harder than having to worry about the output not being a point
>>> in time snapshot. The difference is that you can have things like a dataset
>>> appear twice or not at all due to a rename. There are other possibilities.
>>> One that dawns on me is what happens when userland has dependencies on the
>>> order in which properties are updated. These could be arbitrary user
>>> properties or something already there such as the mountpoint property
>>> and/or those having to do with security.
>>>
>>> It is non-obvious that these edge cases are even possible to a potential
>>> userland programmer and it would seem wise to try to avoid them.
>>>
>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>>> The current API is non-atomic and I was concerned about memory
>>>>>>>> utilization on large/many pools, so I wrote an lzc_list that
>>>>>>>> operated using a pipe while holding locks. When porting that to
>>>>>>>> Illumos, I realized that these locks could be held arbitrarily
>>>>>>>> long by userland, so I came up with a second approach that did
>>>>>>>> holds. Unfortunately, this led the code to use memory linear in
>>>>>>>> the number of datasets and hurt atomicity, so I came up with the
>>>>>>>> idea of pinning the last txg and using that as output.
>>>>>>>>
>>>>>>>
>>>>>> I don't really know what you mean by "pinning the last txg".  (Beyond
>>>>>> not allowing its blocks to be overwritten -- see the questions in my
>>>>>> previous email.)
>>>>>>
>>>>>
>>>>> What I mean is that we would not allow the blocks to be overwritten
>>>>> until the list operation finishes.
>>>>>
>>>>
>>>> Can you explain how that helps?  Are you doing the list from an
>>>> independent implementation of ZFS (e.g. like zdb)?  Because the kernel's
>>>> view of the state (e.g. contents of dsl_dataset_phys_t, results of
>>>> zap_lookup) is going to continue changing even if you don't overwrite any
>>>> blocks on disk.
>>>>
>>>
>>> I was thinking of doing a hidden read-only second import of the pool
>>> at the last txg (with the current import pinning those blocks),
>>> outputting the data, and then undoing the operation, to ensure that we
>>> get a point in time view of the state without kernel memory
>>> requirements increasing with the output.
>>>
>>> That said, I have considered 4 main variations of listing output,
>>> with that being the fourth. It is similar to the idea of "eventual
>>> consistency" used in distributed systems and could be considered
>>> analogous to the snapshots that we do now. The first variation was to
>>> do what we do now at the kernel side of the pipe. The second was
>>> locking things to prevent modifying operations from occurring until
>>> output finishes. The third was to do holds, which unfortunately pins
>>> down memory and isn't a proper atomic view.
>>>
>>> My latest idea is to do a mix of #2 and #1, with an "atomic" flag
>>> switching before them and making it into a privileged operation. #4 could
>>> be added if the atomic flag passed as a string saying "lazy", but I suspect
>>> that just #1 and #2 would be sufficient to make consumers happy. The
>>> existence of the atomic flag would alert userland programmers about the
>>> races in listing output and those that need assistance in avoiding them can
>>> do so as long as they do things in a manner that is privileged.
>>>
>>
>> To be clear, this should be "between", not "before". Also, the atomic
>> flag selecting #2 would be the privileged version.
>>
>>
>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> There seems to be no way to list datasets in a way that is
>>>>>>>> simultaneously atomic, non-blocking, memory efficient and consistent 
>>>>>>>> (where
>>>>>>>> we get the latest state).
>>>>>>>>
>>>>>>>
>>>>>> That is indeed nontrivial.  For example, there's no way to do that
>>>>>> for listing a directory hierarchy.  Can you describe what each of those
>>>>>> requirements is exactly (e.g. what is the difference between "atomic" and
>>>>>> "consistent"?)  I can guess at the others but it would be best to lay out
>>>>>> your requirements explicitly.
>>>>>>
>>>>>
>>>>> Here, I am using the term atomicity to mean that the output was true
>>>>> at some point in time and the term consistent to refer to the latest 
>>>>> state.
>>>>>
>>>>
>>>> So it would be impossible for the output to be "consistent" (refer to
>>>> latest state) without locking that crosses system calls (see above).
>>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> My latest thoughts are to implement lzc_list with an "atomic" toggle
>>>>>>>> that does the first way that I implemented things
>>>>>>>>
>>>>>>>
>>>>>> Meaning it grabs locks to prevent create/destroy/rename operations
>>>>>> from taking place while the list is in progress?
>>>>>>
>>>>>
>>>>> Yes.
>>>>>
>>>>>
>>>>>>
>>>>>> as a privileged operation that by default requires root in the global
>>>>>>>> zone on Illumos and pushes the non-atomic code we have now into the 
>>>>>>>> kernel
>>>>>>>> when it is not.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>> Yours truly,
>>>>>>>> Richard Yao
>>>>>>>>
>>>>>>>>
>>>>>>>> On 27 May 2015 at 17:08, Matthew Ahrens <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, May 27, 2015 at 7:23 PM, Richard Yao <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Dear Everyone,
>>>>>>>>>>
>>>>>>>>>> As some people know, I have been working on libzfs_core
>>>>>>>>>> extensions and I currently have a prototype lzc_list command that is 
>>>>>>>>>> a
>>>>>>>>>> large subset of the functionality of `zfs list`.
>>>>>>>>>>
>>>>>>>>>> After discussing it with others, I suspect that implementing an
>>>>>>>>>> in-core pool metadata snapshot facility would be the most
>>>>>>>>>> natural way of implementing lzc_list. The in-core pool snapshot
>>>>>>>>>> would atomically pin the metadata state of a pool on disk for
>>>>>>>>>> `zfs list` without holding locks while it is traversed (my
>>>>>>>>>> first implementation) or requiring that we pin memory via holds
>>>>>>>>>> on dsl_dataset_t objects until the operation is finished (my
>>>>>>>>>> second implementation). While the in-core pool metadata
>>>>>>>>>> snapshot is in effect, the blocks containing the pool metadata
>>>>>>>>>> would not be marked free in the in-core space map, but they
>>>>>>>>>> would be marked free in the on-disk space map when they would
>>>>>>>>>> normally be marked as such. That has the downside that disk
>>>>>>>>>> space would not be freed right away, but we make no guarantees
>>>>>>>>>> of immediately freeing disk space anyway, so I suspect that is
>>>>>>>>>> okay.
>>>>>>>>>>
>>>>>>>>>> Would this be something entirely new, or is there already a way
>>>>>>>>>> to snapshot the pool's metadata state in-core that I am unaware
>>>>>>>>>> of, perhaps in a branch somewhere?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> Let's say for the sake of argument that we don't overwrite
>>>>>>>>> anything on disk that you care about.  What else do you need to do?  
>>>>>>>>> I'm
>>>>>>>>> imagining that you will have a separate idea of the metadata state 
>>>>>>>>> (what
>>>>>>>>> datasets exist, their properties and interrelations, etc), which is 
>>>>>>>>> out of
>>>>>>>>> date from what's really on disk.  How do you maintain that?  It seems
>>>>>>>>> nontrivial.
>>>>>>>>>
>>>>>>>>> Maybe you could start by describing the problem that you are
>>>>>>>>> solving?  It sounds like you want an atomic view of the pool metadata
>>>>>>>>> (that's used by "zfs list")?
>>>>>>>>>
>>>>>>>>> --matt
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer
