On Mon, Sep 30, 2013 at 12:49 PM, Anand Avati <av...@gluster.org> wrote:
> > On Mon, Sep 30, 2013 at 9:34 AM, Anand Avati <av...@gluster.org> wrote:
>>
>> On Mon, Sep 30, 2013 at 3:40 AM, Shyamsundar Ranganathan <srang...@redhat.com> wrote:
>>
>>> Avati, Amar,
>>>
>>> Amar, Anand S and myself had a discussion on this comment and here is an answer to your queries the way I see it. Let me know if I am missing something here.
>>>
>>> (This is not an NFS Ganesha requirement, FYI, as Ganesha will only do a single lookup, or preserve a single object handle, per filesystem object in its cache.)
>>>
>>> Currently a glfs_object is an opaque pointer to an object (it is a _handle_ to the object). The object itself contains a ref'd inode, which is the actual pointer to the object.
>>>
>>> 1) The similarity and differences of object handles to fds
>>>
>>> The intention of multiple object handles is in line with multiple fds per file: an application using the library is free to lookup (and/or create (and its equivalents)) and acquire as many object handles as it wants for a particular object, and can hence determine the lifetime of each such object in its view. So in essence one thread can have an object handle to perform, say, attribute related operations, whereas another thread has the same object looked up to perform IO.
>>
>> So do you mean a glfs_object is meant to be a *per-operation* handle? If one thread wants to perform a chmod() and another thread wants to perform chown(), and both attempt to resolve the same name and end up getting different handles, then both of them unref the glfs_handle right after their operation?
>>
>>> Where the object handles depart from the notion of fds is when an unlink is performed. As POSIX defines that open fds are still _open_ for activities on the file, the life of an fd and the actual object that it points to lasts till the fd is closed. In the case of object handles though, the moment any handle is used to unlink the object (which BTW is done using the parent object handle and the name of the child), all handles pointing to the object are still valid pointers, but operations on them will result in ENOENT, as the actual object has since been unlinked and removed by the underlying filesystem.
>>
>> Not always. If the file had hardlinks the handle should still be valid. And if there were no hardlinks and you unlinked the last link, further operations must return ESTALE. ENOENT is when a "basename" does not resolve to a handle (in entry operations) - e.g. when you try to unlink the same entry a second time. Whereas ESTALE is when a presented handle does not exist - e.g. when you try to operate on (read, chmod) a handle which got deleted.
>>
>>> The departure from fds is considered valid in my perspective, as the handle points to an object which has since been removed, and so there are no semantics here that require it to be preserved for further operations merely because a reference to it is held.
>>
>> The departure is only in the behavior of unlinked files. That is orthogonal to whether you want to return separate handles each time a component is looked up. I fail to see how the "departure from fd behavior" justifies creating a new glfs_object per lookup?
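To make the ENOENT vs. ESTALE distinction above concrete, here is a minimal sketch against the handle API under review (function names are from that API; the exact signatures are approximated from memory and may differ):

#include <assert.h>
#include <errno.h>
#include <sys/stat.h>

#include "glfs.h"
#include "glfs-handles.h"

static void
unlink_example (struct glfs *fs)
{
        struct stat         st;
        struct glfs_object *dir = NULL;
        struct glfs_object *obj = NULL;

        dir = glfs_h_lookupat (fs, NULL, "/dir", &st);  /* parent handle       */
        obj = glfs_h_lookupat (fs, dir, "a.txt", &st);  /* handle to the child */

        glfs_h_unlink (fs, dir, "a.txt");               /* drops the last link */

        /* entry operation: the basename no longer resolves -> ENOENT */
        if (glfs_h_unlink (fs, dir, "a.txt") < 0)
                assert (errno == ENOENT);

        /* handle operation: the presented handle no longer exists -> ESTALE
         * (if "a.txt" had other hardlinks, this would still succeed) */
        if (glfs_h_stat (fs, obj, &st) < 0)
                assert (errno == ESTALE);

        glfs_h_close (obj);
        glfs_h_close (dir);
}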
>>> So in essence, for each time an object handle is returned by the API, it has to be closed for its life to end. Additionally, if the object that it points to is removed from the underlying filesystem, the handle is pointing to an entry that does not exist any longer and operations using it return ENOENT.
>>>
>>> 2) The issue/benefit of having the same object handle irrespective of looking it up multiple times
>>>
>>> If we have a 1:1 relationship of object handles (i.e. struct glfs_object) to inodes, then the caller gets the same pointer to the handle. Hence having multiple handles as per the caller boils down to giving out ref counted glfs_object(s) for the same inode.
>>>
>>> Other than the memory footprint, this will still not make the object live past its unlink time. The pointer handed out will still be valid till the last ref count is removed (i.e. the object handle closed), at which point the object handle can be destroyed.
>>
>> If I understand what you say above correctly, you intend to solve the problem of "unlinked files must return error" at your API layer? That's wrong. The right way is to ref-count glfs_object and return them precisely because you should NOT make the decision about the end of life of an inode at that layer. A hardlink may have been created by another client and the glfs_object may therefore still be valid.
>>
>> You are also returning separate glfs_objects for different hardlinks of a file. Does that mean glfs_object is representing a dentry? Or a per-operation reference to an inode?
>>
>>> So again, as many handles as were handed out for the same inode have to be closed, etc.
>>>
>>> 3) Graph switches
>>>
>>> In the case of graph switches, handles that are used in operations post the switch get refreshed with an inode from the new graph, if we have an N:1 object to inode relationship.
>>>
>>> In the case of 1:1 this is done once, but is there some multi thread safety that needs to be in place? I think this is already in place from the glfs_resolve_inode implementation as suggested earlier, but good to check.
>>>
>>> 4) Renames
>>>
>>> In the case of renames, the inode remains the same, hence all handed out object handles are still valid and will operate on the right object per se.
>>>
>>> 5) Unlinks and recreation of the same _named_ object in the background
>>>
>>> Example being, the application gets a handle for an object, say named "a.txt", and in the background (or via another application/client) this is deleted and recreated.
>>>
>>> This will return ENOENT as the GFID would have changed from the previously held object to the new one, even though the names are the same. This seems like the right behaviour, and does not change in the case of a 1:1 or an N:1 object handle to inode mapping.
>>>
>>> So bottom line, I see the object handles like an fd with the noted difference above. Having them in a 1:1 relationship or an N:1 relationship does not seem to be an issue from what I understand; what am I missing here?
>>
>> The issue is this. From what I understand, the usage of glfs_object in the FSAL is not like a per-operation handle, but something stored long term (many minutes, hours, days) in the per-inode context of the NFS Ganesha layer. Now NFS Ganesha may be doing the "right thing" by not re-looking up an already looked up name and therefore avoiding a leak (I'm not so sure, it still needs to verify every so often if the mapping is still valid). From NFS Ganesha's point of view the handle is changing on every lookup.
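As a purely hypothetical illustration of the ref-counted 1:1 mapping discussed above (one glfs_object per inode_t, found again on repeated lookups instead of allocating a fresh handle each time), something along these lines; the struct layout, the "refs" field and both helpers are inventions for this sketch, and the inode_ctx_* signatures are from memory:

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#include "xlator.h"
#include "inode.h"      /* inode_t, inode_ref/unref, inode_ctx_{get,set,del} */

struct glfs_object {
        inode_t *inode;  /* ref'd inode this handle points to    */
        int      refs;   /* outstanding references to the handle */
};

static pthread_mutex_t handle_lock = PTHREAD_MUTEX_INITIALIZER;

static struct glfs_object *
glfs_object_get (xlator_t *this, inode_t *inode)
{
        struct glfs_object *object = NULL;
        uint64_t            ctx    = 0;

        pthread_mutex_lock (&handle_lock);

        if (inode_ctx_get (inode, this, &ctx) == 0 && ctx) {
                /* same inode looked up again: hand back the same handle */
                object = (struct glfs_object *) (uintptr_t) ctx;
                object->refs++;
        } else {
                object = calloc (1, sizeof (*object));
                if (object) {
                        object->inode = inode_ref (inode);
                        object->refs  = 1;
                        ctx = (uint64_t) (uintptr_t) object;
                        inode_ctx_set (inode, this, &ctx);
                }
        }

        pthread_mutex_unlock (&handle_lock);
        return object;
}

static void
glfs_object_put (xlator_t *this, struct glfs_object *object)
{
        uint64_t ctx = 0;

        pthread_mutex_lock (&handle_lock);
        if (--object->refs == 0) {
                /* last reference gone: only now does the handle's life end;
                 * the inode itself may live on (e.g. via hardlinks) */
                inode_ctx_del (object->inode, this, &ctx);
                inode_unref (object->inode);
                free (object);
        }
        pthread_mutex_unlock (&handle_lock);
}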
>> Now consider what happens in the case of READDIRPLUS. A list of names and handles is returned to the client. The list of names can possibly include names which were previously looked up as well. Both are supposed to represent the same "gfid", but here we will be returning new glfs_objects. When a client performs an operation on a GFID, on which glfs_object will the operation be performed at the gfapi layer? This part seems very ambiguous and not clear.
>>
>> What would really help is if you can tell what a glfs_object is supposed to represent. An on disk inode (i.e. GFID)? An in memory per-graph inode (i.e. inode_t)? A dentry? A per-operation handle to an on disk inode? A per-operation handle to an in memory per-graph inode? A per-operation handle to a dentry? In the current form, it does not seem to fit any of these categories.
>>
>> Avati
>
> After giving this some more thought, I feel the cleanest way is to make inode_t and the inode table graph aware. This way, for a given GFID there will be one and only one inode_t at a given time no matter how many graphs are switched. It is also worth noting that the relationship between two GFIDs does not change with a graph switch, so having a separate inode table with duplicate inodes and dentries has always been redundant in a way. The initial decision to have a separate inode table per graph was made because the inode table was bound to an xlator_t (which in turn was bound to a graph).
>
> If we make inode_t and the inode table multi-graph aware, the same inode_t would be valid on a new graph. We would need new code to keep track of the latest graph on which a given inode has been "initialized / discovered" in order to force a discover() on the new graph if necessary (dentry relations would just continue to be valid), and, after a graph switch, to force cleanup of xlators from the old graph.

More specifically, we need a per-graph inode->ctx. On a graph switch, we allocate (and initialize through ->discover()) the new ctx and destroy the old ctx with a graph-specific inode_ctx_destroy(). API calls like inode_ctx_{get,set} must be formalized to have xlator_t as the key (today it is just a convention), and internally the APIs should pick the right inode->ctx based on the xlator_t->graph of the key.

Avati

> This way, from your layer, glfs_object would be an alias to inode_t and internally you can just typedef it. This would also require some changes in the resolver code (in both fuse and glfs-resolve) to handle graph switches and fd migration in the new way, but I imagine the new way will be much simpler/cleaner than the current approach anyway. It would also solve the "handles returned in readdirplus" issue too.
>
> Another reason why I prefer this new approach is, making inode_t graph independent makes old graph destruction completely "in our control", without having to depend on / force fuse to issue FORGET on inode_ts from the old graph. That entire problem gets eliminated as inode_ts would now be graph independent.
>
> (copying Raghavendra Bhat, who is performing the graph destruction work, and Amar)
>
> Thoughts?
>
> Avati
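A rough sketch of what that per-graph inode->ctx could look like; everything here is hypothetical illustration (the graph_ctx_list member, the _inode_graph_ctx struct and both helpers are inventions), the only point taken from the mail being that inode_ctx_{get,set} key on xlator_t and internally select the ctx slots belonging to key->graph:

#include "glusterfs.h"
#include "xlator.h"
#include "inode.h"
#include "list.h"
#include "locking.h"
#include "mem-pool.h"
#include "mem-types.h"

struct _inode_graph_ctx {
        glusterfs_graph_t *graph;      /* graph these slots belong to       */
        struct _inode_ctx *slots;      /* one slot per xlator of that graph */
        struct list_head   list;
};

/* find (or create) the ctx slots for the graph the key xlator lives in;
 * caller holds inode->lock */
static struct _inode_ctx *
__inode_ctx_slots (inode_t *inode, xlator_t *key)
{
        struct _inode_graph_ctx *gctx = NULL;

        list_for_each_entry (gctx, &inode->graph_ctx_list, list) {
                if (gctx->graph == key->graph)
                        return gctx->slots;
        }

        /* first touch from a new graph: allocate fresh slots; the resolver
         * would also force a discover() on that graph at this point, and a
         * later graph switch would destroy the old graph's entry */
        gctx = GF_CALLOC (1, sizeof (*gctx), gf_common_mt_inode_ctx);
        if (!gctx)
                return NULL;

        gctx->graph = key->graph;
        gctx->slots = GF_CALLOC (key->graph->xl_count + 1,
                                 sizeof (*gctx->slots),
                                 gf_common_mt_inode_ctx);
        list_add (&gctx->list, &inode->graph_ctx_list);

        return gctx->slots;
}

/* what inode_ctx_set (inode, key, value) would boil down to internally */
static int
__inode_ctx_set_for_graph (inode_t *inode, xlator_t *key, uint64_t *value)
{
        struct _inode_ctx *slots = NULL;
        int                ret   = -1;
        int                i     = 0;

        LOCK (&inode->lock);
        {
                slots = __inode_ctx_slots (inode, key);
                for (i = 0; slots && i < key->graph->xl_count + 1; i++) {
                        if (slots[i].xl_key == key || !slots[i].xl_key) {
                                slots[i].xl_key = key;
                                slots[i].value1 = *value;
                                ret = 0;
                                break;
                        }
                }
        }
        UNLOCK (&inode->lock);

        return ret;
}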
>>>
>>> Shyam
>>>
>>> ------------------------------
>>>
>>> *From: *"Anand Avati" <av...@gluster.org>
>>> *To: *"Shyamsundar Ranganathan" <srang...@redhat.com>
>>> *Cc: *"Gluster Devel" <gluster-devel@nongnu.org>
>>> *Sent: *Monday, September 30, 2013 10:35:05 AM
>>> *Subject: *Re: RFC/Review: libgfapi object handle based extensions
>>>
>>> I see a pretty core issue - lifecycle management of 'struct glfs_object'. What is the structure representing? When is it created? When is it destroyed? How does it relate to inode_t?
>>>
>>> Looks like for every lookup() we are creating a new glfs_object, even if the looked up inode was already looked up before (in the cache) and had a glfs_object created for it in the recent past.
>>>
>>> We need a stronger, clearer relationship between the two. It is probably necessary for a glfs_object to represent multiple inode_t's at different points in time depending on graph switches, but for a given inode_t we need only one glfs_object. We definitely must NOT have a new glfs_object per lookup call.
>>>
>>> Avati
>>>
>>> On Thu, Sep 19, 2013 at 5:13 AM, Shyamsundar Ranganathan <srang...@redhat.com> wrote:
>>>
>>>> Avati,
>>>>
>>>> Please find the updated patch set for review at gerrit: http://review.gluster.org/#/c/5936/
>>>>
>>>> Changes were made to address points (1), (2) and (3) below, by the usage of the suggested glfs_resolve_inode approach.
>>>>
>>>> I have not yet changed glfs_h_unlink to use glfs_resolve_at (more on this a little later).
>>>>
>>>> So currently, the review request is for all APIs other than glfs_h_unlink, glfs_h_extract_gfid, glfs_h_create_from_gfid.
>>>>
>>>> glfs_resolve_at: Using this function, the terminal name will be a forced lookup anyway (as force_lookup will be passed as 1 based on !next_component). We need to avoid this _extra_ lookup in the unlink case, which is why all the inode_grep(s) etc. were added to glfs_h_lookup in the first place.
>>>>
>>>> Having said the above, we should still leverage glfs_resolve_at anyway, as there seem to be other corner cases where the resolved inode and subvol may be from different graphs. So I think I want to modify glfs_resolve_at to make a conditional force_lookup, based on iatt being NULL or not. IOW, change the call to glfs_resolve_component with the conditional as (reval || (!next_component && iatt)), so that callers that do not want the iatt filled can skip the syncop_lookup.
>>>>
>>>> Request comments on the glfs_resolve_at proposal.
>>>>
>>>> Shyam.
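For clarity, the proposed tweak inside glfs_resolve_at would look roughly like this (a sketch from memory of the terminal-component call in glfs-resolve.c, not a verbatim diff; only the conditional is the proposed change):

                inode = glfs_resolve_component (fs, subvol, parent,
                                                component, &ciatt,
                                                /* force a hard lookup on the
                                                 * terminal component only when
                                                 * the caller asked for an iatt;
                                                 * glfs_h_unlink passes iatt ==
                                                 * NULL and skips the extra
                                                 * syncop_lookup */
                                                (reval ||
                                                 (!next_component && iatt)));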
>>>> ----- Original Message -----
>>>>
>>>> > From: "Anand Avati" <av...@gluster.org>
>>>> > To: "Shyamsundar Ranganathan" <srang...@redhat.com>
>>>> > Cc: "Gluster Devel" <gluster-devel@nongnu.org>
>>>> > Sent: Wednesday, September 18, 2013 11:39:27 AM
>>>> > Subject: Re: RFC/Review: libgfapi object handle based extensions
>>>>
>>>> > Minor comments are made in gerrit. Here is a larger (more important) comment for which email is probably more convenient.
>>>>
>>>> > There is a problem in the general pattern of the fops, for example glfs_h_setattrs() (and others too):
>>>>
>>>> > 1. glfs_validate_inode() has the assumption that object->inode deref is a guarded operation, but here we are doing an unguarded deref in the parameter glfs_resolve_base().
>>>>
>>>> > 2. A more important issue: glfs_active_subvol() and glfs_validate_inode() are not atomic. glfs_active_subvol() can return an xlator from one graph, but by the time glfs_validate_inode() is called, a graph switch could have happened and the inode can get resolved to a different graph. And in syncop_XXXXXX() we end up calling on graph1 with an inode belonging to graph2.
>>>>
>>>> > 3. ESTALE_RETRY is a fundamentally wrong thing to do with handle based operations. The ESTALE_RETRY macro exists for path based FOPs, where the resolved handle could have turned stale by the time we perform the FOP (where resolution and FOP are non-atomic). Over here, the handle is predetermined, and it does not make sense to retry on ESTALE (notice that FD based fops in glfs-fops.c also do not have ESTALE_RETRY for this same reason).
>>>>
>>>> > I think the pattern should be similar to the FD based fops, which specifically address both the above problems. Here's an outline:
>>>>
>>>> > glfs_h_XXXX (struct glfs *fs, glfs_object *object, ...)
>>>> > {
>>>> >         xlator_t *subvol = NULL;
>>>> >         inode_t  *inode  = NULL;
>>>> >
>>>> >         __glfs_entry_fs (fs);
>>>> >
>>>> >         subvol = glfs_active_subvol (fs);
>>>> >         if (!subvol) { errno = EIO; ... goto out; }
>>>> >
>>>> >         inode = glfs_resolve_inode (fs, object, subvol);
>>>> >         if (!inode) { errno = ESTALE; ... goto out; }
>>>> >
>>>> >         loc.inode = inode;
>>>> >         ret = syncop_XXXX (subvol, &loc, ...);
>>>> > }
>>>>
>>>> > Notice the signature of glfs_resolve_inode(). What it does: given a glfs_object and a subvol, it returns an inode_t which is resolved on that subvol. This way the syncop_XXXX() is performed with a matching subvol and inode. Also, it returns the inode pointer so that no unsafe object->inode deref is done by the caller. Again, this is the same pattern already followed by the fd based fops.
>>>>
>>>> > Also, as mentioned in one of the comments, please consider using glfs_resolve_at() and avoiding manual construction of loc_t.
>>>>
>>>> > Thanks,
>>>> > Avati
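To flesh that outline out, here is a sketch of glfs_h_setattrs() following the same pattern (argument order for glfs_resolve_inode() as in the outline above; the setattr signature is simplified to take a struct iatt directly, and the helper and syncop signatures are approximated, so the merged patch may differ in detail):

int
glfs_h_setattrs (struct glfs *fs, struct glfs_object *object,
                 struct iatt *iatt, int valid)
{
        int       ret    = -1;
        xlator_t *subvol = NULL;
        inode_t  *inode  = NULL;
        loc_t     loc    = {0, };

        __glfs_entry_fs (fs);

        /* pin an active graph for the duration of this call */
        subvol = glfs_active_subvol (fs);
        if (!subvol) {
                errno = EIO;
                goto out;
        }

        /* resolve the handle to an inode on that same subvol, so the
         * syncop below never mixes inodes and xlators across graphs */
        inode = glfs_resolve_inode (fs, object, subvol);
        if (!inode) {
                errno = ESTALE;
                goto out;
        }

        loc.inode = inode_ref (inode);
        uuid_copy (loc.gfid, inode->gfid);

        /* no ESTALE_RETRY loop here: the handle is predetermined */
        ret = syncop_setattr (subvol, &loc, iatt, valid, NULL, NULL);

out:
        loc_wipe (&loc);

        if (inode)
                inode_unref (inode);

        if (subvol)
                glfs_subvol_done (fs, subvol);

        return ret;
}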
_______________________________________________
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel