Re: [Gluster-devel] compound fop design first cut
On 01/06/2016 07:50 PM, Jeff Darcy wrote:

>> 1) fops will be compounded per inode, meaning 2 fops on different
>> inodes can't be compounded (not because of the design, just reducing
>> the scope of the problem).
>>
>> 2) Each xlator that wants a compound fop packs the arguments by
>> itself.
>
> Packed how? Are we talking about XDR here, or something else? How is
> dict_t handled? Will there be generic packing/unpacking code
> somewhere, or is each translator expected to do this manually?

Packed as mentioned in step 4 below. Common functions will be provided which fill an array cell with the information given for that fop. In addition, there will be filling functions for each of the compound fops listed at https://public.pad.fsfe.org/p/glusterfs-compound-fops. The XDR should be similar to what Soumya suggested in earlier mails, just like in NFS.

>> 3) On the server side a de-compounder placed below the server xlator
>> unpacks the arguments and does the necessary operations.
>>
>> 4) Arguments for compound fops will be passed as an array of a union
>> of structures, where each structure is associated with a fop.
>>
>> 5) Each xlator will have _compound_fop() which receives the fop and
>> does the additional processing that is required for itself.
>
> What happens when (not if) some translator fails to provide this? Is
> there a default function? Is there something at the end of the chain
> that will log an error if the fop gets that far without being handled
> (as with GF_FOP_IPC)?

Yes, there will be a default_fop provided, just like for other fops, which is a simple pass-through. Posix will log and unwind with -1, ENOTSUPP.

>> 6) Response will also be an array of a union of response structures,
>> where each structure is associated with a fop's response.
>
> What are the error semantics? Does processing of a series always stop
> at the first error, or are there some errors that allow
> retry/continue? If/when processing stops, who's responsible for
> cleaning up state changed by those parts that succeeded?
> What happens if the connection dies in the middle?

Yes, at the moment we are implementing stop-at-first-error semantics, as that seems to satisfy all the compound fops we listed at https://public.pad.fsfe.org/p/glusterfs-compound-fops. Each translator that wants to handle the compound fop should handle failures just as it does for a normal fop today.

> How are values returned from one operation in a series propagated as
> arguments for the next?

They are not. In the first cut, the only dependency between two fops is whether the previous one succeeded or not. Just this much seems to work fine for the fops we are targeting for now (https://public.pad.fsfe.org/p/glusterfs-compound-fops). We may have to enhance it based on what comes up in the future.

> What are the implications for buffer and message sizes? What are the
> limits on how large these can get, and/or how many operations can be
> compounded?

It depends on the limits imposed by the RPC layer. If it can't send the request, the fop will fail. If it can send the request but the response is too big to send back, I think the fop will end in a frame timeout waiting for the response. Either way it will be a failure. At the moment this doesn't seem to be a problem for the fops listed at https://public.pad.fsfe.org/p/glusterfs-compound-fops.

> How is synchronization handled? Is the inode locked for the duration
> of the compound operation, to prevent other operations from changing
> the context in which later parts of the compound operation execute?
> Are there possibilities for deadlock here? Alternatively, if no
> locking is done, are we going to document the fact that compound
> operations are not atomic/linearizable?

Since we are limiting the scope to single-inode fops, locking should suffice. EC doesn't have any problem, as it has just one lock covering the data/entry and metadata domains. In AFR we need to come up with a locking order for the metadata and data domains.
Something similar to what we do in rename, where we need to take multiple locks.

Pranith

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] compound fop design first cut
> 1) fops will be compounded per inode, meaning 2 fops on different
> inodes can't be compounded (Not because of the design, Just reducing
> scope of the problem).
>
> 2) Each xlator that wants a compound fop packs the arguments by
> itself.

Packed how? Are we talking about XDR here, or something else? How is dict_t handled? Will there be generic packing/unpacking code somewhere, or is each translator expected to do this manually?

> 3) On the server side a de-compounder placed below server xlator
> unpacks the arguments and does the necessary operations.
>
> 4) Arguments for compound fops will be passed as array of union of
> structures where each structure is associated with a fop.
>
> 5) Each xlator will have _compound_fop () which receives the
> fop and does additional processing that is required for itself.

What happens when (not if) some translator fails to provide this? Is there a default function? Is there something at the end of the chain that will log an error if the fop gets that far without being handled (as with GF_FOP_IPC)?

> 6) Response will also be an array of union of response structures
> where each structure is associated with a fop's response.

What are the error semantics? Does processing of a series always stop at the first error, or are there some errors that allow retry/continue? If/when processing stops, who's responsible for cleaning up state changed by those parts that succeeded? What happens if the connection dies in the middle?

How are values returned from one operation in a series propagated as arguments for the next?

What are the implications for buffer and message sizes? What are the limits on how large these can get, and/or how many operations can be compounded?

How is synchronization handled? Is the inode locked for the duration of the compound operation, to prevent other operations from changing the context in which later parts of the compound operation execute? Are there possibilities for deadlock here?
Alternatively, if no locking is done, are we going to document the fact that compound operations are not atomic/linearizable?
Re: [Gluster-devel] compound fop design first cut
Hi,

After discussions with Pranith and Soumya, here is the design for compound fops:

1) fops will be compounded per inode, meaning 2 fops on different inodes can't be compounded (not because of the design, just reducing the scope of the problem).

2) Each xlator that wants a compound fop packs the arguments by itself.

3) On the server side a de-compounder placed below the server xlator unpacks the arguments and does the necessary operations.

4) Arguments for compound fops will be passed as an array of a union of structures, where each structure is associated with a fop.

5) Each xlator will have _compound_fop() which receives the fop and does additional processing that is required for itself.

6) Response will also be an array of a union of response structures, where each structure is associated with a fop's response.

Comments welcome!

----- Original Message -----
> From: "Milind Changire"
> To: "Jeff Darcy"
> Cc: "Gluster Devel"
> Sent: Friday, December 11, 2015 9:25:38 PM
> Subject: Re: [Gluster-devel] compound fop design first cut
>
> On Wed, Dec 9, 2015 at 8:02 PM, Jeff Darcy <jda...@redhat.com> wrote:
> > On December 9, 2015 at 7:07:06 AM, Ira Cooper (i...@redhat.com) wrote:
> > > A simple "abort on failure" and let the higher levels clean it up is
> > > probably right for the type of compounding I propose. It is what SMB2
> > > does. So, if you get an error return value, cancel the rest of the
> > > request, and have it return ECOMPOUND as the errno.
> >
> > This is exactly the part that worries me. If a compound operation
> > fails, some parts of it will often need to be undone. “Let the higher
> > levels clean it up” means that rollback code will be scattered among all
> > of the translators that use compound operations. Some of them will do
> > it right. Others . . . less so. ;) All will have to be tested
> > separately. If we centralize dispatch of compound operations into one
> > piece of code, we can centralize error detection and recovery likewise.
> > That ensures uniformity of implementation, and facilitates focused
> > testing (or even formal proof) of that implementation.
> >
> > Can we gain the same benefits with a more generic design? Perhaps. It
> > would require that the compounding translator know how to reverse each
> > type of operation, so that it can do so after an error. That’s
> > feasible, though it does mean maintaining a stack of undo actions
> > instead of a simple state. It might also mean testing combinations and
> > scenarios that will actually never occur in other components’ usage of
> > the compounding feature. More likely it means that people will *think*
> > they can use the facility in unanticipated ways, until their
> > unanticipated usage creates a combination or scenario that was never
> > tested and doesn’t work. Those are going to be hard problems to debug.
> > I think it’s better to be explicit about which permutations we actually
> > expect to work, and have those working earlier.
>
> Could we have a dry-run phase and a commit phase for the compound
> operation? The dry-run phase could test the validity of the transaction
> and the commit phase can actually perform the operation.
>
> If any operation in the dry-run sequence returns an error, the compound
> operation can be aborted immediately without the complexity of an
> undo ... scattered or centralized.
>
> But if the subsequent operations depend on the changed state of the
> system from earlier operations, then we'll have to introduce a system
> state object for such transactions ... and maybe serialize such
> operations. The system state object can be passed through the operation
> sequence. How well this idea would work in a multi-threaded world is
> not clear to me either.
--
Thanks,
Anuradha.
Re: [Gluster-devel] compound fop design first cut
On Wed, Dec 9, 2015 at 8:02 PM, Jeff Darcy wrote:
>
> On December 9, 2015 at 7:07:06 AM, Ira Cooper (i...@redhat.com) wrote:
> > A simple "abort on failure" and let the higher levels clean it up is
> > probably right for the type of compounding I propose. It is what SMB2
> > does. So, if you get an error return value, cancel the rest of the
> > request, and have it return ECOMPOUND as the errno.
>
> This is exactly the part that worries me. If a compound operation
> fails, some parts of it will often need to be undone. “Let the higher
> levels clean it up” means that rollback code will be scattered among all
> of the translators that use compound operations. Some of them will do
> it right. Others . . . less so. ;) All will have to be tested
> separately. If we centralize dispatch of compound operations into one
> piece of code, we can centralize error detection and recovery likewise.
> That ensures uniformity of implementation, and facilitates focused
> testing (or even formal proof) of that implementation.
>
> Can we gain the same benefits with a more generic design? Perhaps. It
> would require that the compounding translator know how to reverse each
> type of operation, so that it can do so after an error. That’s
> feasible, though it does mean maintaining a stack of undo actions
> instead of a simple state. It might also mean testing combinations and
> scenarios that will actually never occur in other components’ usage of
> the compounding feature. More likely it means that people will *think*
> they can use the facility in unanticipated ways, until their
> unanticipated usage creates a combination or scenario that was never
> tested and doesn’t work. Those are going to be hard problems to debug.
> I think it’s better to be explicit about which permutations we actually
> expect to work, and have those working earlier.
Could we have a dry-run phase and a commit phase for the compound operation? The dry-run phase could test the validity of the transaction and the commit phase can actually perform the operation.

If any operation in the dry-run sequence returns an error, the compound operation can be aborted immediately without the complexity of an undo ... scattered or centralized.

But if the subsequent operations depend on the changed state of the system from earlier operations, then we'll have to introduce a system state object for such transactions ... and maybe serialize such operations. The system state object can be passed through the operation sequence. How well this idea would work in a multi-threaded world is not clear to me either.
Re: [Gluster-devel] compound fop design first cut
On 12/09/2015 11:48 PM, Pranith Kumar Karampuri wrote:
On 12/09/2015 08:11 PM, Shyam wrote:
On 12/09/2015 02:37 AM, Soumya Koduri wrote:
On 12/09/2015 11:44 AM, Pranith Kumar Karampuri wrote:
On 12/09/2015 06:37 AM, Vijay Bellur wrote:
On 12/08/2015 03:45 PM, Jeff Darcy wrote:
On December 8, 2015 at 12:53:04 PM, Ira Cooper (i...@redhat.com) wrote:
Raghavendra Gowdappa writes:

I propose that we define a "compound op" that contains ops. Within each op, there are fields that can be "inherited" from the previous op, via use of a sentinel value. The sentinel is -1 for all of these examples. So:

LOOKUP(1, "foo") (Sets the gfid value to be picked up by compounding; 1 is the root directory, as a gfid, by convention.)
OPEN(-1, O_RDWR) (Uses the gfid value, sets the glfd compound value.)
WRITE(-1, "foo", 3) (Uses the glfd compound value.)
CLOSE(-1) (Uses the glfd compound value.)

So, basically, what the programming-language types would call futures and promises. It’s a good and well-studied concept, which is necessary to solve the second-order problem of how to specify an argument in sub-operation N+1 that’s not known until sub-operation N completes.

To be honest, some of the highly general approaches suggested here scare me too. Wrapping up the arguments for one sub-operation in xdata for another would get pretty hairy if we ever try to go beyond two sub-operations and have to nest sub-operation #3’s args within sub-operation #2’s xdata, which is itself encoded within sub-operation #1’s xdata. There’s also not much clarity about how to handle errors in that model. Encoding N sub-operations’ arguments in a linear structure as Shyam proposes seems a bit cleaner that way. If I were to continue down that route I’d suggest just having start_compound and end_compound fops, plus an extra field (or by-convention xdata key) that either the client-side or server-side translator could use to build whatever structure it wants and schedule sub-operations however it wants.
However, I’d be even more comfortable with an even simpler approach that avoids the need to solve what the database folks (who have dealt with complex transactions for years) would tell us is a really hard problem. Instead of designing for every case we can imagine, let’s design for the cases that we know would be useful for improving performance. Open plus read/write plus close is an obvious one. Raghavendra mentions create+inodelk as well. For each of those, we can easily define a structure that contains the necessary fields, we don’t need a client-side translator, and the server-side translator can take care of “forwarding” results from one sub-operation to the next. We could even use GF_FOP_IPC to prototype this. If we later find that the number of “one-off” compound requests is growing too large, then at least we’ll have some experience to guide our design of a more general alternative. Right now, I think we’re trying to look further ahead than we can see clearly.

Yes, agreed. This makes implementation on the client side simpler as well, so it is welcome. Just updating the solution:

1) New RPCs are going to be implemented.
2) The client stack will use these new fops.
3) On the server side the server xlator implements these new fops to decode the RPC request, then resolve_resume, and a compound-op-receiver (a better name for this is welcome) which sends one op after the other and sends back the compound fop response.

@Pranith, I assume you would expand on this at a later date (something along the lines of what Soumya has done below, right)?

I will talk to her tomorrow to learn more about this. I am not saying this is what I will be implementing (there doesn't seem to be any consensus yet), but I would love to know how it is implemented. Soumya and I had a discussion about this and it seems like the NFS way of stuffing the args works out at a high level. Even the sentinel-value-based approach may also be possible.
What I will do now is take a deep look at the structure and work out how all the fops mentioned in this thread can be implemented. I will update you about my findings in a couple of days.

Pranith

List of compound fops identified so far:

Swift/S3: PUT: creat(), write()s, setxattr(), fsync(), close(), rename()
Dht: mkdir + inodelk
Afr: xattrop+writev, xattrop+unlock to begin with.

Could everyone who needs compound fops add to this list?

I see that Niels is back on the 14th. Does anyone else know the list of compound fops he has in mind?

From the discussions we had with Niels regarding the kerberos support on GlusterFS, I think below is the set of compound fops which are required:

set_uid + set_gid + set_lkowner (or kerberos principal name) + actual_fop

Also, gfapi does a lookup (first time / to refresh the inode) before performing the actual fop most of the time. It may really help if we can club such fops - @Soumya

+5 (just a random number :) ) This came to my mind as well
Re: [Gluster-devel] compound fop design first cut
On 12/09/2015 08:11 PM, Shyam wrote:
On 12/09/2015 02:37 AM, Soumya Koduri wrote:
On 12/09/2015 11:44 AM, Pranith Kumar Karampuri wrote:
On 12/09/2015 06:37 AM, Vijay Bellur wrote:
On 12/08/2015 03:45 PM, Jeff Darcy wrote:
On December 8, 2015 at 12:53:04 PM, Ira Cooper (i...@redhat.com) wrote:
Raghavendra Gowdappa writes:

I propose that we define a "compound op" that contains ops. Within each op, there are fields that can be "inherited" from the previous op, via use of a sentinel value. The sentinel is -1 for all of these examples. So:

LOOKUP(1, "foo") (Sets the gfid value to be picked up by compounding; 1 is the root directory, as a gfid, by convention.)
OPEN(-1, O_RDWR) (Uses the gfid value, sets the glfd compound value.)
WRITE(-1, "foo", 3) (Uses the glfd compound value.)
CLOSE(-1) (Uses the glfd compound value.)

So, basically, what the programming-language types would call futures and promises. It’s a good and well-studied concept, which is necessary to solve the second-order problem of how to specify an argument in sub-operation N+1 that’s not known until sub-operation N completes.

To be honest, some of the highly general approaches suggested here scare me too. Wrapping up the arguments for one sub-operation in xdata for another would get pretty hairy if we ever try to go beyond two sub-operations and have to nest sub-operation #3’s args within sub-operation #2’s xdata, which is itself encoded within sub-operation #1’s xdata. There’s also not much clarity about how to handle errors in that model. Encoding N sub-operations’ arguments in a linear structure as Shyam proposes seems a bit cleaner that way. If I were to continue down that route I’d suggest just having start_compound and end_compound fops, plus an extra field (or by-convention xdata key) that either the client-side or server-side translator could use to build whatever structure it wants and schedule sub-operations however it wants.
However, I’d be even more comfortable with an even simpler approach that avoids the need to solve what the database folks (who have dealt with complex transactions for years) would tell us is a really hard problem. Instead of designing for every case we can imagine, let’s design for the cases that we know would be useful for improving performance. Open plus read/write plus close is an obvious one. Raghavendra mentions create+inodelk as well. For each of those, we can easily define a structure that contains the necessary fields, we don’t need a client-side translator, and the server-side translator can take care of “forwarding” results from one sub-operation to the next. We could even use GF_FOP_IPC to prototype this. If we later find that the number of “one-off” compound requests is growing too large, then at least we’ll have some experience to guide our design of a more general alternative. Right now, I think we’re trying to look further ahead than we can see clearly.

Yes, agreed. This makes implementation on the client side simpler as well, so it is welcome. Just updating the solution:

1) New RPCs are going to be implemented.
2) The client stack will use these new fops.
3) On the server side the server xlator implements these new fops to decode the RPC request, then resolve_resume, and a compound-op-receiver (a better name for this is welcome) which sends one op after the other and sends back the compound fop response.

@Pranith, I assume you would expand on this at a later date (something along the lines of what Soumya has done below, right)?

I will talk to her tomorrow to learn more about this. I am not saying this is what I will be implementing (there doesn't seem to be any consensus yet), but I would love to know how it is implemented.

Pranith

List of compound fops identified so far:

Swift/S3: PUT: creat(), write()s, setxattr(), fsync(), close(), rename()
Dht: mkdir + inodelk
Afr: xattrop+writev, xattrop+unlock to begin with.

Could everyone who needs compound fops add to this list?
I see that Niels is back on the 14th. Does anyone else know the list of compound fops he has in mind?

From the discussions we had with Niels regarding the kerberos support on GlusterFS, I think below is the set of compound fops which are required:

set_uid + set_gid + set_lkowner (or kerberos principal name) + actual_fop

Also, gfapi does a lookup (first time / to refresh the inode) before performing the actual fop most of the time. It may really help if we can club such fops - @Soumya

+5 (just a random number :) ) This came to my mind as well, and is a good candidate for compounding: LOOKUP + FOP (OPEN etc.)

Coming to the design proposed, I agree with Shyam, Ira and Jeff's thoughts. Defining different compound fops for each specific set of operations and wrapping up those arguments in xdata seems rather complex and difficult to maintain going further. Having worked with NFS, may I suggest we follow (or something along those lines) the approach taken by the NFS protocol to define and implement
Re: [Gluster-devel] compound fop design first cut
On 12/09/2015 08:08 PM, Shyam wrote:
On 12/09/2015 12:52 AM, Pranith Kumar Karampuri wrote:
On 12/09/2015 10:39 AM, Prashanth Pai wrote:

However, I’d be even more comfortable with an even simpler approach that avoids the need to solve what the database folks (who have dealt with complex transactions for years) would tell us is a really hard problem. Instead of designing for every case we can imagine, let’s design for the cases that we know would be useful for improving performance. Open plus read/write plus close is an obvious one. Raghavendra mentions create+inodelk as well.

From the object interface (Swift/S3) perspective, this is the fop order and flow for object operations:

GET: open(), fstat(), fgetxattr()s, read()s, close()

Krutika implemented fstat+fgetxattr (http://review.gluster.org/10180). In posix there is an implementation of GF_CONTENT_KEY which is used to read a file in lookup by quick-read. This needs to be exposed for fds as well, I think. So you can do all of this using fstat on an anon-fd.

HEAD: stat(), getxattr()s

Krutika already implemented this for sharding: http://review.gluster.org/10158. You can do this using the stat fop.

I believe we need to fork this part of the conversation, i.e. the stat + xattr information clubbing. My view on a stat for gluster is POSIX stat plus gluster extended information being returned. I state this because a file system, when it stats its inode, should get all information regarding the inode, and not just the POSIX fields. In the case of other local FSes, the inode structure has more fields than just what POSIX needs, so when the inode is *read* the FS can populate all its internal inode information and return to the application/syscall the relevant fields that it needs. I believe gluster should do the same, so in the cases above we should actually extend our stat information (not elaborating how) to include all information from the brick, i.e. stat from POSIX and all the extended attrs for the inode (file or dir).
This can then be consumed by any layer as needed. Currently each layer adds what it needs, in addition to the stat information, in the xdata as an xattr request; this can continue or go away if the relevant FOPs return the whole inode information upward. This also has useful outcomes in readdirp calls, where we get the extended stat information for each entry.

You can use "list-xattr" in the xdata request to get this.

With the patches referred to, and older patches, this seems to be the direction sought (around 2013); any reasons why this is not prevalent across the stack and made so? Or am I mistaken?

No reason. We can revive it. There didn't seem to be any interest, so I didn't follow up to get it in.

Pranith

PUT: creat(), write()s, setxattr(), fsync(), close(), rename()

This I think should be a new compound fop. Nothing similar exists.

DELETE: getxattr(), unlink()

This can also be clubbed into unlink already, because xdata exists on the wire already.

Compounding some of these ops and exposing them as consumable libgfapi APIs like glfs_get() and glfs_put(), similar to the librados compound APIs [1], would greatly improve performance for object-based access.

[1]: https://github.com/ceph/ceph/blob/master/src/include/rados/librados.h#L2219

Thanks.

- Prashanth Pai
Re: [Gluster-devel] compound fop design first cut
On 12/09/2015 09:32 AM, Jeff Darcy wrote:
On December 9, 2015 at 7:07:06 AM, Ira Cooper (i...@redhat.com) wrote:

A simple "abort on failure" and let the higher levels clean it up is probably right for the type of compounding I propose. It is what SMB2 does. So, if you get an error return value, cancel the rest of the request, and have it return ECOMPOUND as the errno.

This is exactly the part that worries me. If a compound operation fails, some parts of it will often need to be undone. “Let the higher levels clean it up” means that rollback code will be scattered among all of the translators that use compound operations. Some of them will do it right. Others . . . less so. ;) All will have to be tested separately. If we centralize dispatch of compound operations into one piece of code, we can centralize error detection and recovery likewise. That ensures uniformity of implementation, and facilitates focused testing (or even formal proof) of that implementation.

My take on this is that whichever layer started the compounding takes the error handling into account. I do not see any requirement for undoing things that are done, and would almost say (without further thought (that's the gunslinger in me talking ;) )) that this is not supported as a part of the compounding.

Can we gain the same benefits with a more generic design? Perhaps. It would require that the compounding translator know how to reverse each type of operation, so that it can do so after an error. That’s feasible, though it does mean maintaining a stack of undo actions instead of a simple state. It might also mean testing combinations and scenarios that will actually never occur in other components’ usage of the compounding feature. More likely it means that people will *think* they can use the facility in unanticipated ways, until their unanticipated usage creates a combination or scenario that was never tested and doesn’t work. Those are going to be hard problems to debug.
I think it’s better to be explicit about which permutations we actually expect to work, and have those working earlier.

Jeff, a clarification: are you suggesting fop_xxx extensions for each compound operation supported? Or suggesting a *single* FOP that carries compounded requests but is specific about which requests can be compounded? (For example, it allows open+write, but when building out the compound request, disallows *say* anything else.) (If there is any doubt, I am with the latter and not so gaga about the former, as it explodes the FOP list.)

Also, I think the compound list has exploded (in this mail conversation) and provided a lot of compounding requests... I would say this means we need a clear way of doing the latter.

P.S: Ignore this... gunslinger: "a man who carries a gun and shoots well." I claim to be neither... just stating.
Re: [Gluster-devel] compound fop design first cut
----- Original Message -----
> From: "Ira Cooper"
> To: "Jeff Darcy", "Raghavendra Gowdappa", "Pranith Kumar Karampuri"
> Cc: "Gluster Devel"
> Sent: Wednesday, December 9, 2015 5:37:05 PM
> Subject: Re: [Gluster-devel] compound fop design first cut
>
> Jeff Darcy writes:
>
> > However, I’d be even more comfortable with an even simpler approach that
> > avoids the need to solve what the database folks (who have dealt with
> > complex transactions for years) would tell us is a really hard problem.
> > Instead of designing for every case we can imagine, let’s design for the
> > cases that we know would be useful for improving performance. Open plus
> > read/write plus close is an obvious one. Raghavendra mentions
> > create+inodelk as well. For each of those, we can easily define a
> > structure that contains the necessary fields, we don’t need a
> > client-side translator, and the server-side translator can take care of
> > “forwarding” results from one sub-operation to the next. We could even
> > use GF_FOP_IPC to prototype this. If we later find that the number of
> > “one-off” compound requests is growing too large, then at least we’ll
> > have some experience to guide our design of a more general alternative.
> > Right now, I think we’re trying to look further ahead than we can see
> > clearly.
>
> Actually, I'm taking the design I've seen another network protocol use,
> SMB2, and proposing it here; I'd be shocked if NFS doesn't behave in the
> same way.
>
> Interestingly, all the cases really deal with a single file, and a
> single lock, and a single...
>
> There's a reason I talked about a single sentinel value, and not
> multiple ones: because I wanted to keep it simple. Yes, the extensions
> you mention are obvious, but they lead to a giant mess that we may not
> want initially. (But that we CAN extend into if we want them. I made
> the choice not to go there because honestly, I found the complexity too
> much for me.)
> > A simple "abort on failure" and let the higher levels clean it up is > probably right for the type of compounding I propose. It is what SMB2 > does. So, if you get an error return value, cancel the rest of the > request, and have it return ECOMPOUND as the errno. > > Note: How you keep the list to be compounded doesn't matter much to me. > the semantics matter, because those are what I can ask for later, and > allow us to create ops the original desginers hadn't thought of, which > is usually the hallmark of a good design. > > I think you should look for a simple design you can "grow into" instead > of creating one off ops, to satisfy a demand today. > I agree with Ira here. This problem is already addressed by NFS and SMB. So instead of reinventing the wheel lets pick the best bits from these solutions and incorporate in Gluster. From multi-protocol point of view we like to compound operations like open + set_leaseID + lk and many more. With the current approach it would be really messy to have separate functions for each such combinations and a dedicated translator to handle them. As others have mentioned I think it would be better to have a general fop (fop_compound) which can handle compound fop. Each translator can choose to implement it or not. Each translator can take a decision whether to compound more fops or de-compound them. e.g. currently you can make the protocol server de-compound all the compound fops. -Rajesh > My thoughts, > > -Ira > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] compound fop design first cut
On 12/09/2015 02:37 AM, Soumya Koduri wrote: On 12/09/2015 11:44 AM, Pranith Kumar Karampuri wrote: On 12/09/2015 06:37 AM, Vijay Bellur wrote: On 12/08/2015 03:45 PM, Jeff Darcy wrote: On December 8, 2015 at 12:53:04 PM, Ira Cooper (i...@redhat.com) wrote: Raghavendra Gowdappa writes: I propose that we define a "compound op" that contains ops. Within each op, there are fields that can be "inherited" from the previous op, via use of a sentinel value. Sentinel is -1, for all of these examples. So: LOOKUP (1, "foo") (Sets the gfid value to be picked up by compounding, 1 is the root directory, as a gfid, by convention.) OPEN(-1, O_RDWR) (Uses the gfid value, sets the glfd compound value.) WRITE(-1, "foo", 3) (Uses the glfd compound value.) CLOSE(-1) (Uses the glfd compound value) So, basically, what the programming-language types would call futures and promises. It’s a good and well studied concept, which is necessary to solve the second-order problem of how to specify an argument in sub-operation N+1 that’s not known until sub-operation N completes. To be honest, some of the highly general approaches suggested here scare me too. Wrapping up the arguments for one sub-operation in xdata for another would get pretty hairy if we ever try to go beyond two sub-operations and have to nest sub-operation #3’s args within sub-operation #2’s xdata which is itself encoded within sub-operation #1’s xdata. There’s also not much clarity about how to handle errors in that model. Encoding N sub-operations’ arguments in a linear structure as Shyam proposes seems a bit cleaner that way. If I were to continue down that route I’d suggest just having start_compound and end-compound fops, plus an extra field (or by-convention xdata key) that either the client-side or server-side translator could use to build whatever structure it wants and schedule sub-operations however it wants. 
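The sentinel example quoted above (LOOKUP sets the gfid, OPEN(-1, ...) inherits it, WRITE and CLOSE inherit the glfd) is essentially a tiny interpreter over per-compound state. The following is a toy model of that idea only; none of these types or values exist in GlusterFS.

```c
/* Toy model of sentinel-based argument inheritance in a compound op:
 * -1 in an argument slot means "use the value produced by an earlier
 * sub-op in this compound".  Purely illustrative. */
#include <stddef.h>

#define SENTINEL (-1L)

typedef struct {
    long gfid;       /* set by LOOKUP */
    long glfd;       /* set by OPEN */
    size_t written;  /* accumulated by WRITE */
    int is_open;     /* cleared by CLOSE */
} cstate_t;

/* Resolve an op argument: SENTINEL means "inherit". */
static long resolve(long arg, long inherited)
{
    return arg == SENTINEL ? inherited : arg;
}

/* LOOKUP(1, "foo"); OPEN(-1, O_RDWR); WRITE(-1, "foo", 3); CLOSE(-1) */
cstate_t sentinel_demo(void)
{
    cstate_t st = { 0, 0, 0, 0 };

    /* LOOKUP: sets the gfid to be picked up by later sub-ops.
     * 1 is the root directory by convention. */
    st.gfid = 1;

    /* OPEN: gfid argument is SENTINEL, so inherit st.gfid. */
    long gfid = resolve(SENTINEL, st.gfid);
    st.glfd = gfid * 100 + 42;   /* pretend fd derived from the lookup */
    st.is_open = 1;

    /* WRITE: glfd argument is SENTINEL, so inherit st.glfd. */
    if (resolve(SENTINEL, st.glfd) == st.glfd)
        st.written += 3;         /* WRITE(-1, "foo", 3) */

    /* CLOSE: again inherits the glfd. */
    if (resolve(SENTINEL, st.glfd) == st.glfd)
        st.is_open = 0;

    return st;
}
```

This is the "futures and promises" shape Jeff describes: sub-op N+1 names a value that only exists once sub-op N has completed.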
However, I’d be even more comfortable with an even simpler approach that avoids the need to solve what the database folks (who have dealt with complex transactions for years) would tell us is a really hard problem. Instead of designing for every case we can imagine, let’s design for the cases that we know would be useful for improving performance. Open plus read/write plus close is an obvious one. Raghavendra mentions create+inodelk as well. For each of those, we can easily define a structure that contains the necessary fields, we don’t need a client-side translator, and the server-side translator can take care of “forwarding” results from one sub-operation to the next. We could even use GF_FOP_IPC to prototype this. If we later find that the number of “one-off” compound requests is growing too large, then at least we’ll have some experience to guide our design of a more general alternative. Right now, I think we’re trying to look further ahead than we can see clearly. Yes, agreed. This makes implementation on the client side simpler as well. So it is welcome. Just updating the solution. 1) New RPCs are going to be implemented. 2) The client stack will use these new fops. 3) On the server side we have the server xlator implementing these new fops to decode the RPC request, then resolve_resume and a compound-op-receiver (a better name is welcome) which sends one op after another and sends the compound fop response. @Pranith, I assume you would expand on this at a later date (something along the lines of what Soumya has done below, right?) List of compound fops identified so far: Swift/S3: PUT: creat(), write()s, setxattr(), fsync(), close(), rename() Dht: mkdir + inodelk Afr: xattrop+writev, xattrop+unlock to begin with. Could everyone who needs compound fops add to this list? I see that Niels is back on 14th. Does anyone else know the list of compound fops he has in mind?
From the discussions we had with Niels regarding the kerberos support on GlusterFS, I think below is the set of compound fops which are required. set_uid + set_gid + set_lkowner (or kerberos principal name) + actual_fop Also gfapi does a lookup (first time/to refresh the inode) before performing the actual fop most of the time. It may really help if we can club such fops - @Soumya +5 (just a random number :) ) This came to my mind as well, and is a good candidate for compounding. LOOKUP + FOP (OPEN etc.) Coming to the design proposed, I agree with Shyam, Ira and Jeff's thoughts. Defining different compound fops for each specific set of operations and wrapping up those arguments in xdata seems rather complex and difficult to maintain going further. Having worked with NFS, may I suggest we follow (or do something along the lines of) the approach taken by the NFS protocol to define and implement compound procedures. The basic structure of the NFS COMPOUND procedure is:

+-----+--------------+--------+-----------+-----------+-----------+
| tag | minorversion | numops | op + args | op + args | op + args |
+-----+--------------+--------+-----------+-----------+-----------+
Re: [Gluster-devel] compound fop design first cut
On 12/09/2015 12:52 AM, Pranith Kumar Karampuri wrote: On 12/09/2015 10:39 AM, Prashanth Pai wrote: However, I’d be even more comfortable with an even simpler approach that avoids the need to solve what the database folks (who have dealt with complex transactions for years) would tell us is a really hard problem. Instead of designing for every case we can imagine, let’s design for the cases that we know would be useful for improving performance. Open plus read/write plus close is an obvious one. Raghavendra mentions create+inodelk as well. From object interface (Swift/S3) perspective, this is the fop order and flow for object operations: GET: open(), fstat(), fgetxattr()s, read()s, close() Krutika implemented fstat+fgetxattr(http://review.gluster.org/10180). In posix there is an implementation of GF_CONTENT_KEY which is used to read a file in lookup by quick-read. This needs to be exposed for fds as well I think. So you can do all this using fstat on anon-fd. HEAD: stat(), getxattr()s Krutika already implemented this for sharding http://review.gluster.org/10158. You can do this using stat fop. I believe we need to fork this part of the conversation, i.e the stat + xattr information clubbing. My view on a stat for gluster is, POSIX stat + gluster extended information being returned. I state this as, a file system when it stats its inode, should get all information regarding the inode, and not just the POSIX ones. In the case of other local FS, the inode structure has more fields than just what POSIX needs, so when the inode is *read* the FS can populate all its internal inode information and return to the application/syscall the relevant fields that it needs. I believe gluster should do the same, so in the cases above, we should actually extend our stat information (not elaborating how) to include all information from the brick, i.e stat from POSIX and all the extended attrs for the inode (file or dir). This can then be consumed by any layer as needed. 
Currently, each layer adds what it needs, in addition to the stat information, in the xdata as an xattr request; this can continue or go away if the relevant FOPs return the whole inode information upward. This also has useful outcomes in readdirp calls, where we get the extended stat information for each entry. With the patches referred to, and older patches, this seems to be the direction sought (around 2013); are there any reasons why this is not prevalent across the stack and made so? Or am I mistaken? PUT: creat(), write()s, setxattr(), fsync(), close(), rename() This I think should be a new compound fop. Nothing similar exists. DELETE: getxattr(), unlink() This can also be clubbed into unlink because xdata already exists on the wire. Compounding some of these ops and exposing them as consumable libgfapi APIs like glfs_get() and glfs_put(), similar to librados compound APIs[1], would greatly improve performance for object-based access. [1]: https://github.com/ceph/ceph/blob/master/src/include/rados/librados.h#L2219 Thanks. - Prashanth Pai
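For reference, the open+fstat+read()s+close sequence that a compound glfs_get() is proposed to collapse looks like this in plain POSIX. Note that glfs_get()/glfs_put() are not existing libgfapi calls, and object_get() below is only an illustrative stand-in showing the shape such an API could take.

```c
/* Sketch of the GET sequence (open, fstat, read()s, close) that a
 * proposed compound glfs_get() could collapse into one round trip.
 * Plain POSIX used for illustration; object_get() is a hypothetical name. */
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Read up to buflen bytes of an "object"; returns bytes read or -errno. */
ssize_t object_get(const char *path, char *buf, size_t buflen, struct stat *st)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -errno;
    if (fstat(fd, st) < 0) {        /* stat while the fd is held */
        int e = errno;
        close(fd);
        return -e;
    }
    ssize_t total = 0;
    while ((size_t)total < buflen) { /* read()s until EOF or buffer full */
        ssize_t n = read(fd, buf + total, buflen - total);
        if (n < 0) {
            int e = errno;
            close(fd);
            return -e;
        }
        if (n == 0)
            break;
        total += n;
    }
    close(fd);
    return total;
}
```

On the wire, the whole sequence would travel as one compound request instead of four or more round trips, which is where the performance win for object access comes from.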
Re: [Gluster-devel] compound fop design first cut
libgfapi compound fops added inline. - Original Message - > From: "Kotresh Hiremath Ravishankar" > To: "Pranith Kumar Karampuri" > Cc: "Gluster Devel" > Sent: Wednesday, December 9, 2015 2:18:47 PM > Subject: Re: [Gluster-devel] compound fop design first cut > > Geo-rep requirements inline. > > Thanks and Regards, > Kotresh H R > > - Original Message - > > From: "Pranith Kumar Karampuri" > > To: "Vijay Bellur" , "Jeff Darcy" , > > "Raghavendra Gowdappa" > > , "Ira Cooper" > > Cc: "Gluster Devel" > > Sent: Wednesday, December 9, 2015 11:44:52 AM > > Subject: Re: [Gluster-devel] compound fop design first cut > > > > > > > > On 12/09/2015 06:37 AM, Vijay Bellur wrote: > > > On 12/08/2015 03:45 PM, Jeff Darcy wrote: > > >> > > >> > > >> > > >> On December 8, 2015 at 12:53:04 PM, Ira Cooper (i...@redhat.com) wrote: > > >>> Raghavendra Gowdappa writes: > > >>> I propose that we define a "compound op" that contains ops. > > >>> > > >>> Within each op, there are fields that can be "inherited" from the > > >>> previous op, via use of a sentinel value. > > >>> > > >>> Sentinel is -1, for all of these examples. > > >>> > > >>> So: > > >>> > > >>> LOOKUP (1, "foo") (Sets the gfid value to be picked up by > > >>> compounding, 1 > > >>> is the root directory, as a gfid, by convention.) > > >>> OPEN(-1, O_RDWR) (Uses the gfid value, sets the glfd compound value.) > > >>> WRITE(-1, "foo", 3) (Uses the glfd compound value.) > > >>> CLOSE(-1) (Uses the glfd compound value) > > >> > > >> So, basically, what the programming-language types would call futures > > >> and promises. It’s a good and well studied concept, which is necessary > > >> to solve the second-order problem of how to specify an argument in > > >> sub-operation N+1 that’s not known until sub-operation N completes. > > >> > > >> To be honest, some of the highly general approaches suggested here scare > > >> me too. 
Wrapping up the arguments for one sub-operation in xdata for > > >> another would get pretty hairy if we ever try to go beyond two > > >> sub-operations and have to nest sub-operation #3’s args within > > >> sub-operation #2’s xdata which is itself encoded within sub-operation > > >> #1’s xdata. There’s also not much clarity about how to handle errors in > > >> that model. Encoding N sub-operations’ arguments in a linear structure > > >> as Shyam proposes seems a bit cleaner that way. If I were to continue > > >> down that route I’d suggest just having start_compound and end-compound > > >> fops, plus an extra field (or by-convention xdata key) that either the > > >> client-side or server-side translator could use to build whatever > > >> structure it wants and schedule sub-operations however it wants. > > >> > > >> However, I’d be even more comfortable with an even simpler approach that > > >> avoids the need to solve what the database folks (who have dealt with > > >> complex transactions for years) would tell us is a really hard problem. > > >> Instead of designing for every case we can imagine, let’s design for the > > >> cases that we know would be useful for improving performance. Open plus > > >> read/write plus close is an obvious one. Raghavendra mentions > > >> create+inodelk as well. For each of those, we can easily define a > > >> structure that contains the necessary fields, we don’t need a > > >> client-side translator, and the server-side translator can take care of > > >> “forwarding” results from one sub-operation to the next. We could even > > >> use GF_FOP_IPC to prototype this. If we later find that the number of > > >> “one-off” compound requests is growing too large, then at least we’ll > > >> have some experience to guide our design of a more general alternative. > > >> Right now, I think we’re trying to look further ahead than we can see > > >> clearly. > > Yes Agree. This makes implementation on the client side simpler as well. 
> > So it is welcome. > > > > Just updating the solution. > > 1) New RPCs are going to be implemented. >
Re: [Gluster-devel] compound fop design first cut
On December 9, 2015 at 7:07:06 AM, Ira Cooper (i...@redhat.com) wrote: > A simple "abort on failure" and let the higher levels clean it up is > probably right for the type of compounding I propose. It is what SMB2 > does. So, if you get an error return value, cancel the rest of the > request, and have it return ECOMPOUND as the errno. This is exactly the part that worries me. If a compound operation fails, some parts of it will often need to be undone. “Let the higher levels clean it up” means that rollback code will be scattered among all of the translators that use compound operations. Some of them will do it right. Others . . . less so. ;) All will have to be tested separately. If we centralize dispatch of compound operations into one piece of code, we can centralize error detection and recovery likewise. That ensures uniformity of implementation, and facilitates focused testing (or even formal proof) of that implementation. Can we gain the same benefits with a more generic design? Perhaps. It would require that the compounding translator know how to reverse each type of operation, so that it can do so after an error. That’s feasible, though it does mean maintaining a stack of undo actions instead of a simple state. It might also mean testing combinations and scenarios that will actually never occur in other components’ usage of the compounding feature. More likely it means that people will *think* they can use the facility in unanticipated ways, until their unanticipated usage creates a combination or scenario that was never tested and doesn’t work. Those are going to be hard problems to debug. I think it’s better to be explicit about which permutations we actually expect to work, and have those working earlier.
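The centralized-recovery alternative described above, a dispatcher that keeps a stack of undo actions and replays it on the first error, could be sketched as follows. The types and names (txop_t, compound_txn) are hypothetical, not GlusterFS code.

```c
/* Sketch of centralized compound rollback: the dispatcher records an
 * undo action per completed sub-op and, on the first failure, replays
 * the undo stack in reverse.  Hypothetical types and names. */
#include <errno.h>
#include <stddef.h>

typedef struct {
    int  (*exec)(void *arg);  /* 0 on success, -errno on failure */
    void (*undo)(void *arg);  /* reverses exec; NULL if nothing to undo */
    void *arg;
} txop_t;

/* Returns -1 if all sub-ops succeeded; otherwise the index of the
 * failed sub-op, after undoing ops 0..index-1 in reverse order. */
int compound_txn(txop_t *ops, size_t nops)
{
    size_t done;
    for (done = 0; done < nops; done++)
        if (ops[done].exec(ops[done].arg) < 0)
            break;
    if (done == nops)
        return -1;
    for (size_t i = done; i > 0; i--)     /* centralized rollback */
        if (ops[i - 1].undo)
            ops[i - 1].undo(ops[i - 1].arg);
    return (int)done;
}

/* Demo: two ops bump a counter, the third fails; rollback restores it. */
static int  inc_op(void *c)  { ++*(int *)c; return 0; }
static void dec_op(void *c)  { --*(int *)c; }
static int  fail_op(void *c) { (void)c; return -EIO; }

int txn_demo(int *counter)
{
    txop_t ops[3] = {
        { inc_op,  dec_op, counter },
        { inc_op,  dec_op, counter },
        { fail_op, NULL,   counter },
    };
    return compound_txn(ops, 3);
}
```

The trade-off Jeff raises is visible even in this sketch: every op type needs a correct undo, and the dispatcher, not each caller, owns the rollback ordering.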
Re: [Gluster-devel] compound fop design first cut
Jeff Darcy writes: > However, I’d be even more comfortable with an even simpler approach that > avoids the need to solve what the database folks (who have dealt with > complex transactions for years) would tell us is a really hard problem. > Instead of designing for every case we can imagine, let’s design for the > cases that we know would be useful for improving performance. Open plus > read/write plus close is an obvious one. Raghavendra mentions > create+inodelk as well. For each of those, we can easily define a > structure that contains the necessary fields, we don’t need a > client-side translator, and the server-side translator can take care of > “forwarding” results from one sub-operation to the next. We could even > use GF_FOP_IPC to prototype this. If we later find that the number of > “one-off” compound requests is growing too large, then at least we’ll > have some experience to guide our design of a more general alternative. > Right now, I think we’re trying to look further ahead than we can see > clearly. Actually, I'm taking the design, I've seen another network protocol use, SMB2, and proposing it here, I'd be shocked if NFS doesn't behave in the same way. Interestingly, all the cases, really deal with a single file, and a single lock, and a single... There's a reason I talked about a single sentinel value, and not multiple ones. Because I wanted to keep it simple. Yes, the extensions you mention are obvious, but they lead to a giant mess, that we may not want initially. (But that we CAN extend into if we want them. I made the choice not to go there because honestly, I found the complexity too much for me.) A simple "abort on failure" and let the higher levels clean it up is probably right for the type of compounding I propose. It is what SMB2 does. So, if you get an error return value, cancel the rest of the request, and have it return ECOMPOUND as the errno. Note: How you keep the list to be compounded doesn't matter much to me. 
The semantics matter, because those are what I can ask for later, and allow us to create ops the original designers hadn't thought of, which is usually the hallmark of a good design. I think you should look for a simple design you can "grow into" instead of creating one-off ops to satisfy a demand today. My thoughts, -Ira
Re: [Gluster-devel] compound fop design first cut
Geo-rep requirements inline. Thanks and Regards, Kotresh H R - Original Message - > From: "Pranith Kumar Karampuri" > To: "Vijay Bellur" , "Jeff Darcy" , > "Raghavendra Gowdappa" > , "Ira Cooper" > Cc: "Gluster Devel" > Sent: Wednesday, December 9, 2015 11:44:52 AM > Subject: Re: [Gluster-devel] compound fop design first cut > > > > On 12/09/2015 06:37 AM, Vijay Bellur wrote: > > On 12/08/2015 03:45 PM, Jeff Darcy wrote: > >> > >> > >> > >> On December 8, 2015 at 12:53:04 PM, Ira Cooper (i...@redhat.com) wrote: > >>> Raghavendra Gowdappa writes: > >>> I propose that we define a "compound op" that contains ops. > >>> > >>> Within each op, there are fields that can be "inherited" from the > >>> previous op, via use of a sentinel value. > >>> > >>> Sentinel is -1, for all of these examples. > >>> > >>> So: > >>> > >>> LOOKUP (1, "foo") (Sets the gfid value to be picked up by > >>> compounding, 1 > >>> is the root directory, as a gfid, by convention.) > >>> OPEN(-1, O_RDWR) (Uses the gfid value, sets the glfd compound value.) > >>> WRITE(-1, "foo", 3) (Uses the glfd compound value.) > >>> CLOSE(-1) (Uses the glfd compound value) > >> > >> So, basically, what the programming-language types would call futures > >> and promises. It’s a good and well studied concept, which is necessary > >> to solve the second-order problem of how to specify an argument in > >> sub-operation N+1 that’s not known until sub-operation N completes. > >> > >> To be honest, some of the highly general approaches suggested here scare > >> me too. Wrapping up the arguments for one sub-operation in xdata for > >> another would get pretty hairy if we ever try to go beyond two > >> sub-operations and have to nest sub-operation #3’s args within > >> sub-operation #2’s xdata which is itself encoded within sub-operation > >> #1’s xdata. There’s also not much clarity about how to handle errors in > >> that model. 
Encoding N sub-operations’ arguments in a linear structure > >> as Shyam proposes seems a bit cleaner that way. If I were to continue > >> down that route I’d suggest just having start_compound and end-compound > >> fops, plus an extra field (or by-convention xdata key) that either the > >> client-side or server-side translator could use to build whatever > >> structure it wants and schedule sub-operations however it wants. > >> > >> However, I’d be even more comfortable with an even simpler approach that > >> avoids the need to solve what the database folks (who have dealt with > >> complex transactions for years) would tell us is a really hard problem. > >> Instead of designing for every case we can imagine, let’s design for the > >> cases that we know would be useful for improving performance. Open plus > >> read/write plus close is an obvious one. Raghavendra mentions > >> create+inodelk as well. For each of those, we can easily define a > >> structure that contains the necessary fields, we don’t need a > >> client-side translator, and the server-side translator can take care of > >> “forwarding” results from one sub-operation to the next. We could even > >> use GF_FOP_IPC to prototype this. If we later find that the number of > >> “one-off” compound requests is growing too large, then at least we’ll > >> have some experience to guide our design of a more general alternative. > >> Right now, I think we’re trying to look further ahead than we can see > >> clearly. > Yes Agree. This makes implementation on the client side simpler as well. > So it is welcome. > > Just updating the solution. > 1) New RPCs are going to be implemented. > 2) client stack will use these new fops. > 3) On the server side we have server xlator implementing these new fops > to decode the RPC request then resolve_resume and > compound-op-receiver(Better name for this is welcome) which sends one op > after other and send compound fop response. 
> > List of compound fops identified so far: > Swift/S3: > PUT: creat(), write()s, setxattr(), fsync(), close(), rename() > > Dht: > mkdir + inodelk > > Afr: > xattrop+writev, xattrop+unlock to begin with. Geo-rep: mknod,entrylk,stat(on backend gfid) mkdir,entrylk,stat (on backend gfid) symlink,entrylk,stat(on backend
Re: [Gluster-devel] compound fop design first cut
On 12/09/2015 11:44 AM, Pranith Kumar Karampuri wrote: On 12/09/2015 06:37 AM, Vijay Bellur wrote: On 12/08/2015 03:45 PM, Jeff Darcy wrote: On December 8, 2015 at 12:53:04 PM, Ira Cooper (i...@redhat.com) wrote: Raghavendra Gowdappa writes: I propose that we define a "compound op" that contains ops. Within each op, there are fields that can be "inherited" from the previous op, via use of a sentinel value. Sentinel is -1, for all of these examples. So: LOOKUP (1, "foo") (Sets the gfid value to be picked up by compounding, 1 is the root directory, as a gfid, by convention.) OPEN(-1, O_RDWR) (Uses the gfid value, sets the glfd compound value.) WRITE(-1, "foo", 3) (Uses the glfd compound value.) CLOSE(-1) (Uses the glfd compound value) So, basically, what the programming-language types would call futures and promises. It’s a good and well studied concept, which is necessary to solve the second-order problem of how to specify an argument in sub-operation N+1 that’s not known until sub-operation N completes. To be honest, some of the highly general approaches suggested here scare me too. Wrapping up the arguments for one sub-operation in xdata for another would get pretty hairy if we ever try to go beyond two sub-operations and have to nest sub-operation #3’s args within sub-operation #2’s xdata which is itself encoded within sub-operation #1’s xdata. There’s also not much clarity about how to handle errors in that model. Encoding N sub-operations’ arguments in a linear structure as Shyam proposes seems a bit cleaner that way. If I were to continue down that route I’d suggest just having start_compound and end-compound fops, plus an extra field (or by-convention xdata key) that either the client-side or server-side translator could use to build whatever structure it wants and schedule sub-operations however it wants. 
However, I’d be even more comfortable with an even simpler approach that avoids the need to solve what the database folks (who have dealt with complex transactions for years) would tell us is a really hard problem. Instead of designing for every case we can imagine, let’s design for the cases that we know would be useful for improving performance. Open plus read/write plus close is an obvious one. Raghavendra mentions create+inodelk as well. For each of those, we can easily define a structure that contains the necessary fields, we don’t need a client-side translator, and the server-side translator can take care of “forwarding” results from one sub-operation to the next. We could even use GF_FOP_IPC to prototype this. If we later find that the number of “one-off” compound requests is growing too large, then at least we’ll have some experience to guide our design of a more general alternative. Right now, I think we’re trying to look further ahead than we can see clearly. Yes, agreed. This makes implementation on the client side simpler as well. So it is welcome. Just updating the solution. 1) New RPCs are going to be implemented. 2) The client stack will use these new fops. 3) On the server side we have the server xlator implementing these new fops to decode the RPC request, then resolve_resume and a compound-op-receiver (a better name is welcome) which sends one op after another and sends the compound fop response. List of compound fops identified so far: Swift/S3: PUT: creat(), write()s, setxattr(), fsync(), close(), rename() Dht: mkdir + inodelk Afr: xattrop+writev, xattrop+unlock to begin with. Could everyone who needs compound fops add to this list? I see that Niels is back on 14th. Does anyone else know the list of compound fops he has in mind? From the discussions we had with Niels regarding the kerberos support on GlusterFS, I think below is the set of compound fops which are required.
set_uid + set_gid + set_lkowner (or kerberos principal name) + actual_fop Also gfapi does a lookup (first time/to refresh the inode) before performing the actual fop most of the time. It may really help if we can club such fops - LOOKUP + FOP (OPEN etc.) Coming to the design proposed, I agree with Shyam, Ira and Jeff's thoughts. Defining different compound fops for each specific set of operations and wrapping up those arguments in xdata seems rather complex and difficult to maintain going further. Having worked with NFS, may I suggest we follow (or do something along the lines of) the approach taken by the NFS protocol to define and implement compound procedures. The basic structure of the NFS COMPOUND procedure is:

+-----+--------------+--------+-----------+-----------+-----------+
| tag | minorversion | numops | op + args | op + args | op + args |
+-----+--------------+--------+-----------+-----------+-----------+

and the reply's structure is:

+-------------+-----+--------+-----------------------+
| last status | tag | numres | status + op + results |
+-------------+-----+--------+-----------------------+

Each compound procedure will con
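The request and reply diagrams above map naturally onto tagged-union C structures. The sketch below is illustrative only: the field sizes, the op set, and the fixed ops[] array are assumptions, not the actual NFS XDR or any Gluster RPC definition.

```c
/* Rough C shape of an NFS-style COMPOUND request and reply.
 * Illustrative assumptions throughout; not real XDR definitions. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { COP_LOOKUP, COP_OPEN, COP_WRITE, COP_CLOSE } cop_type_t;

typedef struct {
    cop_type_t op;
    union {                                   /* one "op + args" cell */
        struct { char name[256]; }             lookup;
        struct { uint32_t flags; }             open;
        struct { uint64_t off; uint32_t len; } write;
    } args;
} compound_op_t;

typedef struct {              /* | tag | minorversion | numops | op+args ... | */
    char          tag[64];
    uint32_t      minorversion;
    uint32_t      numops;
    compound_op_t ops[8];
} compound_req_t;

typedef struct {              /* | last status | tag | numres | status+op+results ... | */
    int32_t  last_status;
    char     tag[64];
    uint32_t numres;
} compound_rep_t;

/* Build a LOOKUP + OPEN + WRITE + CLOSE example as one request. */
uint32_t build_put_like_req(compound_req_t *req)
{
    memset(req, 0, sizeof *req);
    strcpy(req->tag, "open-write-close");
    req->minorversion = 0;
    req->ops[0].op = COP_LOOKUP;
    strcpy(req->ops[0].args.lookup.name, "foo");
    req->ops[1].op = COP_OPEN;
    req->ops[1].args.open.flags = 2;          /* O_RDWR */
    req->ops[2].op = COP_WRITE;
    req->ops[2].args.write.len = 3;
    req->ops[3].op = COP_CLOSE;
    req->numops = 4;
    return req->numops;
}
```

The per-op status array implied by the reply structure is what lets a server stop at the first failing op and still report how far it got.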
Re: [Gluster-devel] compound fop design first cut
On 12/09/2015 06:37 AM, Vijay Bellur wrote: On 12/08/2015 03:45 PM, Jeff Darcy wrote: On December 8, 2015 at 12:53:04 PM, Ira Cooper (i...@redhat.com) wrote: Raghavendra Gowdappa writes: I propose that we define a "compound op" that contains ops. Within each op, there are fields that can be "inherited" from the previous op, via use of a sentinel value. Sentinel is -1, for all of these examples. So: LOOKUP (1, "foo") (Sets the gfid value to be picked up by compounding, 1 is the root directory, as a gfid, by convention.) OPEN(-1, O_RDWR) (Uses the gfid value, sets the glfd compound value.) WRITE(-1, "foo", 3) (Uses the glfd compound value.) CLOSE(-1) (Uses the glfd compound value) So, basically, what the programming-language types would call futures and promises. It’s a good and well studied concept, which is necessary to solve the second-order problem of how to specify an argument in sub-operation N+1 that’s not known until sub-operation N completes. To be honest, some of the highly general approaches suggested here scare me too. Wrapping up the arguments for one sub-operation in xdata for another would get pretty hairy if we ever try to go beyond two sub-operations and have to nest sub-operation #3’s args within sub-operation #2’s xdata which is itself encoded within sub-operation #1’s xdata. There’s also not much clarity about how to handle errors in that model. Encoding N sub-operations’ arguments in a linear structure as Shyam proposes seems a bit cleaner that way. If I were to continue down that route I’d suggest just having start_compound and end-compound fops, plus an extra field (or by-convention xdata key) that either the client-side or server-side translator could use to build whatever structure it wants and schedule sub-operations however it wants. 
However, I’d be even more comfortable with an even simpler approach that avoids the need to solve what the database folks (who have dealt with complex transactions for years) would tell us is a really hard problem. Instead of designing for every case we can imagine, let’s design for the cases that we know would be useful for improving performance. Open plus read/write plus close is an obvious one. Raghavendra mentions create+inodelk as well. For each of those, we can easily define a structure that contains the necessary fields, we don’t need a client-side translator, and the server-side translator can take care of “forwarding” results from one sub-operation to the next. We could even use GF_FOP_IPC to prototype this. If we later find that the number of “one-off” compound requests is growing too large, then at least we’ll have some experience to guide our design of a more general alternative. Right now, I think we’re trying to look further ahead than we can see clearly. Yes, agreed. This makes implementation on the client side simpler as well. So it is welcome. Just updating the solution. 1) New RPCs are going to be implemented. 2) The client stack will use these new fops. 3) On the server side we have the server xlator implementing these new fops to decode the RPC request, then resolve_resume and a compound-op-receiver (a better name is welcome) which sends one op after another and sends the compound fop response. List of compound fops identified so far: Swift/S3: PUT: creat(), write()s, setxattr(), fsync(), close(), rename() Dht: mkdir + inodelk Afr: xattrop+writev, xattrop+unlock to begin with. Could everyone who needs compound fops add to this list? I see that Niels is back on 14th. Does anyone else know the list of compound fops he has in mind? Pranith. Starting with a well-defined set of operations for compounding has its advantages. It would be easier to understand and maintain correctness across the stack.
Some of our translators perform transactions & create/update internal metadata for certain fops. It would be easier for such translators if the compound operations are well defined and do not entail deep introspection of a generic representation to ensure that the right behavior gets reflected at the end of a compound operation. -Vijay
Re: [Gluster-devel] compound fop design first cut
On 12/09/2015 10:39 AM, Prashanth Pai wrote: However, I’d be even more comfortable with an even simpler approach that avoids the need to solve what the database folks (who have dealt with complex transactions for years) would tell us is a really hard problem. Instead of designing for every case we can imagine, let’s design for the cases that we know would be useful for improving performance. Open plus read/write plus close is an obvious one. Raghavendra mentions create+inodelk as well. From object interface (Swift/S3) perspective, this is the fop order and flow for object operations: GET: open(), fstat(), fgetxattr()s, read()s, close() Krutika implemented fstat+fgetxattr (http://review.gluster.org/10180). In posix there is an implementation of GF_CONTENT_KEY which is used to read a file in lookup by quick-read. This needs to be exposed for fds as well, I think, so you can do all this using fstat on an anon-fd. HEAD: stat(), getxattr()s Krutika already implemented this for sharding: http://review.gluster.org/10158. You can do this using the stat fop. PUT: creat(), write()s, setxattr(), fsync(), close(), rename() This I think should be a new compound fop. Nothing similar exists. DELETE: getxattr(), unlink() This can also be clubbed into unlink because xdata already exists on the wire. Compounding some of these ops and exposing them as consumable libgfapi APIs like glfs_get() and glfs_put(), similar to librados compound APIs[1], would greatly improve performance for object-based access. [1]: https://github.com/ceph/ceph/blob/master/src/include/rados/librados.h#L2219 Thanks. - Prashanth Pai
Re: [Gluster-devel] compound fop design first cut
> However, I’d be even more comfortable with an even simpler approach that avoids the need to solve what the database folks (who have dealt with complex transactions for years) would tell us is a really hard problem. Instead of designing for every case we can imagine, let’s design for the cases that we know would be useful for improving performance. Open plus read/write plus close is an obvious one. Raghavendra mentions create+inodelk as well.
From the object interface (Swift/S3) perspective, this is the fop order and flow for object operations:
GET: open(), fstat(), fgetxattr()s, read()s, close()
HEAD: stat(), getxattr()s
PUT: creat(), write()s, setxattr(), fsync(), close(), rename()
DELETE: getxattr(), unlink()
Compounding some of these ops and exposing them as consumable libgfapi APIs like glfs_get() and glfs_put(), similar to the librados compound APIs[1], would greatly improve performance for object-based access.
[1]: https://github.com/ceph/ceph/blob/master/src/include/rados/librados.h#L2219
Thanks. - Prashanth Pai
Re: [Gluster-devel] compound fop design first cut
On 12/08/2015 03:45 PM, Jeff Darcy wrote:
On December 8, 2015 at 12:53:04 PM, Ira Cooper (i...@redhat.com) wrote:
> Raghavendra Gowdappa writes:
> I propose that we define a "compound op" that contains ops. Within each op, there are fields that can be "inherited" from the previous op, via use of a sentinel value. Sentinel is -1, for all of these examples. So:
> LOOKUP (1, "foo") (Sets the gfid value to be picked up by compounding; 1 is the root directory, as a gfid, by convention.)
> OPEN(-1, O_RDWR) (Uses the gfid value, sets the glfd compound value.)
> WRITE(-1, "foo", 3) (Uses the glfd compound value.)
> CLOSE(-1) (Uses the glfd compound value.)
So, basically, what the programming-language types would call futures and promises. It’s a good and well-studied concept, which is necessary to solve the second-order problem of how to specify an argument in sub-operation N+1 that’s not known until sub-operation N completes. To be honest, some of the highly general approaches suggested here scare me too. Wrapping up the arguments for one sub-operation in the xdata for another would get pretty hairy if we ever try to go beyond two sub-operations and have to nest sub-operation #3’s args within sub-operation #2’s xdata, which is itself encoded within sub-operation #1’s xdata. There’s also not much clarity about how to handle errors in that model. Encoding N sub-operations’ arguments in a linear structure, as Shyam proposes, seems a bit cleaner that way. If I were to continue down that route, I’d suggest just having start_compound and end_compound fops, plus an extra field (or by-convention xdata key) that either the client-side or server-side translator could use to build whatever structure it wants and schedule sub-operations however it wants. However, I’d be even more comfortable with an even simpler approach that avoids the need to solve what the database folks (who have dealt with complex transactions for years) would tell us is a really hard problem.
Instead of designing for every case we can imagine, let’s design for the cases that we know would be useful for improving performance. Open plus read/write plus close is an obvious one. Raghavendra mentions create+inodelk as well. For each of those, we can easily define a structure that contains the necessary fields, we don’t need a client-side translator, and the server-side translator can take care of “forwarding” results from one sub-operation to the next. We could even use GF_FOP_IPC to prototype this. If we later find that the number of “one-off” compound requests is growing too large, then at least we’ll have some experience to guide our design of a more general alternative. Right now, I think we’re trying to look further ahead than we can see clearly.
Starting with a well-defined set of operations for compounding has its advantages. It would be easier to understand and maintain correctness across the stack. Some of our translators perform transactions and create/update internal metadata for certain fops. It would be easier for such translators if the compound operations are well defined and do not entail deep introspection of a generic representation to ensure that the right behavior is reflected at the end of a compound operation. -Vijay
Re: [Gluster-devel] compound fop design first cut
On December 8, 2015 at 12:53:04 PM, Ira Cooper (i...@redhat.com) wrote:
> Raghavendra Gowdappa writes:
> I propose that we define a "compound op" that contains ops.
>
> Within each op, there are fields that can be "inherited" from the previous op, via use of a sentinel value.
>
> Sentinel is -1, for all of these examples. So:
>
> LOOKUP (1, "foo") (Sets the gfid value to be picked up by compounding; 1 is the root directory, as a gfid, by convention.)
> OPEN(-1, O_RDWR) (Uses the gfid value, sets the glfd compound value.)
> WRITE(-1, "foo", 3) (Uses the glfd compound value.)
> CLOSE(-1) (Uses the glfd compound value.)
So, basically, what the programming-language types would call futures and promises. It’s a good and well-studied concept, which is necessary to solve the second-order problem of how to specify an argument in sub-operation N+1 that’s not known until sub-operation N completes. To be honest, some of the highly general approaches suggested here scare me too. Wrapping up the arguments for one sub-operation in the xdata for another would get pretty hairy if we ever try to go beyond two sub-operations and have to nest sub-operation #3’s args within sub-operation #2’s xdata, which is itself encoded within sub-operation #1’s xdata. There’s also not much clarity about how to handle errors in that model. Encoding N sub-operations’ arguments in a linear structure, as Shyam proposes, seems a bit cleaner that way. If I were to continue down that route, I’d suggest just having start_compound and end_compound fops, plus an extra field (or by-convention xdata key) that either the client-side or server-side translator could use to build whatever structure it wants and schedule sub-operations however it wants. However, I’d be even more comfortable with an even simpler approach that avoids the need to solve what the database folks (who have dealt with complex transactions for years) would tell us is a really hard problem.
Instead of designing for every case we can imagine, let’s design for the cases that we know would be useful for improving performance. Open plus read/write plus close is an obvious one. Raghavendra mentions create+inodelk as well. For each of those, we can easily define a structure that contains the necessary fields, we don’t need a client-side translator, and the server-side translator can take care of “forwarding” results from one sub-operation to the next. We could even use GF_FOP_IPC to prototype this. If we later find that the number of “one-off” compound requests is growing too large, then at least we’ll have some experience to guide our design of a more general alternative. Right now, I think we’re trying to look further ahead than we can see clearly.
Re: [Gluster-devel] compound fop design first cut
Raghavendra Gowdappa writes:
> From what I can see, new compound ops will _evolve_ in future based on requirements unseen as of now.
Yes, that is the one thing you can count on here ;) The compounding architecture proposed here scares me, to be honest. The complexity it can create is pretty immense. I'm going to suggest a simpler scheme to you. There is no API provided, but I think you'll see the idea, which is the key; APIs and whatnot can follow. In the proposal today, if I want to compound op A and op B, I have to write compound_A_and_B, basically. That will create headaches for anyone who comes up with neat ideas :( or needs to do longer and longer compounds.
I propose that we define a "compound op" that contains ops. Within each op, there are fields that can be "inherited" from the previous op, via use of a sentinel value. Sentinel is -1, for all of these examples. So:
LOOKUP (1, "foo") (Sets the gfid value to be picked up by compounding; 1 is the root directory, as a gfid, by convention.)
OPEN(-1, O_RDWR) (Uses the gfid value, sets the glfd compound value.)
WRITE(-1, "foo", 3) (Uses the glfd compound value.)
CLOSE(-1) (Uses the glfd compound value.)
Note that we can define which fields can take "sentinel" values, so things like leases, locks, etc. can all be handled properly. The other trick is, if we return an error, we MUST stop the compound, and return the rest of the results as ECOMPOUND or some similar value. The actual op that errored should return proper error codes. Now, the cute thing about this is that a translator can look at a compound stream, element by element, and decide what to do with it, or whether it needs to break the stream and handle the compound semantics itself. So this actually fits well with gluster's architecture of being very composable :). I'm interested in your thoughts on where the edges of this proposal may be, and whether it meets your needs.
Thanks, -Ira
Re: [Gluster-devel] compound fop design first cut
On 12/08/2015 09:02 AM, Pranith Kumar Karampuri wrote:
On 12/08/2015 02:53 AM, Shyam wrote:
> Hi,
> Why not think along the lines of new FOPs like fop_compound(_cbk), where the inargs to this FOP are a list of FOPs to execute (either in order or in any order)?
That is the intent. The question is how we specify the fops that we want to do and the arguments to the fop. In this approach, for example, xl_fxattrop_writev() is a new FOP. The fops that need to be done are fxattrop and writev, in that order, and the arguments are a union of the arguments needed to perform fxattrop and writev. The reason why this fop is not implemented throughout the graph is to avoid changing most of the stack on the brick side in the first cut of the implementation; i.e. quota/barrier/geo-rep/io-threads priorities/bit-rot may have to implement these new compound fops. We still get the benefit of avoiding the network round trips.
> With a scheme like the above we could,
> - compound any set of FOPs (of course, we need to take care here, but still the feasibility exists)
It still exists, but the fop space will be blown up for each combination.
> - Each xlator can inspect the compound relation and choose to uncompound them. So if an xlator cannot perform FOPA+B as a single compound FOP, it can choose to send FOPA and then FOPB and chain up the responses back to the compound request sent to it. Also, the intention here would be to leverage existing FOP code in any xlator, to appropriately modify the inargs
> - The RPC payload is constructed based on existing FOP RPC definitions, but compounded based on the compound FOP RPC definition
This will be done in phase-3, after learning a bit more about how best to implement it, to prevent stuffing arguments into xdata in future as much as possible. After that we can choose to retire the compound-fop-sender and receiver xlators.
> Possibly on the brick graph as well, pass these down as compounded FOPs, till someone decides to break it open and do it in phases (ultimately the POSIX xlator).
This will be done in phase-2. At the moment we are not giving any choice to the xlators on the brick side.
> The intention would be to break a compound FOP in case an xlator in between cannot support it, or even expand a compound FOP request; say the fxattropAndWrite is an AFR compounding decision, but the compound request to AFR may be WriteandClose, hence AFR needs to extend this compound request.
Yes. There was a discussion with Krutika where, if shard wants to do a write and then an xattrop in a single fop, then we need dht to implement dht_writev_fxattrop (which should look somewhat similar to dht_writev()), and afr will need to implement afr_writev_fxattrop() as a full-blown transaction where it takes data+metadata domain locks, does the data+metadata pre-op, winds to compound_fop_sender_writev_fxattrop(), then does the data+metadata post-op and unlocks.
If we were to do writev and fxattrop separately, the fops would be (in the unoptimized case):
1) finodelk for write
2) fxattrop for the pre-op of write
3) write
4) fxattrop for the post-op of write
5) unlock for write
6) finodelk for fxattrop
7) fxattrop for the pre-op of shard-fxattrop
8) shard-fxattrop
9) fxattrop for the post-op of shard-fxattrop
10) unlock for fxattrop
If AFR chooses to implement writev_fxattrop, that means a data+metadata transaction:
1) finodelk in the data and metadata domains simultaneously (just like we take multiple locks in rename)
2) pre-op for the data and metadata parts as part of the compound fop
3) writev+fxattrop
4) post-op for the data and metadata parts as part of the compound fop
5) unlocks simultaneously
So it is still a 2x reduction in the number of network fops, except maybe for locking.
> The above is just an off-the-cuff thought on the same.
We need to arrive at a consensus about how to specify the list of fops and their arguments.
The reason why I went against list_of_fops is to make discovery of possible optimizations easier per compound fop (inspired by ec's implementation of multiplication by all possible elements in the Galois field, where multiplication with a different number has a different optimization). Could you elaborate more on the idea you have about list_of_fops and its arguments? Maybe we can come up with combinations of fops where we can employ this technique of just list_of_fops and wind. I think the rest of the solutions you mentioned are where this will converge over time. The intention is to avoid network round trips without waiting for the whole stack to change, as much as possible. Maybe I am overthinking it. Not a lot of combinations could be transactions. In any case do let me know what you have in mind. Pranith
> The scheme below seems too specific to my eyes, and looks like we would be defining specific compound FOPs rather than the ability to have generic ones.
On 12/07/2015 04:08 AM, Pranith Kumar Karampuri wrote:
> hi, Draft of the design doc: The main motivation for the design of this feature is
Re: [Gluster-devel] compound fop design first cut
On 12/08/2015 02:53 AM, Shyam wrote:
> Hi,
> Why not think along the lines of new FOPs like fop_compound(_cbk), where the inargs to this FOP are a list of FOPs to execute (either in order or in any order)?
That is the intent. The question is how we specify the fops that we want to do and the arguments to the fop. In this approach, for example, xl_fxattrop_writev() is a new FOP. The fops that need to be done are fxattrop and writev, in that order, and the arguments are a union of the arguments needed to perform fxattrop and writev. The reason why this fop is not implemented throughout the graph is to avoid changing most of the stack on the brick side in the first cut of the implementation; i.e. quota/barrier/geo-rep/io-threads priorities/bit-rot may have to implement these new compound fops. We still get the benefit of avoiding the network round trips.
> With a scheme like the above we could,
> - compound any set of FOPs (of course, we need to take care here, but still the feasibility exists)
It still exists, but the fop space will be blown up for each combination.
> - Each xlator can inspect the compound relation and choose to uncompound them. So if an xlator cannot perform FOPA+B as a single compound FOP, it can choose to send FOPA and then FOPB and chain up the responses back to the compound request sent to it. Also, the intention here would be to leverage existing FOP code in any xlator, to appropriately modify the inargs
> - The RPC payload is constructed based on existing FOP RPC definitions, but compounded based on the compound FOP RPC definition
This will be done in phase-3, after learning a bit more about how best to implement it, to prevent stuffing arguments into xdata in future as much as possible. After that we can choose to retire the compound-fop-sender and receiver xlators.
> Possibly on the brick graph as well, pass these down as compounded FOPs, till someone decides to break it open and do it in phases (ultimately the POSIX xlator).
This will be done in phase-2.
At the moment we are not giving any choice to the xlators on the brick side.
> The intention would be to break a compound FOP in case an xlator in between cannot support it, or even expand a compound FOP request; say the fxattropAndWrite is an AFR compounding decision, but the compound request to AFR may be WriteandClose, hence AFR needs to extend this compound request.
Yes. There was a discussion with Krutika where, if shard wants to do a write and then an xattrop in a single fop, then we need dht to implement dht_writev_fxattrop (which should look somewhat similar to dht_writev()), and afr will need to implement afr_writev_fxattrop() as a full-blown transaction where it takes data+metadata domain locks, does the data+metadata pre-op, winds to compound_fop_sender_writev_fxattrop(), then does the data+metadata post-op and unlocks.
If we were to do writev and fxattrop separately, the fops would be (in the unoptimized case):
1) finodelk for write
2) fxattrop for the pre-op of write
3) write
4) fxattrop for the post-op of write
5) unlock for write
6) finodelk for fxattrop
7) fxattrop for the pre-op of shard-fxattrop
8) shard-fxattrop
9) fxattrop for the post-op of shard-fxattrop
10) unlock for fxattrop
If AFR chooses to implement writev_fxattrop, that means a data+metadata transaction:
1) finodelk in the data and metadata domains simultaneously (just like we take multiple locks in rename)
2) pre-op for the data and metadata parts as part of the compound fop
3) writev+fxattrop
4) post-op for the data and metadata parts as part of the compound fop
5) unlocks simultaneously
So it is still a 2x reduction in the number of network fops, except maybe for locking.
> The above is just an off-the-cuff thought on the same.
We need to arrive at a consensus about how to specify the list of fops and their arguments.
The reason why I went against list_of_fops is to make discovery of possible optimizations easier per compound fop (inspired by ec's implementation of multiplication by all possible elements in the Galois field, where multiplication with a different number has a different optimization). Could you elaborate more on the idea you have about list_of_fops and its arguments? Maybe we can come up with combinations of fops where we can employ this technique of just list_of_fops and wind. I think the rest of the solutions you mentioned are where this will converge over time. The intention is to avoid network round trips without waiting for the whole stack to change, as much as possible. Pranith
> The scheme below seems too specific to my eyes, and looks like we would be defining specific compound FOPs rather than the ability to have generic ones.
On 12/07/2015 04:08 AM, Pranith Kumar Karampuri wrote:
> hi, Draft of the design doc: The main motivation for the design of this feature is to reduce network round trips by sending more than one fop in a network operation, preferably without introducing new rpcs. There are 2 new xlators, compound-fop-sender and compound-fop-receiver.
Re: [Gluster-devel] compound fop design first cut
Hi,
Why not think along the lines of new FOPs like fop_compound(_cbk), where the inargs to this FOP are a list of FOPs to execute (either in order or in any order)? With a scheme like the above we could,
- compound any set of FOPs (of course, we need to take care here, but still the feasibility exists)
- Each xlator can inspect the compound relation and choose to uncompound them. So if an xlator cannot perform FOPA+B as a single compound FOP, it can choose to send FOPA and then FOPB and chain up the responses back to the compound request sent to it. Also, the intention here would be to leverage existing FOP code in any xlator, to appropriately modify the inargs
- The RPC payload is constructed based on existing FOP RPC definitions, but compounded based on the compound FOP RPC definition
Possibly on the brick graph as well, pass these down as compounded FOPs, till someone decides to break it open and do it in phases (ultimately the POSIX xlator). The intention would be to break a compound FOP in case an xlator in between cannot support it, or even expand a compound FOP request; say the fxattropAndWrite is an AFR compounding decision, but the compound request to AFR may be WriteandClose, hence AFR needs to extend this compound request. The above is just an off-the-cuff thought on the same. The scheme below seems too specific to my eyes, and looks like we would be defining specific compound FOPs rather than the ability to have generic ones.
On 12/07/2015 04:08 AM, Pranith Kumar Karampuri wrote:
> hi, Draft of the design doc: The main motivation for the design of this feature is to reduce network round trips by sending more than one fop in a network operation, preferably without introducing new rpcs. There are 2 new xlators, compound-fop-sender and compound-fop-receiver. compound-fop-sender is going to be loaded on top of each client xlator on the mount/client, and compound-fop-receiver is going to be loaded below the server xlator on the bricks.
On the mount/client side, from the caller xlator till the compound-fop-encoder xlator, the xlators can choose to implement this extra compound fop handling. Once it reaches "compound-fop-sender", it will try to choose a base fop on which it encodes the other fop in the base fop's xdata, and winds the base fop to the client xlator. The client xlator sends the base fop with the encoded xdata to the server xlator on the brick using the rpc of the base fop. Once the server xlator does resolve_and_resume(), it winds the base fop to the compound-fop-receiver xlator. This xlator decodes the extra fop from the xdata of the base fop. Based on the order encoded in the xdata, it executes the separate fops one after the other and stores the cbk response arguments of both operations. It then encodes the response of the extra fop onto the base fop's response xdata and unwinds the fop to the server xlator, which sends the response using the base rpc's response structure. The client xlator unwinds the base fop to compound-fop-sender, which decodes the response into the compound fop's response arguments and unwinds to the parent xlators. I will take the fxattrop+write operation that we want to implement in afr as an example to explain how things may look.

compound_fop_sender_fxattrop_write(call_frame_t *frame, xlator_t *this, fd_t *fd, gf_xattrop_flags_t xattrop_flags, dict_t *fxattrop_dict, dict_t *fxattrop_xdata, struct iovec *vector, int32_t count, off_t off, uint32_t flags, struct iobref *iobref, dict_t *writev_xdata)
{
    0) Remember the compound fop takes write() as the base fop. In writev_xdata, add the following key,value pairs:
    1) "xattrop-flags", xattrop_flags
    2) for each fxattrop_dict key -> "fxattrop-dict-" + key, value
    3) for each fxattrop_xdata key -> "fxattrop-xdata-" + key, value
    4) "order" -> "fxattrop, writev"
    5) "compound-fops" -> "fxattrop"
    6) Wind writev()
}

compound_fop_sender_fxattrop_write_cbk(...)
{
    /* decode the response args and call parent_fxattrop_write_cbk */
}

_fxattrop_write_cbk(call_frame_t *frame, void *cookie, xlator_t *this, int32_t fxattrop_op_ret, int32_t fxattrop_op_errno, dict_t *fxattrop_dict, dict_t *fxattrop_xdata, int32_t writev_op_ret, int32_t writev_op_errno, struct iatt *writev_prebuf, struct iatt *writev_postbuf, dict_t *writev_xdata)
{
    /* ... */
}

compound_fop_receiver_writev(call_frame_t *frame, xlator_t *this, fd_t *fd, struct iovec *vector, int32_t count, off_t off, uint32_t flags, struct iobref *iobref, dict_t *writev_xdata)
{
[Gluster-devel] compound fop design first cut
hi,
Draft of the design doc:
The main motivation for the design of this feature is to reduce network round trips by sending more than one fop in a network operation, preferably without introducing new rpcs. There are 2 new xlators, compound-fop-sender and compound-fop-receiver. compound-fop-sender is going to be loaded on top of each client xlator on the mount/client, and compound-fop-receiver is going to be loaded below the server xlator on the bricks. On the mount/client side, from the caller xlator till the compound-fop-encoder xlator, the xlators can choose to implement this extra compound fop handling. Once it reaches "compound-fop-sender", it will try to choose a base fop on which it encodes the other fop in the base fop's xdata, and winds the base fop to the client xlator. The client xlator sends the base fop with the encoded xdata to the server xlator on the brick using the rpc of the base fop. Once the server xlator does resolve_and_resume(), it winds the base fop to the compound-fop-receiver xlator. This xlator decodes the extra fop from the xdata of the base fop. Based on the order encoded in the xdata, it executes the separate fops one after the other and stores the cbk response arguments of both operations. It then encodes the response of the extra fop onto the base fop's response xdata and unwinds the fop to the server xlator, which sends the response using the base rpc's response structure. The client xlator unwinds the base fop to compound-fop-sender, which decodes the response into the compound fop's response arguments and unwinds to the parent xlators. I will take the fxattrop+write operation that we want to implement in afr as an example to explain how things may look.
compound_fop_sender_fxattrop_write(call_frame_t *frame, xlator_t *this, fd_t *fd, gf_xattrop_flags_t xattrop_flags, dict_t *fxattrop_dict, dict_t *fxattrop_xdata, struct iovec *vector, int32_t count, off_t off, uint32_t flags, struct iobref *iobref, dict_t *writev_xdata)
{
    0) Remember the compound fop takes write() as the base fop. In writev_xdata, add the following key,value pairs:
    1) "xattrop-flags", xattrop_flags
    2) for each fxattrop_dict key -> "fxattrop-dict-" + key, value
    3) for each fxattrop_xdata key -> "fxattrop-xdata-" + key, value
    4) "order" -> "fxattrop, writev"
    5) "compound-fops" -> "fxattrop"
    6) Wind writev()
}

compound_fop_sender_fxattrop_write_cbk(...)
{
    /* decode the response args and call parent_fxattrop_write_cbk */
}

_fxattrop_write_cbk(call_frame_t *frame, void *cookie, xlator_t *this, int32_t fxattrop_op_ret, int32_t fxattrop_op_errno, dict_t *fxattrop_dict, dict_t *fxattrop_xdata, int32_t writev_op_ret, int32_t writev_op_errno, struct iatt *writev_prebuf, struct iatt *writev_postbuf, dict_t *writev_xdata)
{
    /* ... */
}

compound_fop_receiver_writev(call_frame_t *frame, xlator_t *this, fd_t *fd, struct iovec *vector, int32_t count, off_t off, uint32_t flags, struct iobref *iobref, dict_t *writev_xdata)
{
    0) Check if writev_xdata has "compound-fop"; else default_writev()
    1) decode writev_xdata from the above encoding -> xattrop_flags, fxattrop_dict, fxattrop_xdata
    2) get "order"
    3) Store all of the above in 'local'
    4) wind fxattrop() with compound_receiver_fxattrop_cbk_writev_wind() as the cbk
}

compound_receiver_fxattrop_cbk_writev_wind(call_frame_t *frame, void *cookie, xlator_t *this, int32_t op_ret, int32_t op_errno, dict_t *dict, dict_t *xdata)
{
    0) store the fxattrop cbk args
    1) perform writev() with the writev params, with compound_receiver_writev_cbk() as the cbk
}

compound_writev_cbk(call_frame_t *frame, void *cookie, xlator_t *this, int32_t op_ret, int32_t op_errno, struct iatt *prebuf, struct iatt *postbuf, dict_t *xdata)
{
    0) store the writev cbk args
    1) Encode the fxattrop response into
writev_xdata, with an encoding similar to that in compound_fop_sender_fxattrop_write()
    2) unwind writev()
}

This example is just to show how things may look, but the actual implementation may just have all base fops calling a common function to perform the operations in the order given in the receiver xlator. Yet to think about that. It is probably better to encode the fop number from glusterfs_fop_t rather than the fop string in the dictionary. This is phase-1 of the change because we don't wa