Re: [Gluster-devel] compound fop design first cut

2016-01-06 Thread Pranith Kumar Karampuri



On 01/06/2016 07:50 PM, Jeff Darcy wrote:

1) fops will be compounded per inode, meaning 2 fops on different
inodes can't be compounded (Not because of the design, Just reducing
scope of the problem).

2) Each xlator that wants a compound fop packs the arguments by
itself.

Packed how?  Are we talking about XDR here, or something else?  How is
dict_t handled?  Will there be generic packing/unpacking code somewhere,
or is each translator expected to do this manually?


Packed as mentioned in step 4 below. Common functions will be provided 
that fill an array cell with the information given for that fop. In 
addition to that, there will be filling functions for each of the 
compound fops listed at: 
https://public.pad.fsfe.org/p/glusterfs-compound-fops. The XDR should be 
similar to what Soumya suggested in earlier mails, as in NFS.





3) On the server side a de-compounder placed below server xlator
unpacks the arguments and does the necessary operations.

4) Arguments for compound fops will be passed as array of union of
structures where each structure is associated with a fop.

5) Each xlator will have _compound_fop () which receives the
fop and does additional processing that is required for itself.

What happens when (not if) some translator fails to provide this?  Is
there a default function?  Is there something at the end of the chain
that will log an error if the fop gets that far without being handled
(as with GF_FOP_IPC)?


Yes, a default_fop will be provided, just like for other fops, which is 
a simple pass-through. Posix will log the error and unwind with -1, ENOTSUP.
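The default pass-through plus the posix-level ENOTSUP unwind could look roughly like this (a minimal sketch with made-up types; the real GlusterFS xlator/STACK_WIND machinery is more involved):

```c
#include <errno.h>
#include <stddef.h>

/* Illustrative dispatch chain: each xlator either handles the compound
 * fop or falls through to its child; if nothing below handles it (the
 * posix case in the mail), the call unwinds with -1, ENOTSUP. */
typedef struct xlator xlator_t;
struct xlator {
    int     (*compound_fop) (xlator_t *this, void *req);
    xlator_t *child;
};

/* Default implementation: a pure pass-through to the child xlator. */
static int
default_compound_fop (xlator_t *this, void *req)
{
    if (this->child)
        return this->child->compound_fop (this->child, req);
    errno = ENOTSUP;    /* bottom of the stack: nobody handled the fop */
    return -1;
}
```

A translator that wants extra processing would install its own function in place of the default and then wind to its child the same way.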





6) Response will also be an array of union of response structures
where each structure is associated with a fop's response.

What are the error semantics?  Does processing of a series always stop
at the first error, or are there some errors that allow retry/continue?
If/when processing stops, who's responsible for cleaning up state
changed by those parts that succeeded?  What happens if the connection
dies in the middle?


Yes, at the moment we are implementing stop-at-first-error semantics, as 
that seems to satisfy all the compound fops we listed at 
https://public.pad.fsfe.org/p/glusterfs-compound-fops. Each translator 
that wants to handle the compound fop should handle failures just as it 
does for a normal fop today.
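Stop-at-first-error in the de-compounder can be sketched as follows (illustrative types and names, not the actual implementation):

```c
#include <stddef.h>

/* Illustrative de-compounder loop: run sub-ops in order and stop at the
 * first failure, recording per-op errnos so the response array can
 * report how far execution got. */
typedef int (*sub_op_fn) (void *arg);

static int
decompound_execute (sub_op_fn *ops, void **args, int *op_errno, int count)
{
    int i, ret;

    for (i = 0; i < count; i++) {
        ret = ops[i] (args[i]);
        if (ret < 0) {
            op_errno[i] = -ret;   /* first failure: abort the rest */
            return i;             /* index of the failed sub-op */
        }
        op_errno[i] = 0;
    }
    return count;                 /* all sub-ops succeeded */
}

/* Stubs used by the test below. */
static int op_ok (void *a)  { (void) a; return 0; }
static int op_bad (void *a) { (void) a; return -5; }
```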




How are values returned from one operation in a series propagated as
arguments for the next?


They are not. In the first cut, the only dependency between two fops is 
whether the previous one succeeded or not. Just this much seems to work 
fine for the fops we are targeting for now 
(https://public.pad.fsfe.org/p/glusterfs-compound-fops); we may have to 
enhance it later based on what comes up.




What are the implications for buffer and message sizes?  What are the
limits on how large these can get, and/or how many operations can be
compounded?


It depends on the limits imposed by the rpc layer. If the request can't 
be sent, the fop will fail. If the request can be sent but the response 
is too big to send back, I think the fop will fail with a frame timeout 
waiting for the response. Either way it is a failure. For the fops 
listed at https://public.pad.fsfe.org/p/glusterfs-compound-fops this 
doesn't seem to be a problem at the moment.




How is synchronization handled?  Is the inode locked for the duration of
the compound operation, to prevent other operations from changing the
context in which later parts of the compound operation execute?  Are
there possibilities for deadlock here?  Alternatively, if no locking is
done, are we going to document the fact that compound operations are not
atomic/linearizable?


Since we are limiting the scope to single-inode fops, locking should 
suffice. EC doesn't have a problem, as it uses one lock for both the 
data/entry and metadata domains. In afr we need to come up with a 
locking order for the metadata and data domains, similar to what we do 
in rename, where we take multiple locks.
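A fixed acquisition order across the two lock domains is enough to rule out deadlock between two compound fops on the same inode; a minimal sketch (pthread mutexes stand in for the real domain locks, and the names are invented):

```c
#include <pthread.h>

/* Illustrative: the two lock domains afr would need to order. */
typedef struct {
    pthread_mutex_t data;       /* data/entry domain */
    pthread_mutex_t metadata;   /* metadata domain */
} inode_domains_t;

/* Every caller acquires in the same fixed order: data first, then
 * metadata. Two compound fops contending on the same inode can then
 * never hold one domain each while waiting for the other. */
static void
lock_both_domains (inode_domains_t *d)
{
    pthread_mutex_lock (&d->data);
    pthread_mutex_lock (&d->metadata);
}

static void
unlock_both_domains (inode_domains_t *d)
{
    pthread_mutex_unlock (&d->metadata);
    pthread_mutex_unlock (&d->data);
}
```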


Pranith

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel




Re: [Gluster-devel] compound fop design first cut

2016-01-06 Thread Jeff Darcy
> 1) fops will be compounded per inode, meaning 2 fops on different
> inodes can't be compounded (Not because of the design, Just reducing
> scope of the problem).
>
> 2) Each xlator that wants a compound fop packs the arguments by
> itself.

Packed how?  Are we talking about XDR here, or something else?  How is
dict_t handled?  Will there be generic packing/unpacking code somewhere,
or is each translator expected to do this manually?

> 3) On the server side a de-compounder placed below server xlator
> unpacks the arguments and does the necessary operations.
>
> 4) Arguments for compound fops will be passed as array of union of
> structures where each structure is associated with a fop.
>
> 5) Each xlator will have _compound_fop () which receives the
> fop and does additional processing that is required for itself.

What happens when (not if) some translator fails to provide this?  Is
there a default function?  Is there something at the end of the chain
that will log an error if the fop gets that far without being handled
(as with GF_FOP_IPC)?

> 6) Response will also be an array of union of response structures
> where each structure is associated with a fop's response.

What are the error semantics?  Does processing of a series always stop
at the first error, or are there some errors that allow retry/continue?
If/when processing stops, who's responsible for cleaning up state
changed by those parts that succeeded?  What happens if the connection
dies in the middle?

How are values returned from one operation in a series propagated as
arguments for the next?

What are the implications for buffer and message sizes?  What are the
limits on how large these can get, and/or how many operations can be
compounded?

How is synchronization handled?  Is the inode locked for the duration of
the compound operation, to prevent other operations from changing the
context in which later parts of the compound operation execute?  Are
there possibilities for deadlock here?  Alternatively, if no locking is
done, are we going to document the fact that compound operations are not
atomic/linearizable?
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] compound fop design first cut

2016-01-06 Thread Anuradha Talur
Hi,

After discussions with Pranith and Soumya, here is the design for compound fops:

1) fops will be compounded per inode, meaning 2 fops on different inodes can't 
be compounded (not because of the design; just reducing the scope of the problem).
2) Each xlator that wants a compound fop packs the arguments by itself.
3) On the server side a de-compounder placed below server xlator unpacks the 
arguments and does the necessary operations.
4) Arguments for compound fops will be passed as array of union of structures 
where each structure is associated with a fop.
5) Each xlator will have _compound_fop () which receives the fop and 
does additional processing that is required for itself.
6) Response will also be an array of union of response structures where each 
structure is associated with a fop's response.

Comments welcome!
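Point 4 could be laid out roughly as follows (a sketch with invented field names, not the actual GlusterFS structures):

```c
#include <stddef.h>
#include <sys/types.h>

/* Sketch only: invented argument structures, one per fop. */
typedef struct { off_t offset; size_t size; } compound_writev_args_t;
typedef struct { int cmd; short l_type;     } compound_inodelk_args_t;

typedef enum { COMPOUND_WRITEV, COMPOUND_INODELK } compound_fop_kind_t;

/* One cell of the argument array: a tag plus the union of per-fop args. */
typedef struct {
    compound_fop_kind_t fop;
    union {
        compound_writev_args_t  writev;
        compound_inodelk_args_t inodelk;
    } args;
} compound_arg_t;

/* A "filling" helper of the kind mentioned in the thread, one per fop. */
static void
compound_fill_writev (compound_arg_t *cell, off_t offset, size_t size)
{
    cell->fop = COMPOUND_WRITEV;
    cell->args.writev.offset = offset;
    cell->args.writev.size   = size;
}
```

The response (point 6) would mirror this shape: an array of tagged unions of per-fop result structures.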

- Original Message -
> From: "Milind Changire" 
> To: "Jeff Darcy" 
> Cc: "Gluster Devel" 
> Sent: Friday, December 11, 2015 9:25:38 PM
> Subject: Re: [Gluster-devel] compound fop design first cut
> 
> 
> 
> On Wed, Dec 9, 2015 at 8:02 PM, Jeff Darcy < jda...@redhat.com > wrote:
> 
> 
> 
> 
> 
> On December 9, 2015 at 7:07:06 AM, Ira Cooper ( i...@redhat.com ) wrote:
> > A simple "abort on failure" and let the higher levels clean it up is
> > probably right for the type of compounding I propose. It is what SMB2
> > does. So, if you get an error return value, cancel the rest of the
> > request, and have it return ECOMPOUND as the errno.
> 
> This is exactly the part that worries me. If a compound operation
> fails, some parts of it will often need to be undone. “Let the higher
> levels clean it up” means that rollback code will be scattered among all
> of the translators that use compound operations. Some of them will do
> it right. Others . . . less so. ;) All will have to be tested
> separately. If we centralize dispatch of compound operations into one
> piece of code, we can centralize error detection and recovery likewise.
> That ensures uniformity of implementation, and facilitates focused
> testing (or even formal proof) of that implementation.
> 
> Can we gain the same benefits with a more generic design? Perhaps. It
> would require that the compounding translator know how to reverse each
> type of operation, so that it can do so after an error. That’s
> feasible, though it does mean maintaining a stack of undo actions
> instead of a simple state. It might also mean testing combinations and
> scenarios that will actually never occur in other components’ usage of
> the compounding feature. More likely it means that people will *think*
> they can use the facility in unanticipated ways, until their
> unanticipated usage creates a combination or scenario that was never
> tested and doesn’t work. Those are going to be hard problems to debug.
> I think it’s better to be explicit about which permutations we actually
> expect to work, and have those working earlier.
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
> 
> 
> 
> Could we have a dry-run phase and a commit phase for the compound operation?
> The dry-run phase could test the validity of the transaction and the
> commit phase could actually perform the operation.
> 
> If any operation in the dry-run sequence returns an error, the
> compound operation can be aborted immediately without the complexity of an
> undo ... scattered or centralized.
> 
> But if the subsequent operations depend on the changed state of the system
> from earlier operations, then we'll have to introduce a system state object
> for such transactions ... and maybe serialize such operations. The system
> state object can be passed through the operation sequence. How well this
> idea would work in a multi-threaded world is not clear to me either.
> 
> 
> 

-- 
Thanks,
Anuradha.
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] compound fop design first cut

2015-12-11 Thread Milind Changire
On Wed, Dec 9, 2015 at 8:02 PM, Jeff Darcy  wrote:

>
>
>
> On December 9, 2015 at 7:07:06 AM, Ira Cooper (i...@redhat.com) wrote:
> > A simple "abort on failure" and let the higher levels clean it up is
> > probably right for the type of compounding I propose. It is what SMB2
> > does. So, if you get an error return value, cancel the rest of the
> > request, and have it return ECOMPOUND as the errno.
>
> This is exactly the part that worries me.  If a compound operation
> fails, some parts of it will often need to be undone.  “Let the higher
> levels clean it up” means that rollback code will be scattered among all
> of the translators that use compound operations.  Some of them will do
> it right.  Others . . . less so.  ;)  All will have to be tested
> separately.  If we centralize dispatch of compound operations into one
> piece of code, we can centralize error detection and recovery likewise.
> That ensures uniformity of implementation, and facilitates focused
> testing (or even formal proof) of that implementation.
>
> Can we gain the same benefits with a more generic design?  Perhaps.  It
> would require that the compounding translator know how to reverse each
> type of operation, so that it can do so after an error.  That’s
> feasible, though it does mean maintaining a stack of undo actions
> instead of a simple state.  It might also mean testing combinations and
> scenarios that will actually never occur in other components’ usage of
> the compounding feature.  More likely it means that people will *think*
> they can use the facility in unanticipated ways, until their
> unanticipated usage creates a combination or scenario that was never
> tested and doesn’t work.  Those are going to be hard problems to debug.
> I think it’s better to be explicit about which permutations we actually
> expect to work, and have those working earlier.
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



Could we have a dry-run phase and a commit phase for the compound operation?
The dry-run phase could test the validity of the transaction and the
commit phase could actually perform the operation.

If any operation in the dry-run sequence returns an error,
the compound operation can be aborted immediately without the complexity of
an undo ... scattered or centralized.

But if the subsequent operations depend on the changed state of the system
from earlier operations, then we'll have to introduce a system state object
for such transactions ... and maybe serialize such operations. The system
state object can be passed through the operation sequence. How well this
idea would work in a multi-threaded world is not clear to me either.
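The dry-run/commit idea above can be sketched as a two-phase driver (illustrative callbacks, not an actual GlusterFS interface):

```c
#include <stddef.h>

/* Illustrative two-phase driver: every sub-op supplies a validate()
 * (dry-run) and a commit() callback; commits start only after all
 * validations pass, so validation failures need no undo logic. */
typedef struct {
    int  (*validate) (void *arg);  /* check preconditions, change nothing */
    int  (*commit)   (void *arg);  /* actually perform the operation */
    void *arg;
} two_phase_op_t;

static int
run_two_phase (two_phase_op_t *ops, int count)
{
    int i;

    for (i = 0; i < count; i++)            /* dry-run phase */
        if (ops[i].validate (ops[i].arg) < 0)
            return -1;                     /* abort before any state changed */
    for (i = 0; i < count; i++)            /* commit phase */
        if (ops[i].commit (ops[i].arg) < 0)
            return -1;                     /* commit failures still need care */
    return 0;
}

/* Stubs used by the test below. */
static int always_ok (void *a)   { (void) a; return 0; }
static int always_fail (void *a) { (void) a; return -1; }
```

As the mail notes, this only helps when later sub-ops do not depend on state changed by earlier commits; a commit-phase failure still leaves partial state behind.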
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] compound fop design first cut

2015-12-11 Thread Pranith Kumar Karampuri



On 12/09/2015 11:48 PM, Pranith Kumar Karampuri wrote:



On 12/09/2015 08:11 PM, Shyam wrote:

On 12/09/2015 02:37 AM, Soumya Koduri wrote:



On 12/09/2015 11:44 AM, Pranith Kumar Karampuri wrote:



On 12/09/2015 06:37 AM, Vijay Bellur wrote:

On 12/08/2015 03:45 PM, Jeff Darcy wrote:




On December 8, 2015 at 12:53:04 PM, Ira Cooper (i...@redhat.com) 
wrote:

Raghavendra Gowdappa writes:
I propose that we define a "compound op" that contains ops.

Within each op, there are fields that can be "inherited" from the
previous op, via use of a sentinel value.

Sentinel is -1, for all of these examples.

So:

LOOKUP (1, "foo") (Sets the gfid value to be picked up by
compounding, 1
is the root directory, as a gfid, by convention.)
OPEN(-1, O_RDWR) (Uses the gfid value, sets the glfd compound 
value.)

WRITE(-1, "foo", 3) (Uses the glfd compound value.)
CLOSE(-1) (Uses the glfd compound value)
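Ira's sentinel convention above could be resolved mechanically like this (a toy sketch; `in_handle`/`out_handle` are invented stand-ins for the gfid/glfd fields):

```c
/* Illustrative sentinel mechanism: an op whose handle argument is -1
 * inherits the value produced by the previous op, so OPEN(-1, ...)
 * reuses the gfid that LOOKUP resolved, WRITE(-1, ...) reuses the fd
 * that OPEN returned, and so on. */
#define SENTINEL (-1)

typedef struct {
    long in_handle;   /* gfid or fd argument; SENTINEL means "inherit" */
    long out_handle;  /* handle this op produces for the next one */
} chained_op_t;

/* Resolve sentinels left to right; returns 0 on success. */
static int
resolve_sentinels (chained_op_t *ops, int count)
{
    int  i;
    long last = SENTINEL;

    for (i = 0; i < count; i++) {
        if (ops[i].in_handle == SENTINEL) {
            if (last == SENTINEL)
                return -1;        /* nothing to inherit yet */
            ops[i].in_handle = last;
        }
        last = ops[i].out_handle;
    }
    return 0;
}
```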


So, basically, what the programming-language types would call futures
and promises.  It’s a good and well studied concept, which is necessary
to solve the second-order problem of how to specify an argument in
sub-operation N+1 that’s not known until sub-operation N completes.

To be honest, some of the highly general approaches suggested here scare
me too.  Wrapping up the arguments for one sub-operation in xdata for
another would get pretty hairy if we ever try to go beyond two
sub-operations and have to nest sub-operation #3’s args within
sub-operation #2’s xdata which is itself encoded within sub-operation
#1’s xdata.  There’s also not much clarity about how to handle errors in
that model.  Encoding N sub-operations’ arguments in a linear structure
as Shyam proposes seems a bit cleaner that way.  If I were to continue
down that route I’d suggest just having start_compound and end-compound
fops, plus an extra field (or by-convention xdata key) that either the
client-side or server-side translator could use to build whatever
structure it wants and schedule sub-operations however it wants.

However, I’d be even more comfortable with an even simpler approach that
avoids the need to solve what the database folks (who have dealt with
complex transactions for years) would tell us is a really hard problem.
Instead of designing for every case we can imagine, let’s design for the
cases that we know would be useful for improving performance.  Open plus
read/write plus close is an obvious one.  Raghavendra mentions
create+inodelk as well.  For each of those, we can easily define a
structure that contains the necessary fields, we don’t need a
client-side translator, and the server-side translator can take care of
“forwarding” results from one sub-operation to the next.  We could even
use GF_FOP_IPC to prototype this.  If we later find that the number of
“one-off” compound requests is growing too large, then at least we’ll
have some experience to guide our design of a more general alternative.
Right now, I think we’re trying to look further ahead than we can see
clearly.
Yes, agreed. This makes implementation on the client side simpler as
well, so it is welcome.

Just updating the solution:
1) New RPCs are going to be implemented.
2) The client stack will use these new fops.
3) On the server side the server xlator implements these new fops to 
decode the RPC request, then resolve_resume, and the compound-op-receiver 
(a better name is welcome) sends one op after the other and sends back 
the compound fop response.


@Pranith, I assume you would expand on this at a later date (something 
along the lines of what Soumya has done below, right?)


I will talk to her tomorrow to know more about this. I'm not saying this 
is what I will be implementing (there doesn't seem to be any consensus 
yet), but I would love to know how it is implemented.


Soumya and I had a discussion about this, and it seems like the NFS way 
of stuffing the args works out at a high level. Even the sentinel-value 
based approach may be possible. What I will do now is take a deep look 
at the structure and work out how all the fops mentioned in this thread 
can be implemented. I will update you all about my findings in a couple 
of days.


Pranith


Pranith




List of compound fops identified so far:
Swift/S3:
PUT: creat(), write()s, setxattr(), fsync(), close(), rename()

Dht:
mkdir + inodelk

Afr:
xattrop+writev, xattrop+unlock to begin with.

Could everyone who needs compound fops add to this list?

I see that Niels is back on 14th. Does anyone else know the list of
compound fops he has in mind?


 From the discussions we had with Niels regarding the kerberos support
on GlusterFS, I think below are the set of compound fops which are
required.

set_uid +
set_gid +
set_lkowner (or kerberos principal name) +
actual_fop

Also gfapi does lookup (first time/to refresh inode) before performing
actual fops most of the times. It may really help if we can club such
fops -


@Soumya +5 (just a random number :) )

This came to my mind as well

Re: [Gluster-devel] compound fop design first cut

2015-12-09 Thread Pranith Kumar Karampuri



On 12/09/2015 08:11 PM, Shyam wrote:

On 12/09/2015 02:37 AM, Soumya Koduri wrote:



On 12/09/2015 11:44 AM, Pranith Kumar Karampuri wrote:



On 12/09/2015 06:37 AM, Vijay Bellur wrote:

On 12/08/2015 03:45 PM, Jeff Darcy wrote:




On December 8, 2015 at 12:53:04 PM, Ira Cooper (i...@redhat.com) 
wrote:

Raghavendra Gowdappa writes:
I propose that we define a "compound op" that contains ops.

Within each op, there are fields that can be "inherited" from the
previous op, via use of a sentinel value.

Sentinel is -1, for all of these examples.

So:

LOOKUP (1, "foo") (Sets the gfid value to be picked up by
compounding, 1
is the root directory, as a gfid, by convention.)
OPEN(-1, O_RDWR) (Uses the gfid value, sets the glfd compound 
value.)

WRITE(-1, "foo", 3) (Uses the glfd compound value.)
CLOSE(-1) (Uses the glfd compound value)


So, basically, what the programming-language types would call futures
and promises.  It’s a good and well studied concept, which is 
necessary

to solve the second-order problem of how to specify an argument in
sub-operation N+1 that’s not known until sub-operation N completes.

To be honest, some of the highly general approaches suggested here
scare
me too.  Wrapping up the arguments for one sub-operation in xdata for
another would get pretty hairy if we ever try to go beyond two
sub-operations and have to nest sub-operation #3’s args within
sub-operation #2’s xdata which is itself encoded within sub-operation
#1’s xdata.  There’s also not much clarity about how to handle
errors in
that model.  Encoding N sub-operations’ arguments in a linear 
structure
as Shyam proposes seems a bit cleaner that way.  If I were to 
continue
down that route I’d suggest just having start_compound and 
end-compound
fops, plus an extra field (or by-convention xdata key) that either 
the

client-side or server-side translator could use to build whatever
structure it wants and schedule sub-operations however it wants.

However, I’d be even more comfortable with an even simpler approach
that
avoids the need to solve what the database folks (who have dealt with
complex transactions for years) would tell us is a really hard 
problem.

Instead of designing for every case we can imagine, let’s design for
the
cases that we know would be useful for improving performance. Open 
plus

read/write plus close is an obvious one.  Raghavendra mentions
create+inodelk as well.  For each of those, we can easily define a
structure that contains the necessary fields, we don’t need a
client-side translator, and the server-side translator can take 
care of
“forwarding” results from one sub-operation to the next. We could 
even
use GF_FOP_IPC to prototype this.  If we later find that the 
number of

“one-off” compound requests is growing too large, then at least we’ll
have some experience to guide our design of a more general 
alternative.

Right now, I think we’re trying to look further ahead than we can see
clearly.
Yes, agreed. This makes implementation on the client side simpler as
well, so it is welcome.

Just updating the solution:
1) New RPCs are going to be implemented.
2) The client stack will use these new fops.
3) On the server side the server xlator implements these new fops to 
decode the RPC request, then resolve_resume, and the compound-op-receiver 
(a better name is welcome) sends one op after the other and sends back 
the compound fop response.


@Pranith, I assume you would expand on this at a later date (something 
along the lines of what Soumya has done below, right?)


I will talk to her tomorrow to know more about this. I'm not saying this 
is what I will be implementing (there doesn't seem to be any consensus 
yet), but I would love to know how it is implemented.


Pranith




List of compound fops identified so far:
Swift/S3:
PUT: creat(), write()s, setxattr(), fsync(), close(), rename()

Dht:
mkdir + inodelk

Afr:
xattrop+writev, xattrop+unlock to begin with.

Could everyone who needs compound fops add to this list?

I see that Niels is back on 14th. Does anyone else know the list of
compound fops he has in mind?


 From the discussions we had with Niels regarding the kerberos support
on GlusterFS, I think below are the set of compound fops which are
required.

set_uid +
set_gid +
set_lkowner (or kerberos principal name) +
actual_fop

Also gfapi does lookup (first time/to refresh inode) before performing
actual fops most of the times. It may really help if we can club such
fops -


@Soumya +5 (just a random number :) )

This came to my mind as well, and is a good candidate for compounding.



LOOKUP + FOP (OPEN etc)

Coming to the design proposed, I agree with Shyam, Ira and Jeff's
thoughts. Defining different compound fops for each specific set of
operations and wrapping those arguments up in xdata seems rather complex
and difficult to maintain going forward. Having worked with NFS, may I
suggest that we follow (or something along similar lines) the approach
taken by the NFS protocol to define and implem

Re: [Gluster-devel] compound fop design first cut

2015-12-09 Thread Pranith Kumar Karampuri



On 12/09/2015 08:08 PM, Shyam wrote:

On 12/09/2015 12:52 AM, Pranith Kumar Karampuri wrote:



On 12/09/2015 10:39 AM, Prashanth Pai wrote:
However, I’d be even more comfortable with an even simpler approach 
that

avoids the need to solve what the database folks (who have dealt with
complex transactions for years) would tell us is a really hard 
problem.
Instead of designing for every case we can imagine, let’s design 
for the
cases that we know would be useful for improving performance.  Open 
plus

read/write plus close is an obvious one.  Raghavendra mentions
create+inodelk as well.

 From object interface (Swift/S3) perspective, this is the fop order
and flow for object operations:

GET: open(), fstat(), fgetxattr()s, read()s, close()

Krutika implemented fstat+fgetxattr (http://review.gluster.org/10180). In
posix there is an implementation of GF_CONTENT_KEY which is used to read
a file in lookup by quick-read. This needs to be exposed for fds as well
I think. So you can do all this using fstat on anon-fd.

HEAD: stat(), getxattr()s

Krutika already implemented this for sharding
http://review.gluster.org/10158. You can do this using stat fop.


I believe we need to fork this part of the conversation, i.e the stat 
+ xattr information clubbing.


My view on a stat for gluster is, POSIX stat + gluster extended 
information being returned. I state this as, a file system when it 
stats its inode, should get all information regarding the inode, and 
not just the POSIX ones. In the case of other local FS, the inode 
structure has more fields than just what POSIX needs, so when the 
inode is *read* the FS can populate all its internal inode information 
and return to the application/syscall the relevant fields that it needs.


I believe gluster should do the same, so in the cases above, we should 
actually extend our stat information (not elaborating how) to include 
all information from the brick, i.e stat from POSIX and all the 
extended attrs for the inode (file or dir). This can then be consumed 
by any layer as needed.


Currently, each layer adds what it needs in addition to the stat 
information in the xdata, as an xattr request, this can continue or go 
away, if the relevant FOPs return the whole inode information upward.


This also has useful outcomes in readdirp calls, where we get the 
extended stat information for each entry.

You can use "list-xattr" in xdata request to get this.


With the patches referred to, and older patches, this seems to be the 
direction sought (around 2013), any reasons why this is not prevalent 
across the stack and made so? Or am I mistaken?
No reason. We can revive it. There didn't seem to be any interest. So I 
didn't follow up to get it in.


Pranith



PUT: creat(), write()s, setxattr(), fsync(), close(), rename()

This I think should be a new compound fop. Nothing similar exists.

DELETE: getxattr(), unlink()

This can also be clubbed in unlink already because xdata exists on the
wire already.


Compounding some of these ops and exposing them as consumable libgfapi
APIs like glfs_get() and glfs_put() similar to librados compound
APIs[1] would greatly improve performance for object based access.

[1]:
https://github.com/ceph/ceph/blob/master/src/include/rados/librados.h#L2219 




Thanks.

- Prashanth Pai


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] compound fop design first cut

2015-12-09 Thread Shyam

On 12/09/2015 09:32 AM, Jeff Darcy wrote:




On December 9, 2015 at 7:07:06 AM, Ira Cooper (i...@redhat.com) wrote:

A simple "abort on failure" and let the higher levels clean it up is
probably right for the type of compounding I propose. It is what SMB2
does. So, if you get an error return value, cancel the rest of the
request, and have it return ECOMPOUND as the errno.


This is exactly the part that worries me.  If a compound operation
fails, some parts of it will often need to be undone.  “Let the higher
levels clean it up” means that rollback code will be scattered among all
of the translators that use compound operations.  Some of them will do
> it right.  Others . . . less so.  ;)  All will have to be tested
separately.  If we centralize dispatch of compound operations into one
piece of code, we can centralize error detection and recovery likewise.
That ensures uniformity of implementation, and facilitates focused
testing (or even formal proof) of that implementation.


My take on this is that whichever layer started the compounding takes 
care of the error handling. I do not see any requirement for undoing 
things that are already done, and would almost say (without further 
thought (that's the gunslinger in me talking ;) )) that this is not 
supported as part of the compounding.




Can we gain the same benefits with a more generic design?  Perhaps.  It
would require that the compounding translator know how to reverse each
type of operation, so that it can do so after an error.  That’s
feasible, though it does mean maintaining a stack of undo actions
instead of a simple state.  It might also mean testing combinations and
scenarios that will actually never occur in other components’ usage of
the compounding feature.  More likely it means that people will *think*
they can use the facility in unanticipated ways, until their
unanticipated usage creates a combination or scenario that was never
tested and doesn’t work.  Those are going to be hard problems to debug.
I think it’s better to be explicit about which permutations we actually
expect to work, and have those working earlier.


Jeff, a clarification: are you suggesting fop_xxx extensions for each 
compound operation supported?

Or are you suggesting a *single* FOP that carries compounded requests, 
but is specific about what requests can be compounded? (For example, it 
allows open+write, but when building out the compound request disallows, 
*say*, anything else.)


(If there is any doubt, I am with the latter and not so gaga about the 
former, as it explodes the FOP list.)


Also, I think the compound list has exploded (in this mail conversation) 
and provided a lot of compounding requests... I would say this means we 
need a clear way of doing the latter.
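For reference, the undo-stack alternative discussed in the quoted text could be sketched as follows (illustrative only; a real version would need a per-fop undo callback for every supported operation):

```c
/* Illustrative undo stack: the compounding translator records an undo
 * action after each successful sub-op and plays them back in reverse
 * order when a later sub-op fails. */
#define MAX_UNDO 16

typedef void (*undo_fn) (void *arg);

typedef struct {
    undo_fn fn[MAX_UNDO];
    void   *arg[MAX_UNDO];
    int     top;
} undo_stack_t;

static void
undo_push (undo_stack_t *s, undo_fn fn, void *arg)
{
    s->fn[s->top]  = fn;
    s->arg[s->top] = arg;
    s->top++;
}

/* On failure: roll back in reverse order of execution. */
static void
undo_unwind (undo_stack_t *s)
{
    while (s->top > 0) {
        s->top--;
        s->fn[s->top] (s->arg[s->top]);
    }
}

/* Stub for the test below: an undo action that decrements a counter. */
static void dec_counter (void *arg) { (*(int *) arg)--; }
```

The objection in this thread is exactly that maintaining such callbacks for every fop type, and testing every combination, is a large burden compared with explicitly supporting a known set of compound operations.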






P.S: Ignore this...
gunslinger: "a man who carries a gun and shoots well." I claim to be 
neither... just stating

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] compound fop design first cut

2015-12-09 Thread Rajesh Joseph


- Original Message -
> From: "Ira Cooper" 
> To: "Jeff Darcy" , "Raghavendra Gowdappa" 
> , "Pranith Kumar Karampuri"
> 
> Cc: "Gluster Devel" 
> Sent: Wednesday, December 9, 2015 5:37:05 PM
> Subject: Re: [Gluster-devel] compound fop design first cut
> 
> Jeff Darcy  writes:
> 
> > However, I’d be even more comfortable with an even simpler approach that
> > avoids the need to solve what the database folks (who have dealt with
> > complex transactions for years) would tell us is a really hard problem.
> > Instead of designing for every case we can imagine, let’s design for the
> > cases that we know would be useful for improving performance.  Open plus
> > read/write plus close is an obvious one.  Raghavendra mentions
> > create+inodelk as well.  For each of those, we can easily define a
> > structure that contains the necessary fields, we don’t need a
> > client-side translator, and the server-side translator can take care of
> > “forwarding” results from one sub-operation to the next.  We could even
> > use GF_FOP_IPC to prototype this.  If we later find that the number of
> > “one-off” compound requests is growing too large, then at least we’ll
> > have some experience to guide our design of a more general alternative.
> > Right now, I think we’re trying to look further ahead than we can see
> > clearly.
> 
> Actually, I'm taking the design, I've seen another network protocol use,
> SMB2, and proposing it here, I'd be shocked if NFS doesn't behave in the
> same way.
> 
> Interestingly, all the cases, really deal with a single file, and a
> single lock, and a single...
> 
> There's a reason I talked about a single sentinel value, and not
> multiple ones.  Because I wanted to keep it simple.  Yes, the extensions
> you mention are obvious, but they lead to a giant mess, that we may not
> want initially.  (But that we CAN extend into if we want them.  I made
> the choice not to go there because honestly, I found the complexity too
> much for me.)
> 
> A simple "abort on failure" and let the higher levels clean it up is
> probably right for the type of compounding I propose.  It is what SMB2
> does.  So, if you get an error return value, cancel the rest of the
> request, and have it return ECOMPOUND as the errno.
> 
> Note: How you keep the list to be compounded doesn't matter much to me.
> The semantics matter, because those are what I can ask for later, and
> allow us to create ops the original designers hadn't thought of, which
> is usually the hallmark of a good design.
> 
> I think you should look for a simple design you can "grow into" instead
> of creating one-off ops to satisfy a demand today.
> 

I agree with Ira here. This problem has already been addressed by NFS and SMB.
So instead of reinventing the wheel, let's pick the best bits from these
solutions and incorporate them in Gluster.

From a multi-protocol point of view we would like to compound operations like
open + set_leaseID + lk, and many more. With the current approach it would 
be really messy to have a separate function for each such combination and a 
dedicated translator to handle them.

As others have mentioned, I think it would be better to have a general
fop (fop_compound) which can handle compound fops. Each translator can
choose to implement it or not, and each translator can decide
whether to compound more fops or de-compound them. For example, currently
you can make the protocol server de-compound all the compound fops.

-Rajesh

> My thoughts,
> 
> -Ira

Re: [Gluster-devel] compound fop design first cut

2015-12-09 Thread Shyam

On 12/09/2015 02:37 AM, Soumya Koduri wrote:



On 12/09/2015 11:44 AM, Pranith Kumar Karampuri wrote:



On 12/09/2015 06:37 AM, Vijay Bellur wrote:

On 12/08/2015 03:45 PM, Jeff Darcy wrote:




On December 8, 2015 at 12:53:04 PM, Ira Cooper (i...@redhat.com) wrote:

Raghavendra Gowdappa writes:
I propose that we define a "compound op" that contains ops.

Within each op, there are fields that can be "inherited" from the
previous op, via use of a sentinel value.

Sentinel is -1, for all of these examples.

So:

LOOKUP (1, "foo") (Sets the gfid value to be picked up by
compounding, 1
is the root directory, as a gfid, by convention.)
OPEN(-1, O_RDWR) (Uses the gfid value, sets the glfd compound value.)
WRITE(-1, "foo", 3) (Uses the glfd compound value.)
CLOSE(-1) (Uses the glfd compound value)


So, basically, what the programming-language types would call futures
and promises.  It’s a good and well studied concept, which is necessary
to solve the second-order problem of how to specify an argument in
sub-operation N+1 that’s not known until sub-operation N completes.

To be honest, some of the highly general approaches suggested here
scare
me too.  Wrapping up the arguments for one sub-operation in xdata for
another would get pretty hairy if we ever try to go beyond two
sub-operations and have to nest sub-operation #3’s args within
sub-operation #2’s xdata which is itself encoded within sub-operation
#1’s xdata.  There’s also not much clarity about how to handle
errors in
that model.  Encoding N sub-operations’ arguments in a linear structure
as Shyam proposes seems a bit cleaner that way.  If I were to continue
down that route I’d suggest just having start_compound and end-compound
fops, plus an extra field (or by-convention xdata key) that either the
client-side or server-side translator could use to build whatever
structure it wants and schedule sub-operations however it wants.

However, I’d be even more comfortable with an even simpler approach
that
avoids the need to solve what the database folks (who have dealt with
complex transactions for years) would tell us is a really hard problem.
Instead of designing for every case we can imagine, let’s design for
the
cases that we know would be useful for improving performance. Open plus
read/write plus close is an obvious one.  Raghavendra mentions
create+inodelk as well.  For each of those, we can easily define a
structure that contains the necessary fields, we don’t need a
client-side translator, and the server-side translator can take care of
“forwarding” results from one sub-operation to the next.  We could even
use GF_FOP_IPC to prototype this.  If we later find that the number of
“one-off” compound requests is growing too large, then at least we’ll
have some experience to guide our design of a more general alternative.
Right now, I think we’re trying to look further ahead than we can see
clearly.

Yes, agreed. This makes the implementation on the client side simpler as well,
so it is welcome.

Just updating the solution.
1) New RPCs are going to be implemented.
2) The client stack will use these new fops.
3) On the server side we have the server xlator implementing these new fops
to decode the RPC request, then resolve_resume and the
compound-op-receiver (a better name is welcome), which sends one op
after another and sends the compound fop response.


@Pranith, I assume you would expand on this at a later date (something 
along the lines of what Soumya has done below, right?)




List of compound fops identified so far:
Swift/S3:
PUT: creat(), write()s, setxattr(), fsync(), close(), rename()

Dht:
mkdir + inodelk

Afr:
xattrop+writev, xattrop+unlock to begin with.

Could everyone who needs compound fops add to this list?

I see that Niels is back on 14th. Does anyone else know the list of
compound fops he has in mind?


 From the discussions we had with Niels regarding the kerberos support
on GlusterFS, I think below are the set of compound fops which are
required.

set_uid +
set_gid +
set_lkowner (or kerberos principal name) +
actual_fop

Also gfapi does a lookup (the first time / to refresh the inode) before
performing the actual fop most of the time. It may really help if we can
club such fops -


@Soumya +5 (just a random number :) )

This came to my mind as well, and is a good candidate for compounding.



LOOKUP + FOP (OPEN etc)

Coming to the design proposed, I agree with Shyam, Ira and Jeff's
thoughts. Defining different compound fops for each specific set of
operations and wrapping up those arguments in xdata seems rather complex
and difficult to maintain going further. Having worked with NFS,
may I suggest we follow (or something along the lines of) the approach
taken by the NFS protocol to define and implement compound procedures.

The basic structure of the NFS COMPOUND procedure is:

+-----+--------------+--------+-----------+-----------+-----------+--
| tag | minorversion | numops | op + args | op + args | op + args |
+-----+--------------+--------+-----------+-----------+-----------+--

Re: [Gluster-devel] compound fop design first cut

2015-12-09 Thread Shyam

On 12/09/2015 12:52 AM, Pranith Kumar Karampuri wrote:



On 12/09/2015 10:39 AM, Prashanth Pai wrote:

However, I’d be even more comfortable with an even simpler approach that
avoids the need to solve what the database folks (who have dealt with
complex transactions for years) would tell us is a really hard problem.
Instead of designing for every case we can imagine, let’s design for the
cases that we know would be useful for improving performance.  Open plus
read/write plus close is an obvious one.  Raghavendra mentions
create+inodelk as well.

 From an object interface (Swift/S3) perspective, this is the fop order
and flow for object operations:

GET: open(), fstat(), fgetxattr()s, read()s, close()

Krutika implemented fstat+fgetxattr (http://review.gluster.org/10180). In
posix there is an implementation of GF_CONTENT_KEY which is used to read
a file in lookup by quick-read. This needs to be exposed for fds as well,
I think. So you can do all this using fstat on an anon-fd.

HEAD: stat(), getxattr()s

Krutika already implemented this for sharding
(http://review.gluster.org/10158). You can do this using the stat fop.


I believe we need to fork this part of the conversation, i.e. the stat + 
xattr information clubbing.


My view of a stat in gluster is POSIX stat plus gluster extended 
information being returned. I state this because a file system, when it 
stats an inode, should get all information regarding the inode, and not 
just the POSIX fields. In the case of other local FSes, the inode 
structure has more fields than what POSIX needs, so when the inode is 
*read* the FS can populate all its internal inode information and return 
to the application/syscall the relevant fields that it needs.


I believe gluster should do the same, so in the cases above we should 
actually extend our stat information (not elaborating how) to include 
all information from the brick, i.e. the stat from POSIX and all the 
extended attrs for the inode (file or dir). This can then be consumed by 
any layer as needed.


Currently, each layer adds what it needs, in addition to the stat 
information, in the xdata as an xattr request; this can continue or go 
away if the relevant FOPs return the whole inode information upward.


This also has useful outcomes in readdirp calls, where we get the 
extended stat information for each entry.


With the patches referred to, and older patches, this seems to be the 
direction sought (around 2013); any reason why this is not prevalent 
across the stack and made so? Or am I mistaken?



PUT: creat(), write()s, setxattr(), fsync(), close(), rename()

This I think should be a new compound fop. Nothing similar exists.

DELETE: getxattr(), unlink()

This can also be clubbed into unlink already, because xdata already
exists on the wire.


Compounding some of these ops and exposing them as consumable libgfapi
APIs like glfs_get() and glfs_put(), similar to the librados compound
APIs [1], would greatly improve performance for object-based access.

[1]:
https://github.com/ceph/ceph/blob/master/src/include/rados/librados.h#L2219


Thanks.

- Prashanth Pai



Re: [Gluster-devel] compound fop design first cut

2015-12-09 Thread Poornima Gurusiddaiah
libgfapi compound fops added inline.

- Original Message -
> From: "Kotresh Hiremath Ravishankar" 
> To: "Pranith Kumar Karampuri" 
> Cc: "Gluster Devel" 
> Sent: Wednesday, December 9, 2015 2:18:47 PM
> Subject: Re: [Gluster-devel] compound fop design first cut
> 
> Geo-rep requirements inline.
> 
> Thanks and Regards,
> Kotresh H R
> 
> - Original Message -
> > From: "Pranith Kumar Karampuri" 
> > To: "Vijay Bellur" , "Jeff Darcy" ,
> > "Raghavendra Gowdappa"
> > , "Ira Cooper" 
> > Cc: "Gluster Devel" 
> > Sent: Wednesday, December 9, 2015 11:44:52 AM
> > Subject: Re: [Gluster-devel] compound fop design first cut
> > 
> > 
> > 
> > On 12/09/2015 06:37 AM, Vijay Bellur wrote:
> > > On 12/08/2015 03:45 PM, Jeff Darcy wrote:
> > >>
> > >>
> > >>
> > >> On December 8, 2015 at 12:53:04 PM, Ira Cooper (i...@redhat.com) wrote:
> > >>> Raghavendra Gowdappa writes:
> > >>> I propose that we define a "compound op" that contains ops.
> > >>>
> > >>> Within each op, there are fields that can be "inherited" from the
> > >>> previous op, via use of a sentinel value.
> > >>>
> > >>> Sentinel is -1, for all of these examples.
> > >>>
> > >>> So:
> > >>>
> > >>> LOOKUP (1, "foo") (Sets the gfid value to be picked up by
> > >>> compounding, 1
> > >>> is the root directory, as a gfid, by convention.)
> > >>> OPEN(-1, O_RDWR) (Uses the gfid value, sets the glfd compound value.)
> > >>> WRITE(-1, "foo", 3) (Uses the glfd compound value.)
> > >>> CLOSE(-1) (Uses the glfd compound value)
> > >>
> > >> So, basically, what the programming-language types would call futures
> > >> and promises.  It’s a good and well studied concept, which is necessary
> > >> to solve the second-order problem of how to specify an argument in
> > >> sub-operation N+1 that’s not known until sub-operation N completes.
> > >>
> > >> To be honest, some of the highly general approaches suggested here scare
> > >> me too.  Wrapping up the arguments for one sub-operation in xdata for
> > >> another would get pretty hairy if we ever try to go beyond two
> > >> sub-operations and have to nest sub-operation #3’s args within
> > >> sub-operation #2’s xdata which is itself encoded within sub-operation
> > >> #1’s xdata.  There’s also not much clarity about how to handle errors in
> > >> that model.  Encoding N sub-operations’ arguments in a linear structure
> > >> as Shyam proposes seems a bit cleaner that way.  If I were to continue
> > >> down that route I’d suggest just having start_compound and end-compound
> > >> fops, plus an extra field (or by-convention xdata key) that either the
> > >> client-side or server-side translator could use to build whatever
> > >> structure it wants and schedule sub-operations however it wants.
> > >>
> > >> However, I’d be even more comfortable with an even simpler approach that
> > >> avoids the need to solve what the database folks (who have dealt with
> > >> complex transactions for years) would tell us is a really hard problem.
> > >> Instead of designing for every case we can imagine, let’s design for the
> > >> cases that we know would be useful for improving performance. Open plus
> > >> read/write plus close is an obvious one.  Raghavendra mentions
> > >> create+inodelk as well.  For each of those, we can easily define a
> > >> structure that contains the necessary fields, we don’t need a
> > >> client-side translator, and the server-side translator can take care of
> > >> “forwarding” results from one sub-operation to the next.  We could even
> > >> use GF_FOP_IPC to prototype this.  If we later find that the number of
> > >> “one-off” compound requests is growing too large, then at least we’ll
> > >> have some experience to guide our design of a more general alternative.
> > >> Right now, I think we’re trying to look further ahead than we can see
> > >> clearly.
> > Yes Agree. This makes implementation on the client side simpler as well.
> > So it is welcome.
> > 
> > Just updating the solution.
> > 1) New RPCs are going to be implemented.
>

Re: [Gluster-devel] compound fop design first cut

2015-12-09 Thread Jeff Darcy



On December 9, 2015 at 7:07:06 AM, Ira Cooper (i...@redhat.com) wrote:
> A simple "abort on failure" and let the higher levels clean it up is
> probably right for the type of compounding I propose. It is what SMB2
> does. So, if you get an error return value, cancel the rest of the
> request, and have it return ECOMPOUND as the errno.

This is exactly the part that worries me.  If a compound operation
fails, some parts of it will often need to be undone.  “Let the higher
levels clean it up” means that rollback code will be scattered among all
of the translators that use compound operations.  Some of them will do
it right.  Others . . . less so.  ;)  All will have to be tested
separately.  If we centralize dispatch of compound operations into one
piece of code, we can centralize error detection and recovery likewise.
That ensures uniformity of implementation, and facilitates focused
testing (or even formal proof) of that implementation.

Can we gain the same benefits with a more generic design?  Perhaps.  It
would require that the compounding translator know how to reverse each
type of operation, so that it can do so after an error.  That’s
feasible, though it does mean maintaining a stack of undo actions
instead of a simple state.  It might also mean testing combinations and
scenarios that will actually never occur in other components’ usage of
the compounding feature.  More likely it means that people will *think*
they can use the facility in unanticipated ways, until their
unanticipated usage creates a combination or scenario that was never
tested and doesn’t work.  Those are going to be hard problems to debug.
I think it’s better to be explicit about which permutations we actually
expect to work, and have those working earlier.

Re: [Gluster-devel] compound fop design first cut

2015-12-09 Thread Ira Cooper
Jeff Darcy  writes:

> However, I’d be even more comfortable with an even simpler approach that
> avoids the need to solve what the database folks (who have dealt with
> complex transactions for years) would tell us is a really hard problem.
> Instead of designing for every case we can imagine, let’s design for the
> cases that we know would be useful for improving performance.  Open plus
> read/write plus close is an obvious one.  Raghavendra mentions
> create+inodelk as well.  For each of those, we can easily define a
> structure that contains the necessary fields, we don’t need a
> client-side translator, and the server-side translator can take care of
> “forwarding” results from one sub-operation to the next.  We could even
> use GF_FOP_IPC to prototype this.  If we later find that the number of
> “one-off” compound requests is growing too large, then at least we’ll
> have some experience to guide our design of a more general alternative.
> Right now, I think we’re trying to look further ahead than we can see
> clearly.

Actually, I'm taking the design I've seen another network protocol use,
SMB2, and proposing it here; I'd be shocked if NFS doesn't behave the
same way.

Interestingly, all the cases really deal with a single file, and a
single lock, and a single...

There's a reason I talked about a single sentinel value, and not
multiple ones.  Because I wanted to keep it simple.  Yes, the extensions
you mention are obvious, but they lead to a giant mess, that we may not
want initially.  (But that we CAN extend into if we want them.  I made
the choice not to go there because honestly, I found the complexity too
much for me.)

A simple "abort on failure" and let the higher levels clean it up is
probably right for the type of compounding I propose.  It is what SMB2
does.  So, if you get an error return value, cancel the rest of the
request, and have it return ECOMPOUND as the errno.

Note: How you keep the list to be compounded doesn't matter much to me.
The semantics matter, because those are what I can ask for later, and
allow us to create ops the original designers hadn't thought of, which
is usually the hallmark of a good design.

I think you should look for a simple design you can "grow into" instead
of creating one-off ops to satisfy a demand today.

My thoughts,

-Ira

Re: [Gluster-devel] compound fop design first cut

2015-12-09 Thread Kotresh Hiremath Ravishankar
Geo-rep requirements inline.

Thanks and Regards,
Kotresh H R

- Original Message -
> From: "Pranith Kumar Karampuri" 
> To: "Vijay Bellur" , "Jeff Darcy" , 
> "Raghavendra Gowdappa"
> , "Ira Cooper" 
> Cc: "Gluster Devel" 
> Sent: Wednesday, December 9, 2015 11:44:52 AM
> Subject: Re: [Gluster-devel] compound fop design first cut
> 
> 
> 
> On 12/09/2015 06:37 AM, Vijay Bellur wrote:
> > On 12/08/2015 03:45 PM, Jeff Darcy wrote:
> >>
> >>
> >>
> >> On December 8, 2015 at 12:53:04 PM, Ira Cooper (i...@redhat.com) wrote:
> >>> Raghavendra Gowdappa writes:
> >>> I propose that we define a "compound op" that contains ops.
> >>>
> >>> Within each op, there are fields that can be "inherited" from the
> >>> previous op, via use of a sentinel value.
> >>>
> >>> Sentinel is -1, for all of these examples.
> >>>
> >>> So:
> >>>
> >>> LOOKUP (1, "foo") (Sets the gfid value to be picked up by
> >>> compounding, 1
> >>> is the root directory, as a gfid, by convention.)
> >>> OPEN(-1, O_RDWR) (Uses the gfid value, sets the glfd compound value.)
> >>> WRITE(-1, "foo", 3) (Uses the glfd compound value.)
> >>> CLOSE(-1) (Uses the glfd compound value)
> >>
> >> So, basically, what the programming-language types would call futures
> >> and promises.  It’s a good and well studied concept, which is necessary
> >> to solve the second-order problem of how to specify an argument in
> >> sub-operation N+1 that’s not known until sub-operation N completes.
> >>
> >> To be honest, some of the highly general approaches suggested here scare
> >> me too.  Wrapping up the arguments for one sub-operation in xdata for
> >> another would get pretty hairy if we ever try to go beyond two
> >> sub-operations and have to nest sub-operation #3’s args within
> >> sub-operation #2’s xdata which is itself encoded within sub-operation
> >> #1’s xdata.  There’s also not much clarity about how to handle errors in
> >> that model.  Encoding N sub-operations’ arguments in a linear structure
> >> as Shyam proposes seems a bit cleaner that way.  If I were to continue
> >> down that route I’d suggest just having start_compound and end-compound
> >> fops, plus an extra field (or by-convention xdata key) that either the
> >> client-side or server-side translator could use to build whatever
> >> structure it wants and schedule sub-operations however it wants.
> >>
> >> However, I’d be even more comfortable with an even simpler approach that
> >> avoids the need to solve what the database folks (who have dealt with
> >> complex transactions for years) would tell us is a really hard problem.
> >> Instead of designing for every case we can imagine, let’s design for the
> >> cases that we know would be useful for improving performance. Open plus
> >> read/write plus close is an obvious one.  Raghavendra mentions
> >> create+inodelk as well.  For each of those, we can easily define a
> >> structure that contains the necessary fields, we don’t need a
> >> client-side translator, and the server-side translator can take care of
> >> “forwarding” results from one sub-operation to the next.  We could even
> >> use GF_FOP_IPC to prototype this.  If we later find that the number of
> >> “one-off” compound requests is growing too large, then at least we’ll
> >> have some experience to guide our design of a more general alternative.
> >> Right now, I think we’re trying to look further ahead than we can see
> >> clearly.
> Yes Agree. This makes implementation on the client side simpler as well.
> So it is welcome.
> 
> Just updating the solution.
> 1) New RPCs are going to be implemented.
> 2) client stack will use these new fops.
> 3) On the server side we have server xlator implementing these new fops
> to decode the RPC request then resolve_resume and
> compound-op-receiver(Better name for this is welcome) which sends one op
> after other and send compound fop response.
> 
> List of compound fops identified so far:
> Swift/S3:
> PUT: creat(), write()s, setxattr(), fsync(), close(), rename()
> 
> Dht:
> mkdir + inodelk
> 
> Afr:
> xattrop+writev, xattrop+unlock to begin with.

  Geo-rep:
  mknod, entrylk, stat (on backend gfid)
  mkdir, entrylk, stat (on backend gfid)
  symlink, entrylk, stat (on backend

Re: [Gluster-devel] compound fop design first cut

2015-12-08 Thread Soumya Koduri



On 12/09/2015 11:44 AM, Pranith Kumar Karampuri wrote:



On 12/09/2015 06:37 AM, Vijay Bellur wrote:

On 12/08/2015 03:45 PM, Jeff Darcy wrote:




On December 8, 2015 at 12:53:04 PM, Ira Cooper (i...@redhat.com) wrote:

Raghavendra Gowdappa writes:
I propose that we define a "compound op" that contains ops.

Within each op, there are fields that can be "inherited" from the
previous op, via use of a sentinel value.

Sentinel is -1, for all of these examples.

So:

LOOKUP (1, "foo") (Sets the gfid value to be picked up by
compounding, 1
is the root directory, as a gfid, by convention.)
OPEN(-1, O_RDWR) (Uses the gfid value, sets the glfd compound value.)
WRITE(-1, "foo", 3) (Uses the glfd compound value.)
CLOSE(-1) (Uses the glfd compound value)


So, basically, what the programming-language types would call futures
and promises.  It’s a good and well studied concept, which is necessary
to solve the second-order problem of how to specify an argument in
sub-operation N+1 that’s not known until sub-operation N completes.

To be honest, some of the highly general approaches suggested here scare
me too.  Wrapping up the arguments for one sub-operation in xdata for
another would get pretty hairy if we ever try to go beyond two
sub-operations and have to nest sub-operation #3’s args within
sub-operation #2’s xdata which is itself encoded within sub-operation
#1’s xdata.  There’s also not much clarity about how to handle errors in
that model.  Encoding N sub-operations’ arguments in a linear structure
as Shyam proposes seems a bit cleaner that way.  If I were to continue
down that route I’d suggest just having start_compound and end-compound
fops, plus an extra field (or by-convention xdata key) that either the
client-side or server-side translator could use to build whatever
structure it wants and schedule sub-operations however it wants.

However, I’d be even more comfortable with an even simpler approach that
avoids the need to solve what the database folks (who have dealt with
complex transactions for years) would tell us is a really hard problem.
Instead of designing for every case we can imagine, let’s design for the
cases that we know would be useful for improving performance. Open plus
read/write plus close is an obvious one.  Raghavendra mentions
create+inodelk as well.  For each of those, we can easily define a
structure that contains the necessary fields, we don’t need a
client-side translator, and the server-side translator can take care of
“forwarding” results from one sub-operation to the next.  We could even
use GF_FOP_IPC to prototype this.  If we later find that the number of
“one-off” compound requests is growing too large, then at least we’ll
have some experience to guide our design of a more general alternative.
Right now, I think we’re trying to look further ahead than we can see
clearly.

Yes, agreed. This makes the implementation on the client side simpler as well,
so it is welcome.

Just updating the solution.
1) New RPCs are going to be implemented.
2) The client stack will use these new fops.
3) On the server side we have the server xlator implementing these new fops
to decode the RPC request, then resolve_resume and the
compound-op-receiver (a better name is welcome), which sends one op
after another and sends the compound fop response.

List of compound fops identified so far:
Swift/S3:
PUT: creat(), write()s, setxattr(), fsync(), close(), rename()

Dht:
mkdir + inodelk

Afr:
xattrop+writev, xattrop+unlock to begin with.

Could everyone who needs compound fops add to this list?

I see that Niels is back on 14th. Does anyone else know the list of
compound fops he has in mind?

From the discussions we had with Niels regarding the kerberos support 
on GlusterFS, I think below are the set of compound fops which are required.


set_uid +
set_gid +
set_lkowner (or kerberos principal name) +
actual_fop

Also gfapi does a lookup (the first time / to refresh the inode) before 
performing the actual fop most of the time. It may really help if we can club such fops -


LOOKUP + FOP (OPEN etc)

Coming to the design proposed, I agree with Shyam, Ira and Jeff's 
thoughts. Defining different compound fops for each specific set of 
operations and wrapping up those arguments in xdata seems rather complex 
and difficult to maintain going further. Having worked with NFS, 
may I suggest we follow (or something along the lines of) the approach 
taken by the NFS protocol to define and implement compound procedures.


   The basic structure of the NFS COMPOUND procedure is:

   +-----+--------------+--------+-----------+-----------+-----------+--
   | tag | minorversion | numops | op + args | op + args | op + args |
   +-----+--------------+--------+-----------+-----------+-----------+--

   and the reply's structure is:

   +-------------+-----+--------+-----------------------+--
   | last status | tag | numres | status + op + results |
   +-------------+-----+--------+-----------------------+--

Each compound procedure will con

Re: [Gluster-devel] compound fop design first cut

2015-12-08 Thread Pranith Kumar Karampuri



On 12/09/2015 06:37 AM, Vijay Bellur wrote:

On 12/08/2015 03:45 PM, Jeff Darcy wrote:




On December 8, 2015 at 12:53:04 PM, Ira Cooper (i...@redhat.com) wrote:

Raghavendra Gowdappa writes:
I propose that we define a "compound op" that contains ops.

Within each op, there are fields that can be "inherited" from the
previous op, via use of a sentinel value.

Sentinel is -1, for all of these examples.

So:

LOOKUP (1, "foo") (Sets the gfid value to be picked up by 
compounding, 1

is the root directory, as a gfid, by convention.)
OPEN(-1, O_RDWR) (Uses the gfid value, sets the glfd compound value.)
WRITE(-1, "foo", 3) (Uses the glfd compound value.)
CLOSE(-1) (Uses the glfd compound value)


So, basically, what the programming-language types would call futures
and promises.  It’s a good and well studied concept, which is necessary
to solve the second-order problem of how to specify an argument in
sub-operation N+1 that’s not known until sub-operation N completes.

To be honest, some of the highly general approaches suggested here scare
me too.  Wrapping up the arguments for one sub-operation in xdata for
another would get pretty hairy if we ever try to go beyond two
sub-operations and have to nest sub-operation #3’s args within
sub-operation #2’s xdata which is itself encoded within sub-operation
#1’s xdata.  There’s also not much clarity about how to handle errors in
that model.  Encoding N sub-operations’ arguments in a linear structure
as Shyam proposes seems a bit cleaner that way.  If I were to continue
down that route I’d suggest just having start_compound and end-compound
fops, plus an extra field (or by-convention xdata key) that either the
client-side or server-side translator could use to build whatever
structure it wants and schedule sub-operations however it wants.

However, I’d be even more comfortable with an even simpler approach that
avoids the need to solve what the database folks (who have dealt with
complex transactions for years) would tell us is a really hard problem.
Instead of designing for every case we can imagine, let’s design for the
cases that we know would be useful for improving performance. Open plus
read/write plus close is an obvious one.  Raghavendra mentions
create+inodelk as well.  For each of those, we can easily define a
structure that contains the necessary fields, we don’t need a
client-side translator, and the server-side translator can take care of
“forwarding” results from one sub-operation to the next.  We could even
use GF_FOP_IPC to prototype this.  If we later find that the number of
“one-off” compound requests is growing too large, then at least we’ll
have some experience to guide our design of a more general alternative.
Right now, I think we’re trying to look further ahead than we can see
clearly.
Yes, agreed. This makes the implementation on the client side simpler as well, 
so it is welcome.


Just updating the solution.
1) New RPCs are going to be implemented.
2) The client stack will use these new fops.
3) On the server side we have the server xlator implementing these new fops 
to decode the RPC request, then resolve_resume and the 
compound-op-receiver (a better name is welcome), which sends one op 
after another and sends the compound fop response.


List of compound fops identified so far:
Swift/S3:
PUT: creat(), write()s, setxattr(), fsync(), close(), rename()

Dht:
mkdir + inodelk

Afr:
xattrop+writev, xattrop+unlock to begin with.

Could everyone who needs compound fops add to this list?

I see that Niels is back on the 14th. Does anyone else know the list of 
compound fops he has in mind?


Pranith.


Starting with a well-defined set of operations for compounding has its 
advantages. It would be easier to understand and maintain correctness 
across the stack. Some of our translators perform transactions and 
create/update internal metadata for certain fops. It would be easier 
for such translators if the compound operations are well defined and 
do not entail deep introspection of a generic representation to 
ensure that the right behavior gets reflected at the end of a compound 
operation.


-Vijay





___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] compound fop design first cut

2015-12-08 Thread Pranith Kumar Karampuri



On 12/09/2015 10:39 AM, Prashanth Pai wrote:
  

However, I’d be even more comfortable with an even simpler approach that
avoids the need to solve what the database folks (who have dealt with
complex transactions for years) would tell us is a really hard problem.
Instead of designing for every case we can imagine, let’s design for the
cases that we know would be useful for improving performance.  Open plus
read/write plus close is an obvious one.  Raghavendra mentions
create+inodelk as well.

From an object interface (Swift/S3) perspective, this is the fop order and flow 
for object operations:

GET: open(), fstat(), fgetxattr()s, read()s, close()
Krutika implemented fstat+fgetxattr (http://review.gluster.org/10180). In 
posix there is an implementation of GF_CONTENT_KEY which quick-read uses 
to read a file during lookup. This needs to be exposed for fds as well, 
I think. So you can do all of this using fstat on an anon-fd.

HEAD: stat(), getxattr()s
Krutika already implemented this for sharding 
(http://review.gluster.org/10158). You can do this using the stat fop.

PUT: creat(), write()s, setxattr(), fsync(), close(), rename()

This I think should be a new compound fop. Nothing similar exists.

DELETE: getxattr(), unlink()
This can also be clubbed into unlink, since xdata already exists on the 
wire.


Compounding some of these ops and exposing them as consumable libgfapi APIs 
like glfs_get() and glfs_put() similar to librados compound APIs[1] would 
greatly improve performance for object based access.

[1]: https://github.com/ceph/ceph/blob/master/src/include/rados/librados.h#L2219

Thanks.

- Prashanth Pai



Re: [Gluster-devel] compound fop design first cut

2015-12-08 Thread Prashanth Pai
 
> However, I’d be even more comfortable with an even simpler approach that
> avoids the need to solve what the database folks (who have dealt with
> complex transactions for years) would tell us is a really hard problem.
> Instead of designing for every case we can imagine, let’s design for the
> cases that we know would be useful for improving performance.  Open plus
> read/write plus close is an obvious one.  Raghavendra mentions
> create+inodelk as well.

From an object interface (Swift/S3) perspective, this is the fop order and flow 
for object operations:

GET: open(), fstat(), fgetxattr()s, read()s, close()
HEAD: stat(), getxattr()s
PUT: creat(), write()s, setxattr(), fsync(), close(), rename()
DELETE: getxattr(), unlink()

Compounding some of these ops and exposing them as consumable libgfapi APIs 
like glfs_get() and glfs_put() similar to librados compound APIs[1] would 
greatly improve performance for object based access.

[1]: https://github.com/ceph/ceph/blob/master/src/include/rados/librados.h#L2219

Thanks.

- Prashanth Pai

Re: [Gluster-devel] compound fop design first cut

2015-12-08 Thread Vijay Bellur

On 12/08/2015 03:45 PM, Jeff Darcy wrote:




On December 8, 2015 at 12:53:04 PM, Ira Cooper (i...@redhat.com) wrote:

Raghavendra Gowdappa writes:
I propose that we define a "compound op" that contains ops.

Within each op, there are fields that can be "inherited" from the
previous op, via use of a sentinel value.

Sentinel is -1, for all of these examples.

So:

LOOKUP (1, "foo") (Sets the gfid value to be picked up by compounding, 1
is the root directory, as a gfid, by convention.)
OPEN(-1, O_RDWR) (Uses the gfid value, sets the glfd compound value.)
WRITE(-1, "foo", 3) (Uses the glfd compound value.)
CLOSE(-1) (Uses the glfd compound value)


So, basically, what the programming-language types would call futures
and promises.  It’s a good and well studied concept, which is necessary
to solve the second-order problem of how to specify an argument in
sub-operation N+1 that’s not known until sub-operation N completes.

To be honest, some of the highly general approaches suggested here scare
me too.  Wrapping up the arguments for one sub-operation in xdata for
another would get pretty hairy if we ever try to go beyond two
sub-operations and have to nest sub-operation #3’s args within
sub-operation #2’s xdata which is itself encoded within sub-operation
#1’s xdata.  There’s also not much clarity about how to handle errors in
that model.  Encoding N sub-operations’ arguments in a linear structure
as Shyam proposes seems a bit cleaner that way.  If I were to continue
down that route I’d suggest just having start_compound and end-compound
fops, plus an extra field (or by-convention xdata key) that either the
client-side or server-side translator could use to build whatever
structure it wants and schedule sub-operations however it wants.

However, I’d be even more comfortable with an even simpler approach that
avoids the need to solve what the database folks (who have dealt with
complex transactions for years) would tell us is a really hard problem.
Instead of designing for every case we can imagine, let’s design for the
cases that we know would be useful for improving performance.  Open plus
read/write plus close is an obvious one.  Raghavendra mentions
create+inodelk as well.  For each of those, we can easily define a
structure that contains the necessary fields, we don’t need a
client-side translator, and the server-side translator can take care of
“forwarding” results from one sub-operation to the next.  We could even
use GF_FOP_IPC to prototype this.  If we later find that the number of
“one-off” compound requests is growing too large, then at least we’ll
have some experience to guide our design of a more general alternative.
Right now, I think we’re trying to look further ahead than we can see
clearly.


Starting with a well-defined set of operations for compounding has its 
advantages. It would be easier to understand and maintain correctness 
across the stack. Some of our translators perform transactions and 
create/update internal metadata for certain fops. It would be easier for 
such translators if the compound operations are well defined and do 
not entail deep introspection of a generic representation to ensure that 
the right behavior gets reflected at the end of a compound operation.


-Vijay




Re: [Gluster-devel] compound fop design first cut

2015-12-08 Thread Jeff Darcy



On December 8, 2015 at 12:53:04 PM, Ira Cooper (i...@redhat.com) wrote:
> Raghavendra Gowdappa writes:
> I propose that we define a "compound op" that contains ops.
>  
> Within each op, there are fields that can be "inherited" from the
> previous op, via use of a sentinel value.
>  
> Sentinel is -1, for all of these examples.
>  
> So:
>  
> LOOKUP (1, "foo") (Sets the gfid value to be picked up by compounding, 1
> is the root directory, as a gfid, by convention.)
> OPEN(-1, O_RDWR) (Uses the gfid value, sets the glfd compound value.)
> WRITE(-1, "foo", 3) (Uses the glfd compound value.)
> CLOSE(-1) (Uses the glfd compound value)

So, basically, what the programming-language types would call futures
and promises.  It’s a good and well studied concept, which is necessary
to solve the second-order problem of how to specify an argument in
sub-operation N+1 that’s not known until sub-operation N completes.

To be honest, some of the highly general approaches suggested here scare
me too.  Wrapping up the arguments for one sub-operation in xdata for
another would get pretty hairy if we ever try to go beyond two
sub-operations and have to nest sub-operation #3’s args within
sub-operation #2’s xdata which is itself encoded within sub-operation
#1’s xdata.  There’s also not much clarity about how to handle errors in
that model.  Encoding N sub-operations’ arguments in a linear structure
as Shyam proposes seems a bit cleaner that way.  If I were to continue
down that route I’d suggest just having start_compound and end-compound
fops, plus an extra field (or by-convention xdata key) that either the
client-side or server-side translator could use to build whatever
structure it wants and schedule sub-operations however it wants.

However, I’d be even more comfortable with an even simpler approach that
avoids the need to solve what the database folks (who have dealt with
complex transactions for years) would tell us is a really hard problem.
Instead of designing for every case we can imagine, let’s design for the
cases that we know would be useful for improving performance.  Open plus
read/write plus close is an obvious one.  Raghavendra mentions
create+inodelk as well.  For each of those, we can easily define a
structure that contains the necessary fields, we don’t need a
client-side translator, and the server-side translator can take care of
“forwarding” results from one sub-operation to the next.  We could even
use GF_FOP_IPC to prototype this.  If we later find that the number of
“one-off” compound requests is growing too large, then at least we’ll
have some experience to guide our design of a more general alternative.
Right now, I think we’re trying to look further ahead than we can see
clearly.

Re: [Gluster-devel] compound fop design first cut

2015-12-08 Thread Ira Cooper
Raghavendra Gowdappa  writes:

> From what I can see, new compound ops will _evolve_ in future based on 
> requirements unseen as of now.

Yes,

That is the one thing you can count on here ;)

The compounding architecture proposed here scares me, to be honest.

The complexity it can create is pretty immense.

I'm going to suggest a simpler scheme to you; there is no API provided,
but I think you'll see the idea, which is the key.  Then APIs and
whatnot can follow.

In the proposal today, if I want to compound ops A and B, I basically
have to write compound_A_and_B.

That will create headaches for anyone who comes up with neat ideas, or
needs to do longer and longer compounds :(.

I propose that we define a "compound op" that contains ops.

Within each op, there are fields that can be "inherited" from the
previous op, via use of a sentinel value.

Sentinel is -1, for all of these examples.

So:

LOOKUP (1, "foo") (Sets the gfid value to be picked up by compounding, 1
is the root directory, as a gfid, by convention.)
OPEN(-1, O_RDWR) (Uses the gfid value, sets the glfd compound value.)
WRITE(-1, "foo", 3) (Uses the glfd compound value.)
CLOSE(-1) (Uses the glfd compound value)

Note, that we can define what fields can take "sentinel" values, so
things like leases, locks etc, can all be handled properly.

The other trick is, if we return an error, we MUST stop the compound,
and return the rest of the return results as ECOMPOUND or some similar
value.  The actual thing that errored should return proper error codes.

Now, the cute thing about this is that a translator can look at a compound
stream, element by element, and decide what to do with it: pass it
along, or break the stream and handle the compound semantics itself.

So this actually fits well with gluster's architecture, of being very
composable :).

I'm interested in your thoughts on where the edges of this proposal may
be, and if it meets your needs.

Thanks,

-Ira


Re: [Gluster-devel] compound fop design first cut

2015-12-07 Thread Raghavendra Gowdappa
> 
> On 12/08/2015 09:02 AM, Pranith Kumar Karampuri wrote:
> >
> >
> > On 12/08/2015 02:53 AM, Shyam wrote:
> >> Hi,
> >>
> >> Why not think along the lines of new FOPs like fop_compound(_cbk)
> >> where, the inargs to this FOP is a list of FOPs to execute (either in
> >> order or any order)?
> > That is the intent. The question is how do we specify the fops that we
> > want to do and the arguments to the fop. In this approach, for example
> > xl_fxattrop_writev() is a new FOP. List of fops that need to be done
> > are fxattrop, writev in that order and the arguments are a union of
> > the arguments needed to perform the fops fxattrop, writev. The reason
> > why this fop is not implemented throughout the graph is to not change
> > most of the stack on the brick side in the first cut of the
> > implementation. i.e. quota/barrier/geo-rep/io-threads
> > priorities/bit-rot may have to implement these new compound fops. We
> > still get the benefit of avoiding the network round trips.
> >>
> >> With a scheme like the above we could,
> >>  - compound any set of FOPs (of course, we need to take care here,
> >> but still the feasibility exists)
> > It still exists, but the fop space blows up with each new
> > combination.
> >>  - Each xlator can inspect the compound relation and choose to
> >> uncompound them. So if an xlator cannot perform FOPA+B as a single
> >> compound FOP, it can choose to send FOPA and then FOPB and chain up
> >> the responses back to the compound request sent to it. Also, the
> >> intention here would be to leverage existing FOP code in any xlator,
> >> to appropriately modify the inargs
> >>  - The RPC payload is constructed based on existing FOP RPC
> >> definitions, but compounded based on the compound FOP RPC definition
> > This will be done in phase-3 after learning a bit more about how best
> > to implement it to prevent stuffing arguments in xdata in future as
> > much as possible. After which we can choose to retire
> > compound-fop-sender and receiver xlators.
> >>
> >> Possibly on the brick graph as well, pass these down as compounded
> >> FOPs, till someone decides to break it open and do it in phases
> >> (ultimately POSIX xlator).
> > This will be done in phase-2. At the moment we are not giving any
> > choice for the xlators on the brick side.
> >>
> >> The intention would be to break a compound FOP in case an xlator in
> >> between cannot support it or, even expand a compound FOP request, say
> >> the fxattropAndWrite is an AFR compounding decision, but a compound
> >> request to AFR maybe WriteandClose, hence AFR needs to extend this
> >> compound request.
> > Yes. There was a discussion with krutika where if shard wants to do
> > write then xattrop in a single fop, then we need dht to implement
> > dht_writev_fxattrop which should look somewhat similar to
> > dht_writev(), and afr will need to implement afr_writev_fxattrop() as
> > full blown transaction where it needs to take data+metadata domain
> > locks then do data+metadata pre-op then wind to
> > compound_fop_sender_writev_fxattrop() and then data+metadata post-op
> > then unlocks.
> >
> > If we were to do writev, fxattrop separately, fops will be (In
> > unoptimized case):
> > 1) finodelk for write
> > 2) fxattrop for preop of write.
> > 3) write
> > 4) fxattrop for post op of write
> > 5) unlock for write
> > 6) finodelk for fxattrop
> > 7) fxattrop for preop of shard-fxattrop
> > 8) shard-fxattrop
> > 9) fxattrop for post op of shard fxattrop
> > 10) unlock for fxattrop
> >
> > If AFR chooses to implement writev_fxattrop: means data+metadata
> > transaction.
> > 1) finodelk in data, metadata domain simultaneously (just like we take
> > multiple locks in rename)
> > 2) preop for data, metadata parts as part of the compound fop
> > 3) writev+fxattrop
> > 4) postop for data, metadata parts as part of the compound fop
> > 5) unlocks simultaneously.
> >
> > So it is still 2x reduction of the number of network fops except for
> > may be locking.
> >>
> >> The above is just an off-the-cuff thought on the same.
> > We need to arrive at a consensus about how to specify the list of fops
> > and their arguments. The reason why I went against list_of_fops is to
> > make discovery of possible optimizations we can do easier per
> > compound fop (Inspired by ec's implementation of multiplications by
> > all possible elements in the Galois field, where multiplication with
> > different number has a different optimization). Could you elaborate
> > more about the idea you have about list_of_fops and its arguments? May
> > be we can come up with combinations of fops where we can employ this
> > technique of just list_of_fops and wind. I think rest of the solutions
> > you mentioned is where it will converge towards over time. Intention
> > is to avoid network round trips without waiting for the whole stack to
> > change as much as possible.
> Maybe I am overthinking it. Not a lot of combinations could be
> transactions. In any case do let me know what you have in mind.

Re: [Gluster-devel] compound fop design first cut

2015-12-07 Thread Pranith Kumar Karampuri



On 12/08/2015 09:02 AM, Pranith Kumar Karampuri wrote:



On 12/08/2015 02:53 AM, Shyam wrote:

Hi,

Why not think along the lines of new FOPs like fop_compound(_cbk) 
where, the inargs to this FOP is a list of FOPs to execute (either in 
order or any order)?
That is the intent. The question is how do we specify the fops that we 
want to do and the arguments to the fop. In this approach, for example 
xl_fxattrop_writev() is a new FOP. List of fops that need to be done 
are fxattrop, writev in that order and the arguments are a union of 
the arguments needed to perform the fops fxattrop, writev. The reason 
why this fop is not implemented throughout the graph is to not change 
most of the stack on the brick side in the first cut of the 
implementation. i.e. quota/barrier/geo-rep/io-threads 
priorities/bit-rot may have to implement these new compound fops. We 
still get the benefit of avoiding the network round trips.


With a scheme like the above we could,
 - compound any set of FOPs (of course, we need to take care here, 
but still the feasibility exists)
It still exists, but the fop space blows up with each new combination.
 - Each xlator can inspect the compound relation and choose to 
uncompound them. So if an xlator cannot perform FOPA+B as a single 
compound FOP, it can choose to send FOPA and then FOPB and chain up 
the responses back to the compound request sent to it. Also, the 
intention here would be to leverage existing FOP code in any xlator, 
to appropriately modify the inargs
 - The RPC payload is constructed based on existing FOP RPC 
definitions, but compounded based on the compound FOP RPC definition
This will be done in phase-3 after learning a bit more about how best 
to implement it to prevent stuffing arguments in xdata in future as 
much as possible. After which we can choose to retire 
compound-fop-sender and receiver xlators.


Possibly on the brick graph as well, pass these down as compounded 
FOPs, till someone decides to break it open and do it in phases 
(ultimately POSIX xlator).
This will be done in phase-2. At the moment we are not giving any 
choice for the xlators on the brick side.


The intention would be to break a compound FOP in case an xlator in 
between cannot support it or, even expand a compound FOP request, say 
the fxattropAndWrite is an AFR compounding decision, but a compound 
request to AFR maybe WriteandClose, hence AFR needs to extend this 
compound request.
Yes. There was a discussion with Krutika where if shard wants to do 
write then xattrop in a single fop, then we need dht to implement 
dht_writev_fxattrop which should look somewhat similar to 
dht_writev(), and afr will need to implement afr_writev_fxattrop() as 
full blown transaction where it needs to take data+metadata domain 
locks then do data+metadata pre-op then wind to 
compound_fop_sender_writev_fxattrop() and then data+metadata post-op 
then unlocks.


If we were to do writev, fxattrop separately, fops will be (In 
unoptimized case):

1) finodelk for write
2) fxattrop for preop of write.
3) write
4) fxattrop for post op of write
5) unlock for write
6) finodelk for fxattrop
7) fxattrop for preop of shard-fxattrop
8) shard-fxattrop
9) fxattrop for post op of shard fxattrop
10) unlock for fxattrop

If AFR chooses to implement writev_fxattrop: means data+metadata 
transaction.
1) finodelk in data, metadata domain simultaneously (just like we take 
multiple locks in rename)

2) preop for data, metadata parts as part of the compound fop
3) writev+fxattrop
4) postop for data, metadata parts as part of the compound fop
5) unlocks simultaneously.

So it is still 2x reduction of the number of network fops except for 
may be locking.


The above is just an off-the-cuff thought on the same.
We need to arrive at a consensus about how to specify the list of fops 
and their arguments. The reason why I went against list_of_fops is to 
make discovery of possible optimizations we can do easier per 
compound fop (Inspired by ec's implementation of multiplications by 
all possible elements in the Galois field, where multiplication with 
different number has a different optimization). Could you elaborate 
more about the idea you have about list_of_fops and its arguments? May 
be we can come up with combinations of fops where we can employ this 
technique of just list_of_fops and wind. I think rest of the solutions 
you mentioned is where it will converge towards over time. Intention 
is to avoid network round trips without waiting for the whole stack to 
change as much as possible.
Maybe I am overthinking it. Not a lot of combinations could be 
transactions. In any case do let me know what you have in mind.




Pranith


The scheme below seems too specific to my eyes, and looks like we 
would be defining specific compound FOPs rather than the ability to 
have generic ones.


On 12/07/2015 04:08 AM, Pranith Kumar Karampuri wrote:

hi,

Draft of the design doc:

Main motivation for the design of this feature is 

Re: [Gluster-devel] compound fop design first cut

2015-12-07 Thread Pranith Kumar Karampuri



On 12/08/2015 02:53 AM, Shyam wrote:

Hi,

Why not think along the lines of new FOPs like fop_compound(_cbk) 
where, the inargs to this FOP is a list of FOPs to execute (either in 
order or any order)?
That is the intent. The question is how do we specify the fops that we 
want to do and the arguments to the fop. In this approach, for example 
xl_fxattrop_writev() is a new FOP. List of fops that need to be done are 
fxattrop, writev in that order and the arguments are a union of the 
arguments needed to perform the fops fxattrop, writev. The reason why 
this fop is not implemented throughout the graph is to not change most 
of the stack on the brick side in the first cut of the implementation. 
i.e. quota/barrier/geo-rep/io-threads priorities/bit-rot may have to 
implement these new compound fops. We still get the benefit of avoiding 
the network round trips.


With a scheme like the above we could,
 - compound any set of FOPs (of course, we need to take care here, but 
still the feasibility exists)

It still exists, but the fop space blows up with each new combination.
 - Each xlator can inspect the compound relation and choose to 
uncompound them. So if an xlator cannot perform FOPA+B as a single 
compound FOP, it can choose to send FOPA and then FOPB and chain up 
the responses back to the compound request sent to it. Also, the 
intention here would be to leverage existing FOP code in any xlator, 
to appropriately modify the inargs
 - The RPC payload is constructed based on existing FOP RPC 
definitions, but compounded based on the compound FOP RPC definition
This will be done in phase-3 after learning a bit more about how best to 
implement it to prevent stuffing arguments in xdata in future as much as 
possible. After which we can choose to retire compound-fop-sender and 
receiver xlators.


Possibly on the brick graph as well, pass these down as compounded 
FOPs, till someone decides to break it open and do it in phases 
(ultimately POSIX xlator).
This will be done in phase-2. At the moment we are not giving any choice 
for the xlators on the brick side.


The intention would be to break a compound FOP in case an xlator in 
between cannot support it or, even expand a compound FOP request, say 
the fxattropAndWrite is an AFR compounding decision, but a compound 
request to AFR maybe WriteandClose, hence AFR needs to extend this 
compound request.
Yes. There was a discussion with Krutika where if shard wants to do 
write then xattrop in a single fop, then we need dht to implement 
dht_writev_fxattrop which should look somewhat similar to dht_writev(), 
and afr will need to implement afr_writev_fxattrop() as full blown 
transaction where it needs to take data+metadata domain locks then do 
data+metadata pre-op then wind to compound_fop_sender_writev_fxattrop() 
and then data+metadata post-op then unlocks.


If we were to do writev, fxattrop separately, fops will be (In 
unoptimized case):

1) finodelk for write
2) fxattrop for preop of write.
3) write
4) fxattrop for post op of write
5) unlock for write
6) finodelk for fxattrop
7) fxattrop for preop of shard-fxattrop
8) shard-fxattrop
9) fxattrop for post op of shard fxattrop
10) unlock for fxattrop

If AFR chooses to implement writev_fxattrop: means data+metadata 
transaction.
1) finodelk in data, metadata domain simultaneously (just like we take 
multiple locks in rename)

2) preop for data, metadata parts as part of the compound fop
3) writev+fxattrop
4) postop for data, metadata parts as part of the compound fop
5) unlocks simultaneously.

So it is still 2x reduction of the number of network fops except for may 
be locking.


The above is just an off-the-cuff thought on the same.
We need to arrive at a consensus about how to specify the list of fops 
and their arguments. The reason why I went against list_of_fops is to 
make discovery of possible optimizations we can do easier per compound 
fop (Inspired by ec's implementation of multiplications by all possible 
elements in the Galois field, where multiplication with different number 
has a different optimization). Could you elaborate more about the idea 
you have about list_of_fops and its arguments? May be we can come up 
with combinations of fops where we can employ this technique of just 
list_of_fops and wind. I think rest of the solutions you mentioned is 
where it will converge towards over time. Intention is to avoid network 
round trips without waiting for the whole stack to change as much as 
possible.


Pranith


The scheme below seems too specific to my eyes, and looks like we 
would be defining specific compound FOPs rather than the ability to 
have generic ones.


On 12/07/2015 04:08 AM, Pranith Kumar Karampuri wrote:

hi,

Draft of the design doc:

Main motivation for the design of this feature is to reduce network
round trips by sending more
than one fop in a network operation, preferably without introducing new
rpcs.

There are 2 new xlators: compound-fop-sender and compound-fop-receiver.

Re: [Gluster-devel] compound fop design first cut

2015-12-07 Thread Shyam

Hi,

Why not think along the lines of new FOPs like fop_compound(_cbk) where, 
the inargs to this FOP is a list of FOPs to execute (either in order or 
any order)?


With a scheme like the above we could,
 - compound any set of FOPs (of course, we need to take care here, but 
still the feasibility exists)
 - Each xlator can inspect the compound relation and choose to 
uncompound them. So if an xlator cannot perform FOPA+B as a single 
compound FOP, it can choose to send FOPA and then FOPB and chain up the 
responses back to the compound request sent to it. Also, the intention 
here would be to leverage existing FOP code in any xlator, to 
appropriately modify the inargs
 - The RPC payload is constructed based on existing FOP RPC 
definitions, but compounded based on the compound FOP RPC definition


Possibly on the brick graph as well, pass these down as compounded FOPs, 
till someone decides to break it open and do it in phases (ultimately 
POSIX xlator).


The intention would be to break a compound FOP in case an xlator in 
between cannot support it or, even expand a compound FOP request, say 
the fxattropAndWrite is an AFR compounding decision, but a compound 
request to AFR maybe WriteandClose, hence AFR needs to extend this 
compound request.


The above is just an off-the-cuff thought on the same.

The scheme below seems too specific to my eyes, and looks like we would 
be defining specific compound FOPs rather than the ability to have 
generic ones.


On 12/07/2015 04:08 AM, Pranith Kumar Karampuri wrote:

hi,

Draft of the design doc:

Main motivation for the design of this feature is to reduce network
round trips by sending more
than one fop in a network operation, preferably without introducing new
rpcs.

There are 2 new xlators: compound-fop-sender and compound-fop-receiver.
compound-fop-sender is going to be loaded on top of each client xlator on
the mount/client, and compound-fop-receiver is going to be loaded below
the server xlator on the bricks. On the mount/client side, from the
caller xlator down to the compound-fop-sender xlator, the xlators can
choose to implement this extra compound fop handling. Once the fop
reaches compound-fop-sender, it tries to choose a base fop, encodes the
other fop in the base fop's xdata, and winds the base fop to the client
xlator. The client xlator sends the base fop with the encoded xdata to
the server xlator on the brick using the rpc of the base fop. Once the
server xlator does resolve_and_resume(), it winds the base fop to the
compound-fop-receiver xlator, which decodes the extra fop from the base
fop's xdata. Based on the order encoded in the xdata, it executes the
separate fops one after the other and stores the cbk response arguments
of both operations. It then encodes the response of the extra fop into
the base fop's response xdata and unwinds the fop to the server xlator,
which sends the response using the base rpc's response structure. The
client xlator unwinds the base fop to compound-fop-sender, which decodes
the response into the compound fop's response arguments and unwinds to
the parent xlators.

I will take the fxattrop+write operation that we want to implement in
afr as an example to explain how things may look.

compound_fop_sender_fxattrop_write (call_frame_t *frame, xlator_t *this,
                                    fd_t *fd,
                                    gf_xattrop_flags_t xattrop_flags,
                                    dict_t *fxattrop_dict,
                                    dict_t *fxattrop_xdata,
                                    struct iovec *vector,
                                    int32_t count,
                                    off_t off,
                                    uint32_t write_flags,
                                    struct iobref *iobref,
                                    dict_t *writev_xdata)
{
        0) Remember the compound fop; take the base fop as writev().
           In writev_xdata add the following key/value pairs:
        1) "xattrop-flags" -> xattrop_flags
        2) for each fxattrop_dict key: "fxattrop-dict-<key>" -> value
        3) for each fxattrop_xdata key: "fxattrop-xdata-<key>" -> value
        4) "order" -> "fxattrop, writev"
        5) "compound-fops" -> "fxattrop"
        6) Wind writev()
}

compound_fop_sender_fxattrop_write_cbk (...)
{
        /* decode the response args and call parent_fxattrop_write_cbk() */
}

parent_fxattrop_write_cbk (call_frame_t *frame, void *cookie,
                           xlator_t *this,
                           int32_t fxattrop_op_ret,
                           int32_t fxattrop_op_errno,
                           dict_t *fxattrop_dict,
                           dict_t *fxattrop_xdata,
                           int32_t writev_op_ret,
                           int32_t writev_op_errno,
                           struct iatt *writev_prebuf,
                           struct iatt *writev_postbuf,
                           dict_t *writev_xdata)
{
        /* ... */
}

compound_fop_receiver_writev (call_frame_t *frame, xlator_t *this,
                              fd_t *fd,
                              struct iovec *vector,
                              int32_t count,
                              off_t off,
                              uint32_t flags,
                              struct iobref *iobref,
                              dict_t *writev_xdata)
{

[Gluster-devel] compound fop design first cut

2015-12-07 Thread Pranith Kumar Karampuri

hi,

Draft of the design doc:

The main motivation for this feature is to reduce network round trips by
sending more than one fop per network operation, preferably without
introducing new rpcs.


There are 2 new xlators: compound-fop-sender and compound-fop-receiver.
compound-fop-sender is loaded on top of each client xlator on the
mount/client, and compound-fop-receiver is loaded below the server xlator on
the bricks. On the mount/client side, the xlators between the caller xlator
and the compound-fop-sender xlator can choose to implement extra handling
for the compound fop. Once the fop reaches "compound-fop-sender", it tries
to choose a base fop onto which it encodes the other fop in the base fop's
xdata, and winds the base fop to the client xlator. The client xlator sends
the base fop with the encoded xdata to the server xlator on the brick using
the rpc of the base fop. Once the server xlator does resolve_and_resume(),
it winds the base fop to the compound-fop-receiver xlator, which decodes the
extra fop from the base fop's xdata. Based on the order encoded in the
xdata, it executes the separate fops one after the other and stores the cbk
response arguments of both operations. It then encodes the response of the
extra fop into the base fop's response xdata and unwinds the fop to the
server xlator, which sends the response using the base rpc's response
structure. The client xlator unwinds the base fop to compound-fop-sender,
which decodes the response into the compound fop's response arguments and
unwinds to the parent xlators.

I will take fxattrop+write, an operation we want to implement in afr, as an
example to explain how things may look.

compound_fop_sender_fxattrop_write (call_frame_t *frame, xlator_t *this,
                                    fd_t *fd,
                                    gf_xattrop_flags_t xattrop_flags,
                                    dict_t *fxattrop_dict,
                                    dict_t *fxattrop_xdata,
                                    struct iovec *vector,
                                    int32_t count,
                                    off_t off,
                                    uint32_t flags,
                                    struct iobref *iobref,
                                    dict_t *writev_xdata)
{
        0) Remember the compound fop and take write() as the base fop.
           In writev_xdata add the following key/value pairs:
        1) "xattrop-flags" -> xattrop_flags
        2) for each fxattrop_dict key -> "fxattrop-dict-<key>", value
        3) for each fxattrop_xdata key -> "fxattrop-xdata-<key>", value
        4) "order" -> "fxattrop, writev"
        5) "compound-fops" -> "fxattrop"
        6) Wind writev()
}

compound_fop_sender_fxattrop_write_cbk(...)
{
/*decode the response args and call parent_fxattrop_write_cbk*/
}

_fxattrop_write_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                     int32_t fxattrop_op_ret,
                     int32_t fxattrop_op_errno,
                     dict_t *fxattrop_dict,
                     dict_t *fxattrop_xdata,
                     int32_t writev_op_ret,
                     int32_t writev_op_errno,
                     struct iatt *writev_prebuf,
                     struct iatt *writev_postbuf,
                     dict_t *writev_xdata)
{
/**/
}

compound_fop_receiver_writev (call_frame_t *frame, xlator_t *this, fd_t *fd,
                              struct iovec *vector,
                              int32_t count,
                              off_t off,
                              uint32_t flags,
                              struct iobref *iobref,
                              dict_t *writev_xdata)
{
        0) Check if writev_xdata has "compound-fops", else default_writev()
        1) Decode writev_xdata per the encoding above -> "xattrop-flags",
           fxattrop_dict, fxattrop_xdata
        2) Get "order"
        3) Store all of the above in 'local'
        4) Wind fxattrop() with compound_receiver_fxattrop_cbk_writev_wind()
           as the cbk
}

compound_receiver_fxattrop_cbk_writev_wind (call_frame_t *frame, void *cookie,
                                            xlator_t *this, int32_t op_ret,
                                            int32_t op_errno, dict_t *dict,
                                            dict_t *xdata)
{
        0) Store the fxattrop cbk args
        1) Perform writev() with the stored writev params, with
           compound_receiver_writev_cbk() as the cbk
}

compound_receiver_writev_cbk (call_frame_t *frame, void *cookie,
                              xlator_t *this, int32_t op_ret,
                              int32_t op_errno, struct iatt *prebuf,
                              struct iatt *postbuf, dict_t *xdata)
{
        0) Store the writev cbk args
        1) Encode the fxattrop response into writev_xdata with encoding
           similar to that in compound_fop_sender_fxattrop_write()
        2) Unwind writev()
}

This example just shows how things may look; the actual implementation may
instead have all base fops call a common function that performs the
operations in the order given in the receiver xl. Yet to think about that.
It is probably better to encode the fop number from glusterfs_fop_t rather
than the fop string in the dictionary.


This is phase-1 of the change because we don't wa