Re: [Gluster-devel] logs/cores for smoke failures

2016-09-26 Thread Pranith Kumar Karampuri
On Tue, Sep 27, 2016 at 11:20 AM, Nigel Babu  wrote:

> These are dbench failures rather than smoke failures. If you know how to
> debug dbench failures, please add comments on the bug and I'll get you the
> logs you need.
>

Oh, we can't archive the logs like we do for regression runs?


>
> On Tue, Sep 27, 2016 at 9:40 AM, Ravishankar N 
> wrote:
>
>> On 09/27/2016 09:36 AM, Pranith Kumar Karampuri wrote:
>>
>> hi Nigel,
>>   Is there already a bug to capture these in the runs when failures
>> happen? I am not able to understand why this failure happened:
>> https://build.gluster.org/job/smoke/30843/console, logs/cores would have
>> helped. Let me know if I should raise a bug for this.
>>
>> I raised one y'day: https://bugzilla.redhat.com/show_bug.cgi?id=1379228
>> -Ravi
>>
>>
>> --
>> Pranith
>>
>>
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>
>>
>>
>
>
> --
> nigelb
>



-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] libgfapi zero copy write - application in samba, nfs-ganesha

2016-09-26 Thread Ric Wheeler

On 09/27/2016 08:53 AM, Raghavendra Gowdappa wrote:


- Original Message -

From: "Ric Wheeler" 
To: "Raghavendra Gowdappa" , "Saravanakumar Arumugam" 

Cc: "Gluster Devel" , "Ben Turner" , 
"Ben England"

Sent: Tuesday, September 27, 2016 10:51:48 AM
Subject: Re: [Gluster-devel] libgfapi zero copy write - application in samba, 
nfs-ganesha

On 09/27/2016 07:56 AM, Raghavendra Gowdappa wrote:

+Manoj, +Ben turner, +Ben England.

@Perf-team,

Do you think the gains are significant enough that the smb and nfs-ganesha
teams can start thinking about consuming this change?

regards,
Raghavendra

This is a large gain but I think that we might see even larger gains (a lot
depends on how we implement copy offload :)).

Can you elaborate on what you mean by "copy offload"? If it is the way we avoid a
copy in gfapi (from the application buffer), the following is the workflow:



Work flow of zero copy write operation:
--

1) Application requests a buffer of specific size. A new buffer is
allocated from iobuf pool, and this buffer is passed on to application.
Achieved using "glfs_get_buffer"

2) Application writes into the received buffer, and passes that to
libgfapi, and libgfapi in turn passes the same buffer to underlying
translators. This avoids a memcpy in glfs write
Achieved using "glfs_zero_write"

3) Once the write operation is complete, the application must take the
responsibility of freeing the buffer.
Achieved using "glfs_free_buffer"



Do you have any suggestions/improvements on this? I think Shyam mentioned an
alternative approach (for zero-copy readv, I think); let me look that up too.

regards,
Raghavendra


Both NFS and SMB support a copy offload that allows a client to produce a new 
copy of a file without bringing data over the wire. Both, if I remember 
correctly, do a ranged copy within a file.


The key here is that since the data does not move over the wire from server to 
client, we can shift the performance bottleneck to the storage server.


If we have a slow (1GbE) link between client and server, we should be able to do
that copy as if it happened just on the server itself. For a single NFS server
(not a clustered, scale-out server), that usually means we are as fast as the
local file system copy.


Note that there are also servers that simply "reflink" that file, so we have a 
very small amount of time needed on the server to produce that copy.  This can 
be a huge win for say a copy of a virtual machine guest image.
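
(As an aside, a minimal sketch of what such a reflink-style clone looks like
from user space on Linux, assuming a file system that supports the FICLONE
ioctl, e.g. Btrfs or XFS with reflink; this is illustrative only and not
GlusterFS code:)

    #include <fcntl.h>
    #include <linux/fs.h>      /* FICLONE, available with kernel >= 4.5 headers */
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    /* Clone src into dst by sharing extents; no file data is read or
     * written by the application. */
    int clone_file(const char *src, const char *dst)
    {
        int in = open(src, O_RDONLY);
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        int ret = -1;

        if (in >= 0 && out >= 0) {
            ret = ioctl(out, FICLONE, in);
            if (ret < 0)
                perror("ioctl(FICLONE)");
        } else {
            perror("open");
        }
        if (in >= 0)
            close(in);
        if (out >= 0)
            close(out);
        return ret;
    }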


Gluster and other distributed servers won't benefit as much as a local server
would, I suspect, because of the need to do things internally over our networks
between storage server nodes.


Hope that makes my thoughts clearer?

Here is a link to a brief overview of the new Linux system call:

https://kernelnewbies.org/Linux_4.5#head-6df3d298d8e0afa8e85e1125cc54d5f13b9a0d8c

Note that block devices or pseudo devices can also implement a copy offload.
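
(For illustration, a minimal client-side sketch of using that system call; it
is invoked through syscall(2) since glibc did not yet ship a wrapper at the
time, and the loop and error handling here are assumptions, not code from any
Gluster component:)

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Copy src to dst without moving the data through user space; on
     * file systems/servers that support it, the copy is performed by
     * the kernel or offloaded to the server. */
    int offload_copy(const char *src, const char *dst)
    {
        struct stat st;
        int ret = -1;
        int in = open(src, O_RDONLY);
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (in >= 0 && out >= 0 && fstat(in, &st) == 0) {
            off_t remaining = st.st_size;
            while (remaining > 0) {
                /* NULL offsets: the kernel uses and advances both file offsets. */
                ssize_t n = syscall(__NR_copy_file_range, in, NULL, out, NULL,
                                    remaining, 0);
                if (n <= 0) {
                    if (n < 0)
                        perror("copy_file_range");
                    break;
                }
                remaining -= n;
            }
            if (remaining == 0)
                ret = 0;
        } else {
            perror("open/fstat");
        }
        if (in >= 0)
            close(in);
        if (out >= 0)
            close(out);
        return ret;
    }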

Regards,

Ric




Worth looking at how we can make use of it.

thanks!

Ric


- Original Message -

From: "Saravanakumar Arumugam" 
To: "Gluster Devel" 
Sent: Monday, September 26, 2016 7:18:26 PM
Subject: [Gluster-devel] libgfapi zero copy write - application in samba,
nfs-ganesha

Hi,

I have carried out "basic" performance measurement with zero copy write
APIs.
Throughput of zero copy write is 57 MB/sec vs default write 43 MB/sec.
( I have modified Ben England's gfapi_perf_test.c for this. Attached the
same
for reference )

We would like to hear how Samba and NFS-Ganesha, which are libgfapi users, can
make use of this.
Please provide your comments. Refer attached results.

Zero copy in write patch: http://review.gluster.org/#/c/14784/

Thanks,
Saravana



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] libgfapi zero copy write - application in samba, nfs-ganesha

2016-09-26 Thread Raghavendra G
+sachin.

On Tue, Sep 27, 2016 at 11:23 AM, Raghavendra Gowdappa 
wrote:

>
>
> - Original Message -
> > From: "Ric Wheeler" 
> > To: "Raghavendra Gowdappa" , "Saravanakumar
> Arumugam" 
> > Cc: "Gluster Devel" , "Ben Turner" <
> btur...@redhat.com>, "Ben England"
> > 
> > Sent: Tuesday, September 27, 2016 10:51:48 AM
> > Subject: Re: [Gluster-devel] libgfapi zero copy write - application in
> samba, nfs-ganesha
> >
> > On 09/27/2016 07:56 AM, Raghavendra Gowdappa wrote:
> > > +Manoj, +Ben turner, +Ben England.
> > >
> > > @Perf-team,
> > >
> > > Do you think the gains are significant enough, so that smb and
> nfs-ganesha
> > > team can start thinking about consuming this change?
> > >
> > > regards,
> > > Raghavendra
> >
> > This is a large gain but I think that we might see even larger gains (a
> lot
> > depends on how we implement copy offload :)).
>
> Can you elaborate on what you mean "copy offload"? If it is the way we
> avoid a copy in gfapi (from application buffer), following is the workflow:
>
> 
>
> Work flow of zero copy write operation:
> --
>
> 1) Application requests a buffer of specific size. A new buffer is
> allocated from iobuf pool, and this buffer is passed on to application.
>Achieved using "glfs_get_buffer"
>
> 2) Application writes into the received buffer, and passes that to
> libgfapi, and libgfapi in turn passes the same buffer to underlying
> translators. This avoids a memcpy in glfs write
>Achieved using "glfs_zero_write"
>
> 3) Once the write operation is complete, Application must take the
> responsibility of freeing the buffer.
>Achieved using "glfs_free_buffer"
>
> 
>
> Do you've any suggestions/improvements on this? I think Shyam mentioned an
> alternative approach (for zero-copy readv I think), let me look up at that
> too.
>
> regards,
> Raghavendra
>
> >
> > Worth looking at how we can make use of it.
> >
> > thanks!
> >
> > Ric
> >
> > >
> > > - Original Message -
> > >> From: "Saravanakumar Arumugam" 
> > >> To: "Gluster Devel" 
> > >> Sent: Monday, September 26, 2016 7:18:26 PM
> > >> Subject: [Gluster-devel] libgfapi zero copy write - application in
> samba,
> > >>nfs-ganesha
> > >>
> > >> Hi,
> > >>
> > >> I have carried out "basic" performance measurement with zero copy
> write
> > >> APIs.
> > >> Throughput of zero copy write is 57 MB/sec vs default write 43 MB/sec.
> > >> ( I have modified Ben England's gfapi_perf_test.c for this. Attached
> the
> > >> same
> > >> for reference )
> > >>
> > >> We would like to hear how samba/ nfs-ganesha who are libgfapi users
> can
> > >> make
> > >> use of this.
> > >> Please provide your comments. Refer attached results.
> > >>
> > >> Zero copy in write patch: http://review.gluster.org/#/c/14784/
> > >>
> > >> Thanks,
> > >> Saravana
> >
> >
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] libgfapi zero copy write - application in samba, nfs-ganesha

2016-09-26 Thread Raghavendra Gowdappa


- Original Message -
> From: "Ric Wheeler" 
> To: "Raghavendra Gowdappa" , "Saravanakumar Arumugam" 
> 
> Cc: "Gluster Devel" , "Ben Turner" 
> , "Ben England"
> 
> Sent: Tuesday, September 27, 2016 10:51:48 AM
> Subject: Re: [Gluster-devel] libgfapi zero copy write - application in samba, 
> nfs-ganesha
> 
> On 09/27/2016 07:56 AM, Raghavendra Gowdappa wrote:
> > +Manoj, +Ben turner, +Ben England.
> >
> > @Perf-team,
> >
> > Do you think the gains are significant enough, so that smb and nfs-ganesha
> > team can start thinking about consuming this change?
> >
> > regards,
> > Raghavendra
> 
> This is a large gain but I think that we might see even larger gains (a lot
> depends on how we implement copy offload :)).

Can you elaborate on what you mean by "copy offload"? If it is the way we avoid a
copy in gfapi (from the application buffer), the following is the workflow:



Work flow of zero copy write operation:
--

1) Application requests a buffer of specific size. A new buffer is
allocated from iobuf pool, and this buffer is passed on to application.
   Achieved using "glfs_get_buffer"

2) Application writes into the received buffer, and passes that to
libgfapi, and libgfapi in turn passes the same buffer to underlying
translators. This avoids a memcpy in glfs write
   Achieved using "glfs_zero_write"

3) Once the write operation is complete, the application must take the
responsibility of freeing the buffer.
   Achieved using "glfs_free_buffer"



Do you have any suggestions/improvements on this? I think Shyam mentioned an
alternative approach (for zero-copy readv, I think); let me look that up too.

regards,
Raghavendra

> 
> Worth looking at how we can make use of it.
> 
> thanks!
> 
> Ric
> 
> >
> > - Original Message -
> >> From: "Saravanakumar Arumugam" 
> >> To: "Gluster Devel" 
> >> Sent: Monday, September 26, 2016 7:18:26 PM
> >> Subject: [Gluster-devel] libgfapi zero copy write - application in samba,
> >>nfs-ganesha
> >>
> >> Hi,
> >>
> >> I have carried out "basic" performance measurement with zero copy write
> >> APIs.
> >> Throughput of zero copy write is 57 MB/sec vs default write 43 MB/sec.
> >> ( I have modified Ben England's gfapi_perf_test.c for this. Attached the
> >> same
> >> for reference )
> >>
> >> We would like to hear how samba/ nfs-ganesha who are libgfapi users can
> >> make
> >> use of this.
> >> Please provide your comments. Refer attached results.
> >>
> >> Zero copy in write patch: http://review.gluster.org/#/c/14784/
> >>
> >> Thanks,
> >> Saravana
> 
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] logs/cores for smoke failures

2016-09-26 Thread Nigel Babu
These are dbench failures rather than smoke failures. If you know how to
debug dbench failures, please add comments on the bug and I'll get you the
logs you need.

On Tue, Sep 27, 2016 at 9:40 AM, Ravishankar N 
wrote:

> On 09/27/2016 09:36 AM, Pranith Kumar Karampuri wrote:
>
> hi Nigel,
>   Is there already a bug to capture these in the runs when failures
> happen? I am not able to understand why this failure happened:
> https://build.gluster.org/job/smoke/30843/console, logs/cores would have
> helped. Let me know if I should raise a bug for this.
>
> I raised one y'day: https://bugzilla.redhat.com/show_bug.cgi?id=1379228
> -Ravi
>
>
> --
> Pranith
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>
>
>


-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] libgfapi zero copy write - application in samba, nfs-ganesha

2016-09-26 Thread Ric Wheeler

On 09/27/2016 07:56 AM, Raghavendra Gowdappa wrote:

+Manoj, +Ben turner, +Ben England.

@Perf-team,

Do you think the gains are significant enough that the smb and nfs-ganesha teams
can start thinking about consuming this change?

regards,
Raghavendra


This is a large gain but I think that we might see even larger gains (a lot 
depends on how we implement copy offload :)).


Worth looking at how we can make use of it.

thanks!

Ric



- Original Message -

From: "Saravanakumar Arumugam" 
To: "Gluster Devel" 
Sent: Monday, September 26, 2016 7:18:26 PM
Subject: [Gluster-devel] libgfapi zero copy write - application in samba,   
nfs-ganesha

Hi,

I have carried out "basic" performance measurement with zero copy write APIs.
Throughput of zero copy write is 57 MB/sec vs default write 43 MB/sec.
( I have modified Ben England's gfapi_perf_test.c for this. Attached the same
for reference )

We would like to hear how Samba and NFS-Ganesha, which are libgfapi users, can
make use of this.
Please provide your comments. Refer attached results.

Zero copy in write patch: http://review.gluster.org/#/c/14784/

Thanks,
Saravana


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] libgfapi zero copy write - application in samba, nfs-ganesha

2016-09-26 Thread Raghavendra Gowdappa
+Manoj, +Ben turner, +Ben England.

@Perf-team,

Do you think the gains are significant enough that the smb and nfs-ganesha teams
can start thinking about consuming this change?

regards,
Raghavendra

- Original Message -
> From: "Saravanakumar Arumugam" 
> To: "Gluster Devel" 
> Sent: Monday, September 26, 2016 7:18:26 PM
> Subject: [Gluster-devel] libgfapi zero copy write - application in samba, 
> nfs-ganesha
> 
> Hi,
> 
> I have carried out "basic" performance measurement with zero copy write APIs.
> Throughput of zero copy write is 57 MB/sec vs default write 43 MB/sec.
> ( I have modified Ben England's gfapi_perf_test.c for this. Attached the same
> for reference )
> 
> We would like to hear how samba/ nfs-ganesha who are libgfapi users can make
> use of this.
> Please provide your comments. Refer attached results.
> 
> Zero copy in write patch: http://review.gluster.org/#/c/14784/
> 
> Thanks,
> Saravana
> 
> 
> 
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] logs/cores for smoke failures

2016-09-26 Thread Ravishankar N

On 09/27/2016 09:36 AM, Pranith Kumar Karampuri wrote:

hi Nigel,
  Is there already a bug to capture these in the runs when 
failures happen? I am not able to understand why this failure 
happened: https://build.gluster.org/job/smoke/30843/console, 
logs/cores would have helped. Let me know if I should raise a bug for 
this.

I raised one y'day: https://bugzilla.redhat.com/show_bug.cgi?id=1379228
-Ravi


--
Pranith


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] logs/cores for smoke failures

2016-09-26 Thread Pranith Kumar Karampuri
hi Nigel,
  Is there already a bug to capture these in the runs when failures
happen? I am not able to understand why this failure happened:
https://build.gluster.org/job/smoke/30843/console, logs/cores would have
helped. Let me know if I should raise a bug for this.

-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Fixes for spurious failures in open-behind.t

2016-09-26 Thread Pranith Kumar Karampuri
hi,
I found the following two issues and fixed them:

Problems:
1) flush-behind is on by default, so just because a write completes doesn't
   mean it is on the disk; it could still be in write-behind's cache. This
   leads to failures where, if you write from one mount and expect the data
   to be there on the other mount, sometimes it won't be there.
2) Sometimes the graph switch is not complete by the time we issue the read,
   which leads to opens not being sent on the brick, leading to failures.

Fixes:
1) Disable flush-behind.
2) Add new functions to check that the new graph is in place and connected
   to the bricks before 'cat' is executed.

Check bz: 1379511 for more info.

Please let me know if you still face any failures after this. I removed it
from the list of bad tests.

-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] relative ordering of writes to same file from two different fds

2016-09-26 Thread Ryan Ding

> On Sep 23, 2016, at 8:59 PM, Jeff Darcy  wrote:
> 
>>> write-behind: implement causal ordering and other cleanup
>> 
>>> Rules of causal ordering implemented:
>> 
>>> - If request A arrives after the acknowledgement (to the app,
>> 
>>> i.e, STACK_UNWIND) of another request B, then request B is
>> 
>>> said to have 'caused' request A.
>> 
>> 
>> With the above principle, two write requests (p1 and p2 in example above)
>> issued by _two different threads/processes_ there need _not always_ be a
>> 'causal' relationship (whether there is a causal relationship is purely
>> based on the "chance" that write-behind chose to ack one/both of them and
>> their timing of arrival).
> 
> I think this is an issue of terminology.  While it's not *certain* that B
> (or p1) caused A (or p2), it's *possible*.  Contrast with the case where
> they overlap, which could not possibly happen if the application were
> trying to ensure order.  In the distributed-system literature, this is
> often referred to as a causal relationship even though it's really just
> the possibility of one, because in most cases even the possibility means
> that reordering would be unacceptable.
> 
>> So, current write-behind is agnostic to the
>> ordering of p1 and p2 (when done by two threads).
>> 
>> However if p1 and p2 are issued by same thread there is _always_ a causal
>> relationship (p2 being caused by p1).
> 
> See above.  If we feel bound to respect causal relationships, we have to
> be pessimistic and assume that wherever such a relationship *could* exist
> it *does* exist.  However, as I explained in my previous message, I don't
> think it's practical to provide such a guarantee across multiple clients,
> and if we don't provide it across multiple clients then it's not worth
> much to provide it on a single client.  Applications that require such
> strict ordering shouldn't use write-behind, or should explicitly flush
> between writes.  Otherwise they'll break unexpectedly when parts are
> distributed across multiple nodes.  Assuming that everything runs on one
> node is the same mistake POSIX makes.  The assumption was appropriate
> for an earlier era, but not now for a decade or more.

We can separate this into two questions:
1. Should there be a causal relationship for a local application?
2. Should there be a causal relationship for a distributed application?
I think the answer to #2 is 'NO'. This is an issue that the distributed
application should resolve, either by using the distributed locks we provide
or in its own way (fsync is required in such a condition).
I think the answer to #1 is 'YES', because buffered I/O should not introduce
data-consistency problems that unbuffered I/O does not have; it is very common
for a local application to assume the underlying file system behaves that way.
Furthermore, staying compatible with the Linux page cache will always be the
better practice, because a lot of local applications already rely on its
semantics.

Thanks,
Ryan
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] libgfapi zero copy write - application in samba, nfs-ganesha

2016-09-26 Thread Saravanakumar Arumugam

Hi,

I have carried out "basic" performance measurement with zero copy write 
APIs.

Throughput of zero copy write is 57 MB/sec vs default write 43 MB/sec.
( I have modified Ben England's gfapi_perf_test.c for this. Attached
the same for reference )


We would like to hear how Samba and NFS-Ganesha, which are libgfapi users, can
make use of this.

Please provide your comments. Refer attached results.

Zero copy in write patch: http://review.gluster.org/#/c/14784/

Thanks,
Saravana


ZERO COPY write:

[root@gfvm3 parallel-libgfapi]# sync; echo 3 > /proc/sys/vm/drop_caches 
[root@gfvm3 parallel-libgfapi]# 
[root@gfvm3 parallel-libgfapi]# 
[root@gfvm3 parallel-libgfapi]# DEBUG=0  GFAPI_VOLNAME=tv1 GFAPI_FSZ=1048576  
GFAPI_FILES=3   GFAPI_HOSTNAME=gfvm3 GFAPI_BASEDIR=gluster_tmp 
./gfapi_perf_test
prm.debug: 0
GLUSTER: 
  volume=tv1
  transport=tcp
  host=gfvm3
  port=24007
  fuse?No
  trace level=0
  start timeout=60
WORKLOAD:
  type = seq-wr 
  threads/proc = 1
  base directory = gluster_tmp
  prefix=f
  file size = 1048576 KB
  file count = 3
  record size = 64 KB
  files/dir=1000
  fsync-at-close? No 
zero copy write
zero copy write
zero copy write
thread   0:   files written = 3
  files done = 3
  I/O (record) transfers = 49152
  total bytes = 3221225472
  elapsed time= 53.74 sec
  throughput  = 57.16 MB/sec
  IOPS= 914.58(sequential write)
aggregate:   files written = 3
  files done = 3
  I/O (record) transfers = 49152
  total bytes = 3221225472
  elapsed time= 53.74 sec
  throughput  = 57.16 MB/sec
  IOPS= 914.58(sequential write)
[root@gfvm3 parallel-libgfapi]# 

Default write: 

[root@gfvm3 parallel-libgfapi]# sync; echo 3 > /proc/sys/vm/drop_caches 

[root@gfvm3 parallel-libgfapi]# DEBUG=0  GFAPI_VOLNAME=tv1 GFAPI_FSZ=1048576  
GFAPI_FILES=3   GFAPI_HOSTNAME=gfvm3 GFAPI_BASEDIR=gluster_tmp 
./gfapi_perf_test
prm.debug: 0
GLUSTER: 
  volume=tv1
  transport=tcp
  host=gfvm3
  port=24007
  fuse?No
  trace level=0
  start timeout=60
WORKLOAD:
  type = seq-wr 
  threads/proc = 1
  base directory = gluster_tmp
  prefix=f
  file size = 1048576 KB
  file count = 3
  record size = 64 KB
  files/dir=1000
  fsync-at-close? No 
thread   0:   files written = 3
  files done = 3
  I/O (record) transfers = 49152
  total bytes = 3221225472
  elapsed time= 70.00 sec
  throughput  = 43.89 MB/sec
  IOPS= 702.19(sequential write)
aggregate:   files written = 3
  files done = 3
  I/O (record) transfers = 49152
  total bytes = 3221225472
  elapsed time= 70.00 sec
  throughput  = 43.89 MB/sec
  IOPS= 702.19(sequential write)
[root@gfvm3 parallel-libgfapi]# 



/*
 * gfapi_perf_test.c - single-thread test of Gluster libgfapi file perf, 
enhanced to do small files also
 *
 * install the glusterfs-api RPM before trying to compile and link
 *
 * to compile: gcc -pthread -g -O0  -Wall --pedantic -o gfapi_perf_test -I 
/usr/include/glusterfs/api gfapi_perf_test.c  -lgfapi -lrt
 *
 * environment variables used as inputs, see usage() below
 *
 * NOTE: we allow random workloads to process a fraction of the entire file
 * this allows us to generate a file that will not fit in cache 
 * we then can do random I/O on a fraction of the data in that file, unlike 
iozone
 */

#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include "glfs.h"

#define NOTOK 1 /* process exit status indicates error of some sort */
#define OK 0/* system call or process exit status indicating success */
#define KB_PER_MB 1024
#define BYTES_PER_KB 1024
#define BYTES_PER_MB (1024*1024)
#define KB_PER_MB 1024
#define NSEC_PER_SEC 1000000000.0
#define UINT64DFMT "%ld"

/* power of 2 corresponding to 4096-byte page boundary, used in memalign() call 
*/
#define PAGE_BOUNDARY 12 

#define FOREACH(_index, _count) for(_index=0; _index < (_count); _index++)

/* last array element of workload_types must be NULL */
static const char * workload_types[] = 
   { "seq-wr", "seq-rd", "rnd-wr", "rnd-rd", "unlink", "seq-rdwrmix", NULL };
static const char * workload_description[] = 
   { "sequential write", "sequential read", "random write", "random read", 
"delete", "sequential read-write mix", NULL };
/* define numeric workload types as indexes into preceding array */
#define WL_SEQWR 0
#define WL_SEQRD 1
#define WL_RNDWR 2
#define WL_RNDRD 3
#define WL_DELETE 4
#define WL_SEQRDWRMIX 5

static glfs_t * glfs_p = NULL;

/* shared parameter values common to all threads */

struct gfapi_prm {
  int threads_per_proc;/* threads spawned within each process */
  char * workload_str; /* name of workload to run */
  int workload_type;   /* post-parse numeric code for workload - 
contains WL_something */
  unsigned usec_delay_per_file;/* microseconds of dela

Re: [Gluster-devel] [Gluster-infra] Migration complete

2016-09-26 Thread Jeff Darcy
> Michael and I are happy to announce that the migration is now complete.

Thank you both for all of your hard work.  :)
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Fixing setfsuid/gid problems in posix xlator

2016-09-26 Thread Pranith Kumar Karampuri
On Mon, Sep 26, 2016 at 4:49 PM, Niels de Vos  wrote:

> On Fri, Sep 23, 2016 at 08:44:14PM +0530, Pranith Kumar Karampuri wrote:
> > On Fri, Sep 23, 2016 at 6:12 PM, Jeff Darcy  wrote:
> >
> > > > Jiffin found an interesting problem in posix xlator where we have
> never
> > > been
> > > > using setfsuid/gid ( http://review.gluster.org/#/c/15545/ ), what I
> am
> > > > seeing regressions after this is, if the files are created using
> non-root
> > > > user then the file creation fails because that user doesn't have
> > > permissions
> > > > to create the gfid-link. So it seems like the correct way forward for
> > > this
> > > > patch is to write wrappers around sys_ to do setfsuid/gid
> do the
> > > > actual operation requested and then set it back to old uid/gid and
> then
> > > do
> > > > the internal operations. I am planning to write
> posix_sys_() to
> > > do
> > > > the same, may be a macro?
> > >
> > > Kind of an aside, but I'd prefer to see a lot fewer macros in our code.
> > > They're not type-safe, and multi-line macros often mess up line
> numbers for
> > > debugging or error messages.  IMO it's better to use functions whenever
> > > possible, and usually to let the compiler worry about how/when to
> inline.
> > >
> > > > I need inputs from you guys to let me know if I am on the right path
> and
> > > if
> > > > you see any issues with this approach.
> > >
> > > I think there's a bit of an interface problem here.  The sys_xxx
> wrappers
> > > don't have arguments that point to the current frame, so how would
> they get
> > > the correct uid/gid?  We could add arguments to each function, but then
> > > we'd have to modify every call.  This includes internal calls which
> don't
> > > have a frame to pass, so I guess they'd have to pass NULL.
> Alternatively,
> > > we could create a parallel set of functions with frame pointers.
> Contrary
> > > to what I just said above, this might be a case where macros make
> sense:
> > >
> > >int
> > >sys_writev_fp (call_frame_t *frame, int fd, void *buf, size_t len)
> > >{
> > >   if (frame) { setfsuid(...) ... }
> > >   int ret = writev (fd, buf, len);
> > >   if (frame) { setfsuid(...) ... }
> > >   return ret;
> > >}
> > >#define sys_writev(fd,buf,len) sys_writev_fp (NULL, fd, buf, len)
> > >
> > > That way existing callers don't have to change, but posix can use the
> > > extended versions to get the right setfsuid behavior.
> > >
> > >
> > After trying to do these modifications to test things out, I am now under
> > the impression to remove setfsuid/gid altogether and depend on posix-acl
> > for permission checks. It seems too cumbersome as the operations more
> often
> > than not happen on files inside .glusterfs and non-root users/groups
> don't
> > have permissions at all to access files in that directory.
>
> But the files under .glusterfs are hardlinks. Except for creation and
> removal, should the users not have access to read/write and update
> attributes and xattrs?
>
> I would prefer to rely on the VFS permission checking on the bricks, and
> not bother with the posix-acl xlator when the filesystem on the brick
> supports POSIX ACLs.
>

Could you list down the pros/cons with each approach?


>
> Niels
>



-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] relative ordering of writes to same file from two different fds

2016-09-26 Thread Jeff Darcy
> staying compatible with the Linux page cache will always be the better
> practice, because a lot of local applications already rely on its semantics.

I don't think users even *know* how the page cache behaves.  I don't
think even its developers do, in the sense of being able to define it in
sufficient detail for formal verification.  Instead certain cases are
intentionally left undefined - the "odd behavior" Ric mentioned - and
can change any time the implementation does.  What users have are
expectations of things that are guaranteed and things that are not, and
the wiser ones know to stay away from things in the second set even if
they appear to work most of the time.

As Raghavendra Talur points out, we already seem to provide normal
linearizability across file descriptors on a single client.  There
doesn't seem to be much reason to change that.  However, I still
maintain that there's little value to the user in trying to satisfy
stricter POSIX guarantees or closer approximations of whatever the
page cache is doing that day.  That's especially true since we run on
multiple operating systems which almost certainly have different
behavior in some of the edge cases.
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Fixing setfsuid/gid problems in posix xlator

2016-09-26 Thread Niels de Vos
On Fri, Sep 23, 2016 at 08:44:14PM +0530, Pranith Kumar Karampuri wrote:
> On Fri, Sep 23, 2016 at 6:12 PM, Jeff Darcy  wrote:
> 
> > > Jiffin found an interesting problem in posix xlator where we have never
> > been
> > > using setfsuid/gid ( http://review.gluster.org/#/c/15545/ ), what I am
> > > seeing regressions after this is, if the files are created using non-root
> > > user then the file creation fails because that user doesn't have
> > permissions
> > > to create the gfid-link. So it seems like the correct way forward for
> > this
> > > patch is to write wrappers around sys_ to do setfsuid/gid do the
> > > actual operation requested and then set it back to old uid/gid and then
> > do
> > > the internal operations. I am planning to write posix_sys_() to
> > do
> > > the same, may be a macro?
> >
> > Kind of an aside, but I'd prefer to see a lot fewer macros in our code.
> > They're not type-safe, and multi-line macros often mess up line numbers for
> > debugging or error messages.  IMO it's better to use functions whenever
> > possible, and usually to let the compiler worry about how/when to inline.
> >
> > > I need inputs from you guys to let me know if I am on the right path and
> > if
> > > you see any issues with this approach.
> >
> > I think there's a bit of an interface problem here.  The sys_xxx wrappers
> > don't have arguments that point to the current frame, so how would they get
> > the correct uid/gid?  We could add arguments to each function, but then
> > we'd have to modify every call.  This includes internal calls which don't
> > have a frame to pass, so I guess they'd have to pass NULL.  Alternatively,
> > we could create a parallel set of functions with frame pointers.  Contrary
> > to what I just said above, this might be a case where macros make sense:
> >
> >int
> >sys_writev_fp (call_frame_t *frame, int fd, void *buf, size_t len)
> >{
> >   if (frame) { setfsuid(...) ... }
> >   int ret = writev (fd, buf, len);
> >   if (frame) { setfsuid(...) ... }
> >   return ret;
> >}
> >#define sys_writev(fd,buf,len) sys_writev_fp (NULL, fd, buf, len)
> >
> > That way existing callers don't have to change, but posix can use the
> > extended versions to get the right setfsuid behavior.
> >
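
(For reference, a fuller self-contained sketch of that wrapper idea; the
frame->root->uid/gid fields exist in the GlusterFS call stack, but this exact
wrapper is illustrative only and not the code under review:)

    #include <sys/fsuid.h>
    #include <sys/uio.h>
    #include <unistd.h>
    #include "stack.h"          /* call_frame_t, frame->root->uid/gid */

    /* Illustration only: impersonate the caller recorded in the frame
     * around the syscall, then restore the previous fs ids. */
    static ssize_t
    posix_sys_writev_fp (call_frame_t *frame, int fd,
                         const struct iovec *iov, int iovcnt)
    {
        uid_t saved_uid = 0;
        gid_t saved_gid = 0;
        ssize_t ret;

        if (frame) {
            /* setfsuid/setfsgid return the previous value. */
            saved_uid = setfsuid (frame->root->uid);
            saved_gid = setfsgid (frame->root->gid);
        }

        ret = writev (fd, iov, iovcnt);

        if (frame) {
            setfsuid (saved_uid);
            setfsgid (saved_gid);
        }
        return ret;
    }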
> >
> After trying to do these modifications to test things out, I am now under
> the impression to remove setfsuid/gid altogether and depend on posix-acl
> for permission checks. It seems too cumbersome as the operations more often
> than not happen on files inside .glusterfs and non-root users/groups don't
> have permissions at all to access files in that directory.

But the files under .glusterfs are hardlinks. Except for creation and
removal, should the users not have access to read/write and update
attributes and xattrs?

I would prefer to rely on the VFS permission checking on the bricks, and
not bother with the posix-acl xlator when the filesystem on the brick
supports POSIX ACLs.

Niels


signature.asc
Description: PGP signature
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] relative ordering of writes to same file from two different fds

2016-09-26 Thread Raghavendra Talur
On Mon, Sep 26, 2016 at 1:05 PM, Ryan Ding  wrote:

>
> > On Sep 23, 2016, at 8:59 PM, Jeff Darcy  wrote:
> >
> >>> write-behind: implement causal ordering and other cleanup
> >>
> >>> Rules of causal ordering implemented:
> >>
> >>> - If request A arrives after the acknowledgement (to the app,
> >>
> >>> i.e, STACK_UNWIND) of another request B, then request B is
> >>
> >>> said to have 'caused' request A.
> >>
> >>
> >> With the above principle, two write requests (p1 and p2 in example
> above)
> >> issued by _two different threads/processes_ there need _not always_ be a
> >> 'causal' relationship (whether there is a causal relationship is purely
> >> based on the "chance" that write-behind chose to ack one/both of them
> and
> >> their timing of arrival).
> >
> > I think this is an issue of terminology.  While it's not *certain* that B
> > (or p1) caused A (or p2), it's *possible*.  Contrast with the case where
> > they overlap, which could not possibly happen if the application were
> > trying to ensure order.  In the distributed-system literature, this is
> > often referred to as a causal relationship even though it's really just
> > the possibility of one, because in most cases even the possibility means
> > that reordering would be unacceptable.
> >
> >> So, current write-behind is agnostic to the
> >> ordering of p1 and p2 (when done by two threads).
> >>
> >> However if p1 and p2 are issued by same thread there is _always_ a
> causal
> >> relationship (p2 being caused by p1).
> >
> > See above.  If we feel bound to respect causal relationships, we have to
> > be pessimistic and assume that wherever such a relationship *could* exist
> > it *does* exist.  However, as I explained in my previous message, I don't
> > think it's practical to provide such a guarantee across multiple clients,
> > and if we don't provide it across multiple clients then it's not worth
> > much to provide it on a single client.  Applications that require such
> > strict ordering shouldn't use write-behind, or should explicitly flush
> > between writes.  Otherwise they'll break unexpectedly when parts are
> > distributed across multiple nodes.  Assuming that everything runs on one
> > node is the same mistake POSIX makes.  The assumption was appropriate
> > for an earlier era, but not now for a decade or more.
>
> We can separate this into two questions:
> 1. Should there be a causal relationship for a local application?
> 2. Should there be a causal relationship for a distributed application?
> I think the answer to #2 is 'NO'. This is an issue that the distributed
> application should resolve, either by using the distributed locks we provide
> or in its own way (fsync is required in such a condition).
>
True.


> I think the answer to #1 is 'YES', because buffered I/O should not introduce
> data-consistency problems that unbuffered I/O does not have; it is very common
> for a local application to assume the underlying file system behaves that way.
> Furthermore, staying compatible with the Linux page cache will always be the
> better practice, because a lot of local applications already rely on its
> semantics.
>

I agree. If my understanding is correct, this is the same model that write-behind
uses today if we don't consider the proposed patch. Write-behind orders all
causal operations on the inode (file object) irrespective of the FD used. This
particular patch brings a small modification: it lets the FSYNC and FLUSH FOPs
bypass the ordering as long as they are not on the same FD as the pending WRITE
FOP.
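
To make the single-client case concrete, a small sketch of the pattern under
discussion (the mount path is hypothetical); whether the second write may be
reordered before the first is exactly the policy question above:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Two descriptors on the same file, one client, one thread. */
    int main(void)
    {
        int fd1 = open("/mnt/glusterfs/testfile", O_WRONLY | O_CREAT, 0644);
        int fd2 = open("/mnt/glusterfs/testfile", O_WRONLY);
        if (fd1 < 0 || fd2 < 0) {
            perror("open");
            return 1;
        }

        /* p1 is acknowledged to the application before p2 is issued,
         * so there is a (possible) causal relationship p1 -> p2. */
        if (pwrite(fd1, "AAAA", 4, 0) != 4)
            perror("pwrite fd1");
        if (pwrite(fd2, "BBBB", 4, 0) != 4)
            perror("pwrite fd2");

        /* Applications that need strict ordering across descriptors
         * should flush explicitly between the writes. */
        fsync(fd1);

        close(fd1);
        close(fd2);
        return 0;
    }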

Thanks,
Raghavendra Talur


> Thanks,
> Ryan
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] GlusterFS-3.7.16 release approaching

2016-09-26 Thread Kaushal M
Hi all,

GlusterFS-3.7.16 is on target to be released on Sep 30, 4 days from now.

In preparation for the release, maintainers, please stop merging
any more changes into release-3.7.
If any developer has a change that needs to be merged, please reply to
this email before end of day Sep 28.

At this moment, 16 changes have been merged on top of v3.7.15. There
are still 5 patches under review that have been added since the last
release [1].

Thanks,
Kaushal

[1] 
http://review.gluster.org/#/q/project:glusterfs+branch:release-3.7+status:open
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel