Re: [Gluster-users] Questions about gluster/fuse, page cache, and coherence

2013-03-23 Thread Anand Avati
Please find answers below -

On Mon, Mar 18, 2013 at 12:03 AM, nlxswig  wrote:

> Good questions,
> Why are there no replies?
>
> At 2011-08-16 04:53:50,"Patrick J. LoPresti"  wrote:
> >(FUSE developers:  Although my questions are specifically about
> >Gluster, I suspect most of the answers have more to do with FUSE, so I
> >figure this is on-topic for your list.  If I figured wrong, I
> >apologize.)
> >
> >I have done quite a bit of searching looking for answers to these
> >questions, and I just cannot find them...
> >
> >I think I understand how the Linux page cache works for an ordinary
> >local (non-FUSE) partition.  Specifically:
> >
> >1) When my application calls read(), it reads from the page cache.  If
> >the page(s) are not resident, the kernel puts my application to sleep
> >and gets busy reading them from disk.
> >
> >2) When my application calls write(), it writes to the page cache.
> >The kernel will -- eventually, when it feels like it -- flush those
> >dirty pages to disk.
> >
> >3) When my application calls mmap(), page cache pages are mapped into
> >my process's address space, allowing me to create a dirty page or read
> >a page by accessing memory.
> >
> >4) When the kernel reads a page, it might decide to read some other
> >pages, depending on the underlying block device's read-ahead
> >parameters.  I can control these via "blockdev".  On the write side, I
> >can exercise some control with various VM parameters (dirty_ratio
> >etc).  I can also use calls like fsync() and posix_fadvise() to exert
> >some control over page cache management at the application level.
> >
> >
> >My question is pretty simple.  If you had to re-write the above four
> >points for a Gluster file system, what would they look like?  If it
> >matters, I am specifically interested in Gluster 3.2.2 on Suse Linux
> >Enterprise Server 11 SP1 (Linux 2.6.32.43 + whatever Suse does to
> >their kernels).
> >
> >Does Gluster use the page cache on read()?  On write()?  If so, how
> >does it ensure coherency between clients?  If not, how does mmap()
> >work (or does it not work)?
>
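
Point 4 above mentions fsync() and posix_fadvise(); for reference, application-level
page-cache control on a local file might look like the minimal sketch below
(hypothetical path, error handling trimmed):

/* Minimal sketch of application-level page-cache hints on a local file. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/local/data.bin", O_RDWR);   /* hypothetical path */
    if (fd < 0)
        return 1;

    /* Hint that the file will be read sequentially (read-ahead friendly). */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    /* ... read()/write() as usual ... */

    /* Force dirty pages for this file out to stable storage now. */
    fsync(fd);

    /* Tell the kernel the cached pages are no longer needed. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    close(fd);
    return 0;
}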
Gluster, like any FUSE filesystem, does not use the page cache
directly. It serves read/write requests by either reading from or writing
to /dev/fuse. The read/write implementations of the /dev/fuse "device"
perform the copy. Now, where they copy to/from depends on whether
the file was opened with O_DIRECT and/or whether "direct_io" was enabled on
the open file. For "normal" IO, the copy happens to/from the page cache. For
O_DIRECT or "direct_io" the page cache is bypassed completely, but care is
taken -- as a best-effort attempt -- to make sure that the copy of data in
the page cache is flushed, to give a consistent "view" of the file between
two applications (on the SAME mount point ONLY) which open the file
with different modes (O_DIRECT and otherwise).
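
To illustrate the distinction, a buffered open() goes through the page cache
while an O_DIRECT open() bypasses it (and requires suitably aligned buffers).
A minimal sketch, assuming a hypothetical mount path and 4096-byte alignment,
with error handling trimmed:

/* Buffered vs. O_DIRECT reads on a FUSE/gluster mount (sketch). */
#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    char plain[4096];

    /* "Normal" IO: the copy in /dev/fuse lands in the page cache, so a
     * re-read may be served locally without the filesystem seeing it. */
    int fd = open("/mnt/gluster/file", O_RDONLY);        /* hypothetical */
    read(fd, plain, sizeof(plain));
    close(fd);

    /* O_DIRECT: the page cache is bypassed; the buffer must be aligned. */
    void *aligned;
    if (posix_memalign(&aligned, 4096, 4096) != 0)
        return 1;
    fd = open("/mnt/gluster/file", O_RDONLY | O_DIRECT);
    read(fd, aligned, 4096);
    close(fd);
    free(aligned);
    return 0;
}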

As long as all the mounts use the "direct_io" mount option, coherency
between mounts is entirely in the hands of the filesystem (like gluster), as
FUSE acts as a pure pass-through. On the other hand, if "normal" IO
is happening through the page cache, then re-reads can always be served
directly from the page cache without the filesystem (like gluster) even
knowing that a read() request was issued by a process. The filesystem could,
however, use the reverse invalidation calls to invalidate the pages in all
mounts when a write happens elsewhere (the coordination needs to
happen in the filesystem; FUSE only provides the invalidation primitives)
-- Gluster does NOT do this yet.

There is also a flag in the FUSE open() operation to indicate whether or not to
keep the page cache of the file. By default gluster asks FUSE to purge the
page cache in open(). This gives you close-to-open consistency (i.e., if
an open() by one process is performed strictly after a close() by any other
process, even on a different machine, then you are guaranteed to see all
the content written by that process) -- very similar to the consistency
offered by the NFS (v3) client in Linux.
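
In other words, the guarantee is ordered around close() and open(). A minimal
sketch of the pattern that close-to-open consistency covers, assuming a
hypothetical path (the two halves would normally run as separate processes,
possibly on different machines; error handling trimmed):

/* Close-to-open consistency pattern (sketch). */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static void writer(void)     /* e.g. process A on machine 1 */
{
    int fd = open("/mnt/gluster/shared.txt", O_WRONLY | O_CREAT, 0644);
    const char *msg = "hello from A\n";
    write(fd, msg, strlen(msg));
    close(fd);               /* after close(), a later open() sees the data */
}

static void reader(void)     /* e.g. process B on machine 2, strictly later */
{
    char buf[64];
    int fd = open("/mnt/gluster/shared.txt", O_RDONLY);  /* open() purges the
                                local page cache for this file by default */
    read(fd, buf, sizeof(buf));   /* guaranteed to see "hello from A" */
    close(fd);
}

int main(void)
{
    /* Stand-in for two separate processes: writer first, reader after. */
    writer();
    reader();
    return 0;
}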

In summary, this means that by default you get close-to-open consistency with
gluster, but if you require strict consistency between two applications on
different clients which have opened the file at the same time, then you need
BOTH a and b:

a. Either the apps open with O_DIRECT, or mount glusterfs with
--enable-direct-io, to keep the page cache out of the way of consistency.

b. Either the apps open with O_DSYNC (or O_SYNC), or disable write-behind
in the gluster volume configuration.
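
A minimal sketch of satisfying both (a) and (b) from the application side,
i.e. opening with O_DIRECT and O_DSYNC (the path, size, and 4096-byte alignment
are assumptions; the mount/volume options mentioned above achieve the same
effect without touching the application):

/* Strictly consistent writes per (a) + (b) above (sketch). */
#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0)   /* O_DIRECT needs alignment */
        return 1;
    memset(buf, 'x', 4096);

    /* O_DIRECT keeps the page cache out of the way (a); O_DSYNC makes each
     * write() synchronous so write-behind cannot acknowledge it early (b). */
    int fd = open("/mnt/gluster/shared.dat",                 /* hypothetical */
                  O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
    if (fd < 0)
        return 1;

    write(fd, buf, 4096);

    close(fd);
    free(buf);
    return 0;
}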

W.r.t. mmap(): getting strict consistency between the "shared" mapped
regions of two applications on different machines is pretty much impossible
(the filesystem/kernel learns only about the first time an app attempts to
write to the mapped region, via a page fault; once the page is marked dirty
by that first write, nobody gets notified that the app is modifying
other memory regions of that page). There are four combinations - private
vs shared, and mm
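
For completeness, the most an application can do with a shared mapping is flush
its own dirty pages explicitly with msync(); that pushes local modifications
down to the filesystem but does nothing about the notification problem
described above. A minimal sketch (hypothetical path and length, error
handling trimmed):

/* Explicitly flushing a shared mapping with msync() (sketch). */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 4096;
    int fd = open("/mnt/gluster/mapped.dat", O_RDWR);    /* hypothetical */
    if (fd < 0)
        return 1;

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    memcpy(p, "update", 6);    /* dirties the page via a fault on first write */
    msync(p, len, MS_SYNC);    /* write the dirty page back to the file now */

    munmap(p, len);
    close(fd);
    return 0;
}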

Re: [Gluster-users] [Gluster-devel] Nightly rpms?

2013-03-23 Thread Justin Clift
On 18/03/2013, at 9:12 AM, Nux! wrote:
> Hello,
> 
> On some occasions with 3.4, for example, I seemed to hit bugs that not only
> were already reported, but in some cases even fixed (like a recent quota
> failure issue). Is there a place where I could get nightly or at least weekly
> RPMs?
> This way at least I'll hit new or unresolved bugs ... :-)

If it's helpful, I've just added instructions to the wiki for building
RPMs from git source (for Fedora 17 and CentOS 6.4):

  http://www.gluster.org/community/documentation/index.php/CompilingRPMS

It's actually pretty simple (once you know how). :)

If someone knows the steps for building the RPMs for SuSE and
similar, please add them. :)

Regards and best wishes,

Justin Clift


> -- 
> Sent from the Delta quadrant using Borg technology!
> 
> Nux!
> www.nux.ro
> 

--
Open Source and Standards @ Red Hat

twitter.com/realjustinclift



Re: [Gluster-users] cp: skipping file $FILEPATH as it was replaced while being copied

2013-03-23 Thread Brian Foster
On 03/22/2013 05:11 PM, Patrick Regan wrote:
> I have an 8-node, 2-replica Gluster volume mounted with the fuse client.
> We also have an in-house Perl script we use for doing block string
> substitutions. If we run this script on directories on the volume, I
> get the following error on almost every file:
> 
> cp: skipping file $FILEPATH as it was replaced while being copied
> 

That error is preceded by the following comment in coreutils:

  /* Compare the source dev/ino from the open file to the incoming,
 saved ones obtained via a previous call to stat.  */

... which basically means the source inode has changed between the time
this particular function runs a stat on the source path and the time an
earlier call ran the same check. Have you confirmed which cp command
in your script results in this error (e.g., it looks like you have a cp
to temp and a cp to orig)?
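
For reference, the check cp is tripping over amounts to comparing the
device/inode pair from a stat() of the source path against an fstat() of the
already-opened source descriptor. A minimal sketch of that comparison (not the
actual coreutils code; hypothetical path, error handling trimmed):

/* Sketch of the dev/ino check that triggers "skipping file ...". */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    struct stat before, after;

    if (stat("/mnt/gluster/source.txt", &before) != 0)   /* hypothetical */
        return 1;

    int fd = open("/mnt/gluster/source.txt", O_RDONLY);
    if (fd < 0 || fstat(fd, &after) != 0)
        return 1;

    if (before.st_dev != after.st_dev || before.st_ino != after.st_ino) {
        /* The path now refers to a different inode: the file was replaced
         * between the earlier stat() and the open(). */
        fprintf(stderr, "skipping file as it was replaced while being copied\n");
        close(fd);
        return 1;
    }

    /* ... otherwise it is safe to keep copying from fd ... */
    close(fd);
    return 0;
}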

I know nothing about perl (and the lack of indentation in the script
post is making my eyes cross :P), but have you thought about the level
of serialization you're getting in the perl script between the cp
commands and programmatic file access? In other words, is it possible
the work your script invokes via print and `` operators is racing with
the script itself?

Brian

> If I mount the volume with NFS, this issue does not arise.
> 
> I have also tried to replicate the issue using a roughly equivalent
> shell script, but the shell script does not produce the same result on
> either the fuse or the nfs client.
> 
> I'll paste my volume log followed by the perl script, followed by my
> rough shell script.
> 
> I would appreciate any feedback.
> 
> Thanks!
> 
> -
> 
> [2013-03-22 16:32:52.893362] I [glusterfsd.c:1666:main]
> 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version
> 3.3.1
> [2013-03-22 16:32:52.932157] I [io-cache.c:1549:check_cache_size_ok]
> 0-usrweb-quick-read: Max cache size is 2124763136
> [2013-03-22 16:32:52.932634] I [io-cache.c:1549:check_cache_size_ok]
> 0-usrweb-io-cache: Max cache size is 2124763136
> [2013-03-22 16:32:53.035759] I [client.c:2142:notify]
> 0-usrweb-client-0: parent translators are ready, attempting connect on
> transport
> [2013-03-22 16:32:53.041228] I [client.c:2142:notify]
> 0-usrweb-client-1: parent translators are ready, attempting connect on
> transport
> [2013-03-22 16:32:53.046270] I [client.c:2142:notify]
> 0-usrweb-client-2: parent translators are ready, attempting connect on
> transport
> [2013-03-22 16:32:53.050863] I [client.c:2142:notify]
> 0-usrweb-client-3: parent translators are ready, attempting connect on
> transport
> Given volfile:
> +--+
>   1: volume usrweb-client-0
>   2: type protocol/client
>   3: option remote-host ak001
>   4: option remote-subvolume /srv/gluster/volusrweb
>   5: option transport-type tcp
>   6: end-volume
>   7:
>   8: volume usrweb-client-1
>   9: type protocol/client
>  10: option remote-host ak002
>  11: option remote-subvolume /srv/gluster/volusrweb
>  12: option transport-type tcp
>  13: end-volume
>  14:
>  15: volume usrweb-client-2
>  16: type protocol/client
>  17: option remote-host ak003
>  18: option remote-subvolume /srv/gluster/volusrweb
>  19: option transport-type tcp
>  20: end-volume
>  21:
>  22: volume usrweb-client-3
>  23: type protocol/client
>  24: option remote-host ak004
>  25: option remote-subvolume /srv/gluster/volusrweb
>  26: option transport-type tcp
>  27: end-volume
>  28:
>  29: volume usrweb-replicate-0
>  30: type cluster/replicate
>  31: subvolumes usrweb-client-0 usrweb-client-1
>  32: end-volume
>  33:
>  34: volume usrweb-replicate-1
>  35: type cluster/replicate
>  36: subvolumes usrweb-client-2 usrweb-client-3
>  37: end-volume
>  38:
>  39: volume usrweb-dht
>  40: type cluster/distribute
>  41: subvolumes usrweb-replicate-0 usrweb-replicate-1
>  42: end-volume
>  43:
>  44: volume usrweb-write-behind
>  45: type performance/write-behind
>  46: subvolumes usrweb-dht
>  47: end-volume
>  48:
>  49: volume usrweb-read-ahead
>  50: type performance/read-ahead
>  51: subvolumes usrweb-write-behind
>  52: end-volume
>  53:
>  54: volume usrweb-io-cache
>  55: type performance/io-cache
>  56: subvolumes usrweb-read-ahead
>  57: end-volume
>  58:
>  59: volume usrweb-quick-read
>  60: type performance/quick-read
>  61: subvolumes usrweb-io-cache
>  62: end-volume
>  63:
>  64: volume usrweb-md-cache
>  65: type performance/md-cache
>  66: subvolumes usrweb-quick-read
>  67: end-volume
>  68:
>  69: volume usrweb
>  70: type debug/io-stats
>  71: option latency-measurement off
>  72: option count-fop-hits off
>  73: subvolumes usrweb-md-cache
>  74: end-volume
> 
> +--+
> [2013-03-22 16:32:53.057710] I [r