Re: [Gluster-users] Questions about gluster/fuse, page cache, and coherence

2013-03-23 Thread Anand Avati
Please find answers below -

On Mon, Mar 18, 2013 at 12:03 AM, nlxswig nlxs...@126.com wrote:

 Good questions,
 Why is there no reply?

 At 2011-08-16 04:53:50, Patrick J. LoPresti lopre...@gmail.com wrote:
 (FUSE developers:  Although my questions are specifically about
 Gluster, I suspect most of the answers have more to do with FUSE, so I
 figure this is on-topic for your list.  If I figured wrong, I
 apologize.)
 
 I have done quite a bit of searching looking for answers to these
 questions, and I just cannot find them...
 
 I think I understand how the Linux page cache works for an ordinary
 local (non-FUSE) partition.  Specifically:
 
 1) When my application calls read(), it reads from the page cache.  If
 the page(s) are not resident, the kernel puts my application to sleep
 and gets busy reading them from disk.
 
 2) When my application calls write(), it writes to the page cache.
 The kernel will -- eventually, when it feels like it -- flush those
 dirty pages to disk.
 
 3) When my application calls mmap(), page cache pages are mapped into
 my process's address space, allowing me to create a dirty page or read
 a page by accessing memory.
 
 4) When the kernel reads a page, it might decide to read some other
 pages, depending on the underlying block device's read-ahead
 parameters.  I can control these via blockdev.  On the write side, I
 can exercise some control with various VM parameters (dirty_ratio
 etc).  I can also use calls like fsync() and posix_fadvise() to exert
 some control over page cache management at the application level.
 
 
 My question is pretty simple.  If you had to re-write the above four
 points for a Gluster file system, what would they look like?  If it
 matters, I am specifically interested in Gluster 3.2.2 on Suse Linux
 Enterprise Server 11 SP1 (Linux 2.6.32.43 + whatever Suse does to
 their kernels).
 
 Does Gluster use the page cache on read()?  On write()?  If so, how
 does it ensure coherency between clients?  If not, how does mmap()
 work (or does it not work)?

Gluster, or any FUSE filesystem for that matter, does not use the page cache
directly by itself. It serves read/write requests by either reading from or
writing to /dev/fuse. The read/write implementations of the /dev/fuse device
perform the copy. Where that copy goes to or comes from depends on whether
the file was opened with O_DIRECT and/or direct_io was enabled on the open
file. For normal IO, the copy happens to/from the page cache. For O_DIRECT
or direct_io the page cache is bypassed completely, but care is taken -- as
a best-effort attempt -- to make sure that any copy of the data in the page
cache is flushed, so as to give a consistent view of the file between two
applications (on the SAME mount point ONLY) which have opened the file with
different modes (O_DIRECT and otherwise).
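For illustration, a minimal sketch (path and sizes are examples) of what the O_DIRECT case looks like from the application side; note that O_DIRECT requires suitably aligned buffers, offsets and lengths (typically 512 bytes or the page size):

#define _GNU_SOURCE            /* O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/gluster/file.dat", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }

    /* This read is served by the filesystem on every call; the page cache
     * is not consulted, so another client's writes are not masked by
     * locally cached pages. */
    ssize_t n = pread(fd, buf, 4096, 0);
    if (n < 0) perror("pread");

    free(buf);
    close(fd);
    return 0;
}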

As long as all the mounts use the direct_io mount option, coherency between
mounts is entirely in the hands of the filesystem (like gluster), as FUSE
acts like a pure pass-through. On the other hand, if normal IO is happening
through the page cache, then re-reads can always be served directly from the
page cache without the filesystem (like gluster) even knowing that a read()
request was issued by a process. The filesystem could, however, use the
reverse-invalidation calls to invalidate the pages on all mounts when a
write happens elsewhere (the coordination needs to happen in the filesystem;
FUSE only provides the invalidation primitives) -- Gluster does NOT do this
yet.
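To make that last point concrete, a rough sketch of what such a call could look like, assuming libfuse 3's low-level API; the session pointer, inode and trigger are placeholders, and again, gluster does not issue these calls today:

#define FUSE_USE_VERSION 34
#include <fuse_lowlevel.h>

/* Placeholder trigger: called when the filesystem learns, through its own
 * coordination, that another client wrote [off, off+len) of this inode. */
static void on_remote_write(struct fuse_session *se, fuse_ino_t ino,
                            off_t off, off_t len)
{
    /* Ask the kernel to drop its cached pages for that range, so the next
     * read() on this mount goes back to the filesystem. */
    fuse_lowlevel_notify_inval_inode(se, ino, off, len);
}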

There is also a flag in the FUSE open() operation to indicate whether or not
to keep the page cache for the file. By default gluster asks FUSE to purge
the page cache in open(). This gives you close-to-open consistency (i.e., if
an open() by a process is performed strictly after a close() by any other
process, even on a different machine, then you are guaranteed to see all the
content written by that application) -- very similar to the consistency
offered by the NFS (v3) client in Linux.
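A minimal sketch of that close-to-open pattern (the path is hypothetical; writer() and reader() stand for processes on two different clients, with the reader's open() strictly after the writer's close()):

#include <fcntl.h>
#include <unistd.h>

static void writer(void)            /* runs on client A */
{
    int fd = open("/mnt/gluster/shared.log", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd, "hello\n", 6);
    close(fd);                      /* once this returns, any later open()
                                       anywhere is guaranteed to see the data */
}

static void reader(void)            /* runs on client B, strictly after writer() */
{
    char buf[16];
    int fd = open("/mnt/gluster/shared.log", O_RDONLY);  /* open() purges any
                                                            cached pages */
    read(fd, buf, sizeof buf);      /* sees "hello\n" */
    close(fd);
}

int main(void)
{
    /* Run back-to-back here only to keep the sketch compilable; in practice
     * A and B are different machines mounting the same volume. */
    writer();
    reader();
    return 0;
}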

In summary, by default you get close-to-open consistency with gluster, but
if you require strict consistency between two applications on different
clients which have the file open at the same time, then you need BOTH a and
b:

a. Either app opens with O_DIRECT, or mount glusterfs with
--enable-direct-io, to keep the page cache out of the way of consistency.

b. Either app opens with O_DSYNC (or O_SYNC), or disable write-behind
in the gluster volume configuration.
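As an illustration of the open-flags route for (a) and (b) together, a minimal sketch (the path is hypothetical); the alternatives are the mount option in (a) and the volume-level write-behind setting in (b):

#define _GNU_SOURCE                 /* O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* O_DIRECT keeps the page cache out of the read path; O_DSYNC makes
     * each write() synchronous, so it is not held back by write-behind. */
    int fd = open("/mnt/gluster/shared.dat", O_RDWR | O_DIRECT | O_DSYNC);
    if (fd < 0) return 1;

    /* O_DIRECT IO must use aligned buffers, offsets and lengths. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }

    pread(fd, buf, 4096, 0);        /* always reaches the filesystem */
    pwrite(fd, buf, 4096, 0);       /* flushed before the call returns */

    free(buf);
    close(fd);
    return 0;
}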

W.r.t. mmap(): getting strict consistency between the shared mapped regions
of two applications on different machines is pretty much impossible (the
filesystem/kernel only learns about the first time an app attempts to write
to the mapped region, via a page fault; once the page has been marked dirty
by that first write, nobody gets notified as the app modifies other memory
regions of that page). There are four combinations: private vs shared
mappings, and mmap() on a direct_io file vs a normal file.

shared and direct_io - not even 

Re: [Gluster-users] Questions about gluster/fuse, page cache, and coherence

2013-03-18 Thread nlxswig
Good questions,
Why is there no reply?


At 2011-08-16 04:53:50, Patrick J. LoPresti lopre...@gmail.com wrote:
(FUSE developers:  Although my questions are specifically about
Gluster, I suspect most of the answers have more to do with FUSE, so I
figure this is on-topic for your list.  If I figured wrong, I
apologize.)

I have done quite a bit of searching looking for answers to these
questions, and I just cannot find them...

I think I understand how the Linux page cache works for an ordinary
local (non-FUSE) partition.  Specifically:

1) When my application calls read(), it reads from the page cache.  If
the page(s) are not resident, the kernel puts my application to sleep
and gets busy reading them from disk.

2) When my application calls write(), it writes to the page cache.
The kernel will -- eventually, when it feels like it -- flush those
dirty pages to disk.

3) When my application calls mmap(), page cache pages are mapped into
my process's address space, allowing me to create a dirty page or read
a page by accessing memory.

4) When the kernel reads a page, it might decide to read some other
pages, depending on the underlying block device's read-ahead
parameters.  I can control these via blockdev.  On the write side, I
can exercise some control with various VM parameters (dirty_ratio
etc).  I can also use calls like fsync() and posix_fadvise() to exert
some control over page cache management at the application level.


My question is pretty simple.  If you had to re-write the above four
points for a Gluster file system, what would they look like?  If it
matters, I am specifically interested in Gluster 3.2.2 on Suse Linux
Enterprise Server 11 SP1 (Linux 2.6.32.43 + whatever Suse does to
their kernels).

Does Gluster use the page cache on read()?  On write()?  If so, how
does it ensure coherency between clients?  If not, how does mmap()
work (or does it not work)?

What read-ahead will the kernel use?  Does posix_fadvise(...,
POSIX_FADV_WILLNEED) have any effect on a Gluster file system?

I find it hard to imagine that I am the only person with questions
like these...  Did I miss a FAQ list somewhere?

Thanks.

 - Pat