Re: [Gluster-users] How caches are working on AFR?

Anand Babu Periasamy Mon, 09 Mar 2009 01:54:07 -0700

Stas Oskin wrote:

Hi.


2009/3/8 Anand Babu Periasamy <[email protected]>

    Replicate in 2.0 performs atomic writes by default. This means,
    writes will return control
    back to application only after both the volumes (or more) are
    successfully written.

Ok, so without the write-behind cache, the app would continue only when the data is physically written to all AFR disks?


Yes. Precisely speaking, the write returns once the data has been handed
over to the underlying disk file system, not when it has been physically
written to disk. At that point it may be written or only journaled.

Every parallel write operation is a transaction. It has to complete
atomically on all volumes. If a volume is down, the write does not block;
the incomplete files are marked as pending instead.
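
As a rough illustration of the transaction behaviour described above, here
is a toy Python model. It is not GlusterFS code: the `Replica` class, the
`replicated_write` function and the `pending` dictionary are hypothetical
names invented for this sketch, and real replicate does far more (locking,
changelogs, self-heal).

```python
class Replica:
    """Toy stand-in for one replicated volume (hypothetical, not a GlusterFS API)."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.files = {}  # path -> bytearray holding the file contents


def replicated_write(replicas, pending, path, data):
    """Append data to every reachable replica and return only when all are done.

    A replica that is down does not block the call. Instead the path is
    recorded in `pending` (path -> set of replica names that missed the write)
    so it can be repaired later.
    """
    written = 0
    for replica in replicas:
        if replica.up:
            replica.files.setdefault(path, bytearray()).extend(data)
            written += 1
        else:
            pending.setdefault(path, set()).add(replica.name)
    if written == 0:
        raise OSError("no replica reachable")
    return len(data)
```

The caller gets control back only after the loop has visited every volume,
which is the cost that write-behind is meant to hide.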

    To mask the performance penalty of atomic writes, you should load
    write-behind on top of
    it. Write-behind returns control as soon as it receives the write
    call from the
    application, but it continues to write in background. Write-behind
    also performs
    block-aggregation. Smaller writes are aggregated into fewer large
    writes.

    POSIX says application should verify the return status of close
    system call to ensure all
    writes were successfully written. If there are any pending writes,
    close call will block to
    ensure all the data is completely written. There is an option in
    write-behind to even
    close in background. It is unsafe and turned off by default.
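
The block aggregation and blocking close described in the quoted text can be
sketched with a toy model like the one below. This is only an assumption-laden
illustration, not the write-behind translator: the `WriteBehind` class, the
`backend_write` callback and the `aggregate_size` parameter are made up for
the example, and the flush happens synchronously once the buffer fills rather
than in a background thread.

```python
class WriteBehind:
    """Toy write-behind buffer (hypothetical names, not GlusterFS code)."""

    def __init__(self, backend_write, aggregate_size=128 * 1024):
        self.backend_write = backend_write    # callable taking one bytes object
        self.aggregate_size = aggregate_size  # flush threshold in bytes
        self.buffer = bytearray()

    def write(self, data):
        # Return to the caller right away; the data is only buffered here.
        self.buffer.extend(data)
        if len(self.buffer) >= self.aggregate_size:
            self._flush()
        return len(data)

    def _flush(self):
        # Hand all buffered small writes to the backend as one large write.
        if self.buffer:
            self.backend_write(bytes(self.buffer))
            self.buffer.clear()

    def close(self):
        # close() waits for pending data; a backend error surfaces here.
        self._flush()
```

In this model a failure in `backend_write` is reported by `close()`, long
after the corresponding `write()` has already returned success.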

So I need to call close() for each file (which should be done anyway for correct operation) in order to ensure everything was written to disk?


And if the close() fails - this means some of the data was lost?

Yes, correct. This behavior is expected even for regular disk file systems.
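
A minimal sketch of checking the result of close from an application, using
Python's `os` module as a thin wrapper over the POSIX calls. The function name
`write_and_close` is hypothetical; `os.close` raises `OSError` when the
underlying close call reports an error.

```python
import os


def write_and_close(path, data):
    """Write data to path and report any error, including one from close()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested; keep going.
            written = os.write(fd, view)
            view = view[written:]
    except OSError:
        os.close(fd)
        raise
    # Do not ignore this call: a deferred write error can show up here
    # as an OSError (for example EIO, ENOSPC or EDQUOT).
    os.close(fd)
```

If `write_and_close` raises, the application should treat the file as
possibly incomplete.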

If you want every write to be physically written to disk, you should
either open with O_DIRECT or flush or use appropriate file system APIs
for synchronous writes. GlusterFS respects all the flags/APIs and turns off
write-behind or any such optimizations appropriately.
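
For illustration, the two simplest ways to request synchronous behaviour from
an application look roughly like this in Python. The function names are
hypothetical; `os.fsync` and the `os.O_SYNC` flag map to the POSIX calls on
Unix systems. O_DIRECT is not shown because it needs specially aligned
buffers.

```python
import os


def durable_write(path, data):
    """Write data, then flush it to stable storage before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # block until the file system reports the data as stored
    finally:
        os.close(fd)


def open_synchronous(path):
    """Open path so that every write on the descriptor is synchronous."""
    return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
```

The first form pays the synchronisation cost once per file, the second on
every write call.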

    Applications that expect every write to succeed issue synchronous
    writes.

By this you mean that no write-behind should be used, only the default atomic writes behavior?


No, write-behind is good. Even NFS and regular disk file systems behave
exactly like this. See the excerpt from the GNU Glibc reference manual below.

In GlusterFS, all functionality, including the basic performance
features, is implemented as modules. You will get awful performance
without these modules loaded. Without them, you can only expect GlusterFS
to be functionally correct.

--------[ FROM GLIBC DOC ]--------------------------------
for write (..)
     Once `write' returns, the data is enqueued to be written and can be
     read back right away, but it is not necessarily written out to
     permanent storage immediately.  You can use `fsync' when you need
     to be sure your data has been permanently stored before
     continuing.  (It is more efficient for the system to batch up
     consecutive writes and do them all at once when convenient.
     Normally they will always be written to disk within a minute or
     less.)  Modern systems provide another function `fdatasync' which
     guarantees integrity only for the file data and is therefore
     faster.  You can use the `O_FSYNC' open mode to make `write' always
     store the data to disk before returning;

for close (..)
`ENOSPC'
`EIO'
`EDQUOT'
     When the file is accessed by NFS, these errors from `write'
     can sometimes not be detected until `close'.  *Note I/O
     Primitives::, for details on their meaning.
----------------------------------------------------------

--
Anand Babu Periasamy
GPG Key ID: 0x62E15A31
Blog [http://ab.multics.org]
GlusterFS [http://www.gluster.org]
The GNU Operating System [http://www.gnu.org]

_______________________________________________
Gluster-users mailing list
[email protected]
http://zresearch.com/cgi-bin/mailman/listinfo/gluster-users