[Apologies if this gets posted twice]

Hi,

I can't seem to get flush-behind to flush behind and I was wondering if anyone would be kind enough to lend some brain cells to help me see what I'm missing.

I'm using glusterfs 2.0.1 on Debian/x86. I've got a "cluster/replicate" volume replicated between two servers that are connected, over a VPN, across a slooooow ADSL link. Synchronous creation of large files over such a link is very slow, so I added the "performance/write-behind" layer with "option flush-behind on". When I write a file to the volume using cp, I see the write()s returning implausibly quickly for the speed of the link, as expected, but the final close() still suffers a long delay, as though I'd never specified flush-behind. The following strace shows what happens when a 100kB file gets copied into the gluster volume (which is mounted on /mnt/tmp); strace prints timestamps at syscall entry.

============
quince:/mnt# strace -ttt cp ~/typically-sized-files/hundredkay /mnt/tmp/100k
1246310229.679320 execve("/bin/cp", ["cp", "/root/typically-sized-files/hund"..., "/mnt/tmp/100k"], [/* 17 vars */]) = 0
[...]
1246310230.343767 open("/root/typically-sized-files/hundredkay", O_RDONLY|O_LARGEFILE) = 3
1246310230.343818 fstat64(3, {st_mode=S_IFREG|0644, st_size=102400, ...}) = 0
1246310230.343861 open("/mnt/tmp/100k", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0644) = 4
1246310230.563629 fstat64(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
1246310230.563694 read(3, "MZ\220\0\3\0\0\0\4\0\0\0\377\377\0\0\270\0 \0\0\0\0\...@\0\0\0\0\0\0\0\0"..., 8192) = 8192
1246310230.563758 write(4, "MZ\220\0\3\0\0\0\4\0\0\0\377\377\0\0\270\0 \0\0\0\0\...@\0\0\0\0\0\0\0\0"..., 8192) = 8192
1246310231.480604 read(3, "M\364_^d\211\r\0\0\0\0\311\303\213\3013\311 \211h\4\211h\10\211h\f\...@\20\1\0\0\0"..., 8192) = 8192
1246310231.480667 write(4, "M\364_^d\211\r\0\0\0\0\311\303\213\3013 \311\211h\4\211h\10\211h\f\...@\20\1\0\0\0"..., 8192) = 8192
1246310231.480940 read(3, "\377\377C;_\10|\360\213E\f\211u\0109p\10\17 \216\220\0\0\0\213e\f\213m\10\...@\f\213"..., 8192) = 8192
1246310231.480990 write(4, "\377\377C;_\10|\360\213E\f\211u\0109p\10 \17\216\220\0\0\0\213e\f\213m\10\...@\f\213"..., 8192) = 8192
[...]
1246310231.481945 read(3, "\312u\1\0\336u\1\0\352u\1\0\374u\1\0\16v\1 \0\36v\1\0,v\1\0:v\1\0H"..., 8192) = 8192
1246310231.481986 write(4, "\312u\1\0\336u\1\0\352u\1\0\374u\1\0\16v\1 \0\36v\1\0,v\1\0:v\1\0H"..., 8192) = 8192
1246310231.482063 read(3, "\377\271^\0\377\302v\0\322\247m\0\364\313 \226\0\377\300h\0\374\277h\0\372\304x\0\340\261p\0\341"..., 8192) = 4096
1246310231.482102 write(4, "\377\271^\0\377\302v\0\322\247m\0\364\313 \226\0\377\300h\0\374\277h\0\372\304x\0\340\261p\0\341"..., 4096) = 4096
1246310231.482174 close(4)              = 0
1246310235.602280 close(3)              = 0
[...]
1246310235.602419 exit_group(0)         = ?
============

This delay at close (about 4.1 seconds here) gets longer as the file gets bigger - exactly as you'd expect if the write()s were asynchronous but the close()/flush was synchronous. Here are the (slightly more abbreviated) timings for a 1MB file:

===========
quince:/mnt# strace -ttt cp ~/typically-sized-files/onemeg /mnt/tmp/
1246315526.483531 execve("/bin/cp", ["cp", "/root/typically-sized-files/onem"..., "/mnt/tmp/"], [/* 17 vars */]) = 0
[...]
1246315527.034448 read(3, "MZ\220\0\3\0\0\0\4\0\0\0\377\377\0\0\270\0 \0\0\0\0\...@\0\0\0\0\0\0\0\0"..., 8192) = 8192
1246315527.034515 write(4, "MZ\220\0\3\0\0\0\4\0\0\0\377\377\0\0\270\0 \0\0\0\0\...@\0\0\0\0\0\0\0\0"..., 8192) = 8192
1246315527.500059 read(3, "M\364_^d\211\r\0\0\0\0\311\303\213\3013\311 \211h\4\211h\10\211h\f\...@\20\1\0\0\0"..., 8192) = 8192
1246315527.500113 write(4, "M\364_^d\211\r\0\0\0\0\311\303\213\3013 \311\211h\4\211h\10\211h\f\...@\20\1\0\0\0"..., 8192) = 8192
[...]
1246315527.522606 write(4, "\362\303\34:?/\5+\351\262\277\35\2413\5(eP \0\357\223\226w\373\311r4\325\347\215\30\r\270"..., 8192) = 8192
1246315527.522733 read(3, ""..., 8192)  = 0
1246315527.522770 close(4)              = 0
1246315563.780185 close(3)              = 0
===========

That's 36.26 seconds for 1MB, or ~28kB/s, which matches my typical upload speed - strongly suggesting that glusterfs is pushing the whole file across the link before returning from close().
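
In case it helps anyone reproduce this without cp in the way, a minimal probe along these lines should show the same split, timing the write() loop and the close() separately (just a sketch - the target path and sizes are arbitrary):

============
/* tiny timing probe: 1MB of 8kB write()s, then a close(), each timed separately */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/mnt/tmp/probe";
    char buf[8192];
    double t0, t1, t2;
    int fd, i;

    memset(buf, 'x', sizeof(buf));

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    t0 = now();
    for (i = 0; i < 128; i++) {                     /* 128 * 8kB = 1MB */
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf)) {
            perror("write");
            return 1;
        }
    }
    t1 = now();

    if (close(fd) != 0) {                           /* this is the call I expect to block */
        perror("close");
        return 1;
    }
    t2 = now();

    printf("write()s: %.3f s, close(): %.3f s\n", t1 - t0, t2 - t1);
    return 0;
}
============

If flush-behind were doing its thing I'd expect both numbers to stay small, with the data trickling out to the remote server afterwards.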

Each of the two slowly-connected machines is a gluster server with a localhost gluster client. I'm using server-based replication (only out of desperation after client-based flush-behind didn't work). This latest config does the write-behind on the server, but I've also tried doing it on the client, with the same problem. I'm including my most recent server-replication-based/server-write-delaying config because it's the one those straces came from.

Client config (identical both ends):

========
volume loopback
  type protocol/client
  option transport-type tcp
  option remote-host 127.0.0.1
  option remote-subvolume delayedwrite  # name of the remote volume
end-volume
=========
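
For completeness, the client-side variant I mentioned above looked roughly like this (reconstructed from memory, so treat it as a sketch): the write-behind sat on top of the protocol/client volume, with the server exporting its stack without the write-behind layer.

=========
volume loopback
  type protocol/client
  option transport-type tcp
  option remote-host 127.0.0.1
  option remote-subvolume threads   # i.e. the server-side stack without write-behind in that attempt (from memory)
end-volume

volume delayedwrite
  type performance/write-behind
  option flush-behind on
  option cache-size 4MB
  subvolumes loopback
end-volume
=========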

Server config. I'm just giving one side because they differ only by the IP address and hostname.

=========
volume shared-raw
  type storage/posix                   # POSIX FS translator
  option directory /mnt/big/fileserving/gluster-test/shared  # Export this directory
end-volume

volume quince-shared
  type features/locks
  subvolumes shared-raw
end-volume

volume remote
  type protocol/client
  option transport-type tcp
  option remote-host 10.8.0.3 # aka kumquat
  option remote-subvolume kumquat-shared  # name of the remote volume
end-volume

volume replicated
  type cluster/replicate
  subvolumes quince-shared remote
  option read-subvolume quince-shared
end-volume

volume threads
  type performance/io-threads
  subvolumes replicated
end-volume

volume delayedwrite
  type performance/write-behind
  option flush-behind on
  option cache-size 4MB
  subvolumes threads
end-volume

volume server
  type protocol/server
  option transport-type tcp
  subvolumes delayedwrite quince-shared
  option auth.addr.delayedwrite.allow 127.0.0.1   # Allow access to replicated volume from just the local client
  option auth.addr.quince-shared.allow 10.8.0.3   # Allow access to lockified tree from kumquat
end-volume
==========

Logs show nothing unusual, and small files replicate between sites just fine with acceptable delay. I've unmounted/stopped/started/remounted everything every time I've changed the configs, so these straces definitely belong to these configs.
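
For the record, the restart cycle after each config change is roughly this (the volfile paths here are just illustrative - adjust to wherever yours live):

=========
umount /mnt/tmp
killall glusterfsd                                    # stop the server
glusterfsd -f /etc/glusterfs/glusterfsd.vol           # restart it with the new server volfile
glusterfs -f /etc/glusterfs/glusterfs.vol /mnt/tmp    # remount using the client volfile
=========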

So obviously the question is: why am I seeing delays on close() even though I've got "option flush-behind on"? Is there something else I need to have in my config? What's the best way to debug this if I've done nothing obviously wrong?

For what it's worth, a little more about my setup: got two offices full of secretaries, connected to t'Internet by ADSL links (Infiniband it certainly ain't!). The secretaries create new Word docs, read each other's docs and move files around, but never edit anything. One office is currently accessing the other over Samba over ADSL, which is predictably unusably slow. I'm trying to use GlusterFS to replicate the tree between the two offices to provide both fast local reads and some level of link death resilience, and I'm hoping to layer Samba on top at either end. I'll resolve conflicts manually because they're so rare. I seriously considered Coda first, but Gluster just looks so much tidier and more lively and current as a project, and - except for this one problem - seems to do exactly what I want.

Thanks in advance for any light shed,

Zak

