[Apologies if this gets posted twice]

Hi,

I can't seem to get flush-behind to flush behind and I was wondering if anyone would be kind enough to lend some brain cells to help me see what I'm missing.

I'm using glusterfs 2.0.1 on Debian/x86. I've got a "cluster/replicate" volume replicated between two servers that are connected, over a VPN, across a slooooow ADSL link. Synchronous creation of large files over such a link is very slow, so I added the "performance/write-behind" layer with "option flush-behind on". When I write a file to the volume using cp, I see the write()s returning implausibly quickly for the speed of the link, as expected, but the final close() still suffers a long delay, as though I'd never specified flush-behind. The following strace shows what happens when a 100kB file gets copied into the gluster volume (which is mounted on /mnt/tmp); strace prints timestamps at syscall entry.

============
quince:/mnt# strace -ttt cp ~/typically-sized-files/hundredkay /mnt/tmp/100k
1246310229.679320 execve("/bin/cp", ["cp", "/root/typically-sized-files/hund"..., "/mnt/tmp/100k"], [/* 17 vars */]) = 0
[...]
1246310230.343767 open("/root/typically-sized-files/hundredkay", O_RDONLY|O_LARGEFILE) = 3
1246310230.343818 fstat64(3, {st_mode=S_IFREG|0644, st_size=102400, ...}) = 0
1246310230.343861 open("/mnt/tmp/100k", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0644) = 4
1246310230.563629 fstat64(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
1246310230.563694 read(3, "MZ\220\0\3\0\0\0\4\0\0\0\377\377\0\0\270\0 \0\0\0\0\...@\0\0\0\0\0\0\0\0"..., 8192) = 8192
1246310230.563758 write(4, "MZ\220\0\3\0\0\0\4\0\0\0\377\377\0\0\270\0 \0\0\0\0\...@\0\0\0\0\0\0\0\0"..., 8192) = 8192
1246310231.480604 read(3, "M\364_^d\211\r\0\0\0\0\311\303\213\3013\311 \211h\4\211h\10\211h\f\...@\20\1\0\0\0"..., 8192) = 8192
1246310231.480667 write(4, "M\364_^d\211\r\0\0\0\0\311\303\213\3013 \311\211h\4\211h\10\211h\f\...@\20\1\0\0\0"..., 8192) = 8192
1246310231.480940 read(3, "\377\377C;_\10|\360\213E\f\211u\0109p\10\17 \216\220\0\0\0\213e\f\213m\10\...@\f\213"..., 8192) = 8192
1246310231.480990 write(4, "\377\377C;_\10|\360\213E\f\211u\0109p\10 \17\216\220\0\0\0\213e\f\213m\10\...@\f\213"..., 8192) = 8192
[...]
1246310231.481945 read(3, "\312u\1\0\336u\1\0\352u\1\0\374u\1\0\16v\1 \0\36v\1\0,v\1\0:v\1\0H"..., 8192) = 8192
1246310231.481986 write(4, "\312u\1\0\336u\1\0\352u\1\0\374u\1\0\16v\1 \0\36v\1\0,v\1\0:v\1\0H"..., 8192) = 8192
1246310231.482063 read(3, "\377\271^\0\377\302v\0\322\247m\0\364\313 \226\0\377\300h\0\374\277h\0\372\304x\0\340\261p\0\341"..., 8192) = 4096
1246310231.482102 write(4, "\377\271^\0\377\302v\0\322\247m\0\364\313 \226\0\377\300h\0\374\277h\0\372\304x\0\340\261p\0\341"..., 4096) = 4096
1246310231.482174 close(4)              = 0
1246310235.602280 close(3)              = 0
[...]
1246310235.602419 exit_group(0)         = ?
============

This delay at close (about 4.1 seconds here) gets longer as the file gets bigger - exactly as you'd expect if the write()s were asynchronous but the close()/flush was synchronous. Here are the (slightly more abbreviated) timings for a 1MB file:

===========
quince:/mnt# strace -ttt cp ~/typically-sized-files/onemeg /mnt/tmp/
1246315526.483531 execve("/bin/cp", ["cp", "/root/typically-sized-files/onem"..., "/mnt/tmp/"], [/* 17 vars */]) = 0
[...]
1246315527.034448 read(3, "MZ\220\0\3\0\0\0\4\0\0\0\377\377\0\0\270\0 \0\0\0\0\...@\0\0\0\0\0\0\0\0"..., 8192) = 8192
1246315527.034515 write(4, "MZ\220\0\3\0\0\0\4\0\0\0\377\377\0\0\270\0 \0\0\0\0\...@\0\0\0\0\0\0\0\0"..., 8192) = 8192
1246315527.500059 read(3, "M\364_^d\211\r\0\0\0\0\311\303\213\3013\311 \211h\4\211h\10\211h\f\...@\20\1\0\0\0"..., 8192) = 8192
1246315527.500113 write(4, "M\364_^d\211\r\0\0\0\0\311\303\213\3013 \311\211h\4\211h\10\211h\f\...@\20\1\0\0\0"..., 8192) = 8192
[...]
1246315527.522606 write(4, "\362\303\34:?/\5+\351\262\277\35\2413\5(eP \0\357\223\226w\373\311r4\325\347\215\30\r\270"..., 8192) = 8192
1246315527.522733 read(3, ""..., 8192)  = 0
1246315527.522770 close(4)              = 0
1246315563.780185 close(3)              = 0
===========

That's 36.26 seconds for 1MB, or ~28kB/s, which matches my typical upload speed - strongly suggesting that glusterfs is pushing the whole file across the link before returning from close().
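
In case it helps anyone reproduce this without cp in the way, a minimal probe along these lines should show the same split, timing the write() loop and the close() separately (just a sketch - the target path and sizes are arbitrary):

============
/* tiny timing probe: 1MB of 8kB write()s, then a close(), each timed separately */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/mnt/tmp/probe";
    char buf[8192];
    double t0, t1, t2;
    int fd, i;

    memset(buf, 'x', sizeof(buf));

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    t0 = now();
    for (i = 0; i < 128; i++) {                     /* 128 * 8kB = 1MB */
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf)) {
            perror("write");
            return 1;
        }
    }
    t1 = now();

    if (close(fd) != 0) {                           /* this is the call I expect to block */
        perror("close");
        return 1;
    }
    t2 = now();

    printf("write()s: %.3f s, close(): %.3f s\n", t1 - t0, t2 - t1);
    return 0;
}
============

If flush-behind were doing its thing I'd expect both numbers to stay small, with the data trickling out to the remote server afterwards.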

Each of the two slowly-connected machines is a gluster server with a localhost gluster client. I'm using server-based replication (only out of desperation after client-based flush-behind didn't work). This latest config does the write-behind on the server, but I've also tried doing it on the client, with the same problem. I'm including my most recent server-replication-based/server-write-delaying config because it's the one those straces came from.

Client config (identical both ends):

========
volume loopback
  type protocol/client
  option transport-type tcp
  option remote-host 127.0.0.1
  option remote-subvolume delayedwrite  # name of the remote volume
end-volume
=========
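
For completeness, the client-side variant I mentioned above looked roughly like this (reconstructed from memory, so treat it as a sketch): the write-behind sat on top of the protocol/client volume, with the server exporting its stack without the write-behind layer.

=========
volume loopback
  type protocol/client
  option transport-type tcp
  option remote-host 127.0.0.1
  option remote-subvolume threads   # i.e. the server-side stack without write-behind in that attempt (from memory)
end-volume

volume delayedwrite
  type performance/write-behind
  option flush-behind on
  option cache-size 4MB
  subvolumes loopback
end-volume
=========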

Server config. I'm just giving one side because they differ only by the IP address and hostname.

=========
volume shared-raw
  type storage/posix                   # POSIX FS translator
  option directory /mnt/big/fileserving/gluster-test/shared  # Export this directory
end-volume

volume quince-shared
  type features/locks
  subvolumes shared-raw
end-volume

volume remote
  type protocol/client
  option transport-type tcp
  option remote-host 10.8.0.3 # aka kumquat
  option remote-subvolume kumquat-shared  # name of the remote volume
end-volume

volume replicated
  type cluster/replicate
  subvolumes quince-shared remote
  option read-subvolume quince-shared
end-volume

volume threads
  type performance/io-threads
  subvolumes replicated
end-volume

volume delayedwrite
  type performance/write-behind
  option flush-behind on
  option cache-size 4MB
  subvolumes threads
end-volume

volume server
  type protocol/server
  option transport-type tcp
  subvolumes delayedwrite quince-shared
  option auth.addr.delayedwrite.allow 127.0.0.1   # Allow access to replicated volume from just the local client
  option auth.addr.quince-shared.allow 10.8.0.3   # Allow access to lockified tree from kumquat
end-volume
==========

Logs show nothing unusual, and small files replicate between sites just fine with acceptable delay. I've unmounted/stopped/started/remounted everything every time I've changed the configs, so these straces definitely belong to these configs.
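
For the record, the restart cycle after each config change is roughly this (the volfile paths here are just illustrative - adjust to wherever yours live):

=========
umount /mnt/tmp
killall glusterfsd                                    # stop the server
glusterfsd -f /etc/glusterfs/glusterfsd.vol           # restart it with the new server volfile
glusterfs -f /etc/glusterfs/glusterfs.vol /mnt/tmp    # remount using the client volfile
=========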

So obviously the question is: why am I seeing delays on close() even though I've got "option flush-behind on"? Is there something else I need to have in my config? What's the best way to debug this if I've done nothing obviously wrong?

For what it's worth, a little more about my setup: got two offices full of secretaries, connected to t'Internet by ADSL links (Infiniband it certainly ain't!). The secretaries create new Word docs, read each other's docs and move files around, but never edit anything. One office is currently accessing the other over Samba over ADSL, which is predictably unusably slow. I'm trying to use GlusterFS to replicate the tree between the two offices to provide both fast local reads and some level of link death resilience, and I'm hoping to layer Samba on top at either end. I'll resolve conflicts manually because they're so rare. I seriously considered Coda first, but Gluster just looks so much tidier and more lively and current as a project, and - except for this one problem - seems to do exactly what I want.

Thanks in advance for any light shed,

Zak

