Hi all,
I've been trying to track down a bug that looks like data corruption coming
from the Gluster FUSE client. I'm pretty new to Gluster, so perhaps there are
better ways to get to the bottom of this.
I have a distributed, replicated volume that was initialized as 3.5.1 and it
has been functioning fine in general.
Recently we discovered a write corruption issue that is always reproducible by
one of our binaries.
The scenarios is as follows:
There is one writer appending binary entries that have a sequence number and
checksum. Writes are small - 38 bytes each - and there is no fsync while the
file is streaming out.
If this is the only activity, everything is perfectly fine.
However, if that file is simultaneously opened for reading, in a tail-like
fashion, the data written into gluster is corrupt. Reading the file back out of
disk reveals a dropped message (a missing sequence number). This happens when
the reader and writer are different processes, so this is not a bug in our
userspace.
What is even more interesting it that the first run during the writing reports
the sequence number of the failure differently than all subsequent tests.
It feels like there is some kind of cache coherency issue here, but after
putting glusterfs into -L DEBUG and following along, I'm not closer to a
understanding the issue.
stracing glusterfs suggests that the missing records were actually sent to the
remote glusterfs daemons. I have appended that strace at the bottom.
For reference, here is a bit of strace that shows the overall data flow:
The file is opened for writing like so:
32684 open(".../logtest", O_WRONLY|O_CREAT, 0664) = 7
32684 fcntl(7, F_SETFD, FD_CLOEXEC) = 0
32684 fcntl(7, F_GETFL) = 0x8001 (flags O_WRONLY|O_LARGEFILE)
32684 fcntl(7, F_SETFL, O_WRONLY|O_APPEND|O_LARGEFILE) = 0
32686 write(7, "...", 38) = 38
32686 nanosleep({0, 1000000}, NULL) = 0
The file is opened for reading like so:
open(".../logtest", O_RDONLY) = 6
fstat(6, {st_mode=S_IFREG|0644, st_size=912, ...}) = 0
lseek(6, 0, SEEK_CUR) = 0
read(6, "..."..., 131072) = 912
read(6, "", 131072) = 0
lseek(6, 0, SEEK_SET) = 0
read(6, "..."..., 912) = 912
nanosleep({0, 10000000}, NULL) = 0
read(6, "...", 131072) = 114
read(6, "", 131072) = 0
lseek(6, 0, SEEK_SET) = 0
read(6, "..."..., 1026) = 1026
nanosleep({0, 10000000}, NULL) = 0
read(6, "...", 131072) = 114
read(6, "", 131072) = 0
lseek(6, 0, SEEK_SET) = 0
read(6, "..."..., 1140) = 1102
lseek(6, 38, SEEK_CUR) = 1140
nanosleep({0, 10000000}, NULL) = 0
read(6, "...", 131072) = 114
I have been trying to duplicate the failure with a small python script, but I
have as yet been unable to do so. This C++ binary does reproduce it quite
reliably, but the build system for this is quite complex and not readily
shareable.
I hope you can give me some advice about where to look next.
Thanks,
-Mike
Strace from glusterfs client. fd 22 a tcp socket connected to glusterfsd on a
remote host.
30049 writev(22, [{"\200\0\0\342", 4},
{"\0\0\0M\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\r\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0",
68},
{"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\1\n\0\0\0&\0\0\204\1\0\0\0J\0\0\0\2\0\0\0\31\0\0\0\2glusterfs.write-is-append\0000\0\0\0\0\27\0\0\0\2glusterfs.open-fd-count\0004\0\0\0",
120},
{"\31D\6#\0\0\0\6\23\221\33-\357\306\250@\0\0\0\5\0\0\0\0\273\\^\v8\356\243\372\10\1\30\0
\7", 38}], 4) = 230
This is the last record that shows up in the glusterfs file before the "gap" -
sequence number 7.
30049 writev(22, [{"\200\0\0\314", 4},
{"\0\0\0N\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\204\236\205QP\177\0\0\0\0\0\0\0\0\0\0",
68},
{"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0",
136}], 3) = 208
30049 writev(22, [{"\200\0\0\220", 4},
{"\0\0\0O\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\31\0\5\363\227\0\0\0L\0\0ul\0\0'\24\0\0\23\210\0\0\0\f\0\0\0\4\0\0\0\6\0\0\0.\0\0\0~\0\0\17\241\0\0\17\242\0\0\17\243\0\0\17\245\0\0\17\246\0\0\23\210\0\0'\20\0\0'\21\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0",
116}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\3\0\0\0\0", 28}], 3)
= 148
30049 writev(22, [{"\200\0\0\240", 4},
{"\0\0\0P\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0",
68},
{"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0010\0\0\0\0\0\0\0L\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0",
92}], 3) = 164
30049 writev(22, [{"\200\0\0\240", 4},
{"\0\0\0Q\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\204\236\205QP\177\0\0\0\0\0\0\0\0\0\0",
68},
{"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\2\0\0\0\2\0\0\0\0\0\0\0\0\0\0\1\n\0\0\0\0\0\0\0&\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0",
92}], 3 <unfinished ...>
30058 writev(22, [{"\200\0\0\240", 4},
{"\0\0\0R\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\10\264\205QP\177\0\0\0\0\0\0\0\0\0\0",
68},
{"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\0\0\0\1|\0\0\0\0\0\0\0&\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0",
92}], 3 <unfinished ...>
30049 writev(22, [{"\200\0\0\314", 4},
{"\0\0\0S\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0",
68},
{"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
136}], 3) = 208
30049 writev(22, [{"\200\0\0\314", 4},
{"\0\0\0T\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\10\264\205QP\177\0\0\0\0\0\0\0\0\0\0",
68},
{"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
136}], 3) = 208
30049 writev(22, [{"\200\0\1\10", 4},
{"\0\0\0U\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\r\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0",
68},
{"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0010\0\0\0L\0\0\204\1\0\0\0J\0\0\0\2\0\0\0\31\0\0\0\2glusterfs.write-is-append\0000\0\0\0\0\27\0\0\0\2glusterfs.open-fd-count\0004\0\0\0",
120},
{"\31D\6#\0\0\0\6\23\221\33-\357\350\336\20\0\0\0\5\0\0\0\0\345Ob/@P\226\233\10\1\30\0
\10\31D\6#\0\0\0\6\23\221\33-\357\373\321\240\0\0\0\5\0\0\0\0\27$\341,\303\324\25\202\10\1\30\0
\t", 76}], 4) = 268
30049 writev(22, [{"\200\0\0\342", 4},
{"\0\0\0V\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\r\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0",
68},
{"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\1|\0\0\0&\0\0\204\1\0\0\0J\0\0\0\2\0\0\0\31\0\0\0\2glusterfs.write-is-append\0000\0\0\0\0\27\0\0\0\2glusterfs.open-fd-count\0004\0\0\0",
120},
{"\31D\6#\0\0\0\6\23\221\33-\360\16\212\230\0\0\0\5\0\0\0\0\4t\22\330\323\r4\10\10\1\30\0
\n", 38}], 4) = 230
30049 writev(22, [{"\200\0\0\314", 4},
{"\0\0\0W\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0",
68},
{"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0",
136}], 3) = 208
30049 writev(22, [{"\200\0\0\314", 4},
{"\0\0\0X\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\10\264\205QP\177\0\0\0\0\0\0\0\0\0\0",
68},
{"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0",
136}], 3) = 208
30049 writev(22, [{"\200\0\0\240", 4},
{"\0\0\0Y\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\f\0\5\363\227\0\0\0L\0\0ul\0\0'\24\0\0\23\210\0\0\0\f\0\0\0\4\0\0\0\6\0\0\0.\0\0\0~\0\0\17\241\0\0\17\242\0\0\17\243\0\0\17\245\0\0\17\246\0\0\23\210\0\0'\20\0\0'\21\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0",
116},
{"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\3\0\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\0\0\0",
44}], 3) = 164
30049 writev(22, [{"\200\0\0\240", 4},
{"\0\0\0Z\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0",
68},
{"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\2\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0010\0\0\0\0\0\0\0L\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0",
92}], 3) = 164
30049 writev(22, [{"\200\0\0\240", 4},
{"\0\0\0[\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\10\264\205QP\177\0\0\0\0\0\0\0\0\0\0",
68},
{"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\2\0\0\0\2\0\0\0\0\0\0\0\0\0\0\1|\0\0\0\0\0\0\0&\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0",
92}], 3) = 164
30058 writev(22, [{"\200\0\0\240", 4},
{"\0\0\0\\\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0",
68},
{"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0010\0\0\0\0\0\0\0&\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0",
92}], 3) = 164
30049 writev(22, [{"\200\0\0\314", 4},
{"\0\0\0]\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0",
68},
{"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
136}], 3) = 208
This is the first record that shows up in the file after the gap - sequence
number 11. Somehow, records 8,9,10 are getting squashed.
30049 writev(22, [{"\200\0\0\342", 4},
{"\0\0\0^\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\r\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0",
68},
{"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0010\0\0\0&\0\0\204\1\0\0\0J\0\0\0\2\0\0\0\31\0\0\0\2glusterfs.write-is-append\0000\0\0\0\0\27\0\0\0\2glusterfs.open-fd-count\0004\0\0\0",
120},
{"\31D\6#\0\0\0\6\23\221\33-\360mc\330\0\0\0\5\0\0\0\0\366\37\221\333\365V\4\223\10\1\30\0
\v", 38}], 4 <unfinished ...>
30049 writev(22, [{"\200\0\0\200", 4},
{"\0\0\0_\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\20\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0",
68},
{"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\32\0\0\0\1\0\0\0\v\0\0\0\2batch-fsync\0001\0\0\0",
60}], 3) = 132
30049 writev(22, [{"\200\0\0\240", 4},
{"\0\0\0`\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10D\21\206QP\177\0\0\0\0\0\0\0\0\0\0",
68},
{"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\0\0\0\1V\0\0\0\0\0\0\0&\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0",
92}], 3) = 164
30049 writev(22, [{"\200\0\0\314", 4},
{"\0\0\0a\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10D\21\206QP\177\0\0\0\0\0\0\0\0\0\0",
68},
{"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
136}], 3) = 208
_______________________________________________
Gluster-users mailing list
[email protected]
http://supercolony.gluster.org/mailman/listinfo/gluster-users