I have found that the O_APPEND flag is key to this failure - I had overlooked that flag when reading the strace and trying to cobble up a minimal reproduction.
I now have a small pair of python scripts that can reliably reproduce this failure. On Sep 5, 2014, at 11:11 AM, mike <m...@luminatewireless.com> wrote: > Hi all, > > I've been trying to track down a bug that looks like data corruption coming > from the Gluster FUSE client. I'm pretty new to Gluster, so perhaps there are > better ways to get to the bottom of this. > > I have a distributed, replicated volume that was initialized as 3.5.1 and it > has been functioning fine in general. > > Recently we discovered a write corruption issue that is always reproducible > by one of our binaries. > > The scenarios is as follows: > > There is one writer appending binary entries that have a sequence number and > checksum. Writes are small - 38 bytes each - and there is no fsync while the > file is streaming out. > > If this is the only activity, everything is perfectly fine. > > However, if that file is simultaneously opened for reading, in a tail-like > fashion, the data written into gluster is corrupt. Reading the file back out > of disk reveals a dropped message (a missing sequence number). This happens > when the reader and writer are different processes, so this is not a bug in > our userspace. > > What is even more interesting it that the first run during the writing > reports the sequence number of the failure differently than all subsequent > tests. > > It feels like there is some kind of cache coherency issue here, but after > putting glusterfs into -L DEBUG and following along, I'm not closer to a > understanding the issue. > > stracing glusterfs suggests that the missing records were actually sent to > the remote glusterfs daemons. I have appended that strace at the bottom. > > For reference, here is a bit of strace that shows the overall data flow: > > The file is opened for writing like so: > > 32684 open(".../logtest", O_WRONLY|O_CREAT, 0664) = 7 > 32684 fcntl(7, F_SETFD, FD_CLOEXEC) = 0 > 32684 fcntl(7, F_GETFL) = 0x8001 (flags O_WRONLY|O_LARGEFILE) > 32684 fcntl(7, F_SETFL, O_WRONLY|O_APPEND|O_LARGEFILE) = 0 > 32686 write(7, "...", 38) = 38 > 32686 nanosleep({0, 1000000}, NULL) = 0 > > > The file is opened for reading like so: > > open(".../logtest", O_RDONLY) = 6 > fstat(6, {st_mode=S_IFREG|0644, st_size=912, ...}) = 0 > lseek(6, 0, SEEK_CUR) = 0 > read(6, "..."..., 131072) = 912 > read(6, "", 131072) = 0 > lseek(6, 0, SEEK_SET) = 0 > read(6, "..."..., 912) = 912 > nanosleep({0, 10000000}, NULL) = 0 > read(6, "...", 131072) = 114 > read(6, "", 131072) = 0 > lseek(6, 0, SEEK_SET) = 0 > read(6, "..."..., 1026) = 1026 > nanosleep({0, 10000000}, NULL) = 0 > read(6, "...", 131072) = 114 > read(6, "", 131072) = 0 > lseek(6, 0, SEEK_SET) = 0 > read(6, "..."..., 1140) = 1102 > lseek(6, 38, SEEK_CUR) = 1140 > nanosleep({0, 10000000}, NULL) = 0 > read(6, "...", 131072) = 114 > > I have been trying to duplicate the failure with a small python script, but I > have as yet been unable to do so. This C++ binary does reproduce it quite > reliably, but the build system for this is quite complex and not readily > shareable. > > I hope you can give me some advice about where to look next. > > Thanks, > -Mike > > > > Strace from glusterfs client. fd 22 a tcp socket connected to glusterfsd on a > remote host. > > > 30049 writev(22, [{"\200\0\0\342", 4}, > {"\0\0\0M\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\r\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", > 68}, > {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\1\n\0\0\0&\0\0\204\1\0\0\0J\0\0\0\2\0\0\0\31\0\0\0\2glusterfs.write-is-append\0000\0\0\0\0\27\0\0\0\2glusterfs.open-fd-count\0004\0\0\0", > 120}, > {"\31D\6#\0\0\0\6\23\221\33-\357\306\250@\0\0\0\5\0\0\0\0\273\\^\v8\356\243\372\10\1\30\0 > \7", 38}], 4) = 230 > > This is the last record that shows up in the glusterfs file before the "gap" > - sequence number 7. > > 30049 writev(22, [{"\200\0\0\314", 4}, > {"\0\0\0N\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\204\236\205QP\177\0\0\0\0\0\0\0\0\0\0", > 68}, > {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0", > 136}], 3) = 208 > 30049 writev(22, [{"\200\0\0\220", 4}, > {"\0\0\0O\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\31\0\5\363\227\0\0\0L\0\0ul\0\0'\24\0\0\23\210\0\0\0\f\0\0\0\4\0\0\0\6\0\0\0.\0\0\0~\0\0\17\241\0\0\17\242\0\0\17\243\0\0\17\245\0\0\17\246\0\0\23\210\0\0'\20\0\0'\21\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", > 116}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\3\0\0\0\0", 28}], > 3) = 148 > 30049 writev(22, [{"\200\0\0\240", 4}, > {"\0\0\0P\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0", > 68}, > {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0010\0\0\0\0\0\0\0L\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0", > 92}], 3) = 164 > 30049 writev(22, [{"\200\0\0\240", 4}, > {"\0\0\0Q\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\204\236\205QP\177\0\0\0\0\0\0\0\0\0\0", > 68}, > {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\2\0\0\0\2\0\0\0\0\0\0\0\0\0\0\1\n\0\0\0\0\0\0\0&\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0", > 92}], 3 <unfinished ...> > 30058 writev(22, [{"\200\0\0\240", 4}, > {"\0\0\0R\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\10\264\205QP\177\0\0\0\0\0\0\0\0\0\0", > 68}, > {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\0\0\0\1|\0\0\0\0\0\0\0&\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0", > 92}], 3 <unfinished ...> > 30049 writev(22, [{"\200\0\0\314", 4}, > {"\0\0\0S\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0", > 68}, > {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", > 136}], 3) = 208 > 30049 writev(22, [{"\200\0\0\314", 4}, > {"\0\0\0T\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\10\264\205QP\177\0\0\0\0\0\0\0\0\0\0", > 68}, > {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", > 136}], 3) = 208 > 30049 writev(22, [{"\200\0\1\10", 4}, > {"\0\0\0U\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\r\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", > 68}, > {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0010\0\0\0L\0\0\204\1\0\0\0J\0\0\0\2\0\0\0\31\0\0\0\2glusterfs.write-is-append\0000\0\0\0\0\27\0\0\0\2glusterfs.open-fd-count\0004\0\0\0", > 120}, > {"\31D\6#\0\0\0\6\23\221\33-\357\350\336\20\0\0\0\5\0\0\0\0\345Ob/@P\226\233\10\1\30\0 > > \10\31D\6#\0\0\0\6\23\221\33-\357\373\321\240\0\0\0\5\0\0\0\0\27$\341,\303\324\25\202\10\1\30\0 > \t", 76}], 4) = 268 > 30049 writev(22, [{"\200\0\0\342", 4}, > {"\0\0\0V\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\r\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", > 68}, > {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\1|\0\0\0&\0\0\204\1\0\0\0J\0\0\0\2\0\0\0\31\0\0\0\2glusterfs.write-is-append\0000\0\0\0\0\27\0\0\0\2glusterfs.open-fd-count\0004\0\0\0", > 120}, > {"\31D\6#\0\0\0\6\23\221\33-\360\16\212\230\0\0\0\5\0\0\0\0\4t\22\330\323\r4\10\10\1\30\0 > \n", 38}], 4) = 230 > 30049 writev(22, [{"\200\0\0\314", 4}, > {"\0\0\0W\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0", > 68}, > {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0", > 136}], 3) = 208 > 30049 writev(22, [{"\200\0\0\314", 4}, > {"\0\0\0X\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\10\264\205QP\177\0\0\0\0\0\0\0\0\0\0", > 68}, > {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0", > 136}], 3) = 208 > 30049 writev(22, [{"\200\0\0\240", 4}, > {"\0\0\0Y\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\f\0\5\363\227\0\0\0L\0\0ul\0\0'\24\0\0\23\210\0\0\0\f\0\0\0\4\0\0\0\6\0\0\0.\0\0\0~\0\0\17\241\0\0\17\242\0\0\17\243\0\0\17\245\0\0\17\246\0\0\23\210\0\0'\20\0\0'\21\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", > 116}, > {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\3\0\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\0\0\0", > 44}], 3) = 164 > 30049 writev(22, [{"\200\0\0\240", 4}, > {"\0\0\0Z\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0", > 68}, > {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\2\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0010\0\0\0\0\0\0\0L\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0", > 92}], 3) = 164 > 30049 writev(22, [{"\200\0\0\240", 4}, > {"\0\0\0[\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\10\264\205QP\177\0\0\0\0\0\0\0\0\0\0", > 68}, > {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\2\0\0\0\2\0\0\0\0\0\0\0\0\0\0\1|\0\0\0\0\0\0\0&\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0", > 92}], 3) = 164 > 30058 writev(22, [{"\200\0\0\240", 4}, > {"\0\0\0\\\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0", > 68}, > {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0010\0\0\0\0\0\0\0&\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0", > 92}], 3) = 164 > 30049 writev(22, [{"\200\0\0\314", 4}, > {"\0\0\0]\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0", > 68}, > {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", > 136}], 3) = 208 > > This is the first record that shows up in the file after the gap - sequence > number 11. Somehow, records 8,9,10 are getting squashed. > > 30049 writev(22, [{"\200\0\0\342", 4}, > {"\0\0\0^\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\r\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", > 68}, > {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0010\0\0\0&\0\0\204\1\0\0\0J\0\0\0\2\0\0\0\31\0\0\0\2glusterfs.write-is-append\0000\0\0\0\0\27\0\0\0\2glusterfs.open-fd-count\0004\0\0\0", > 120}, > {"\31D\6#\0\0\0\6\23\221\33-\360mc\330\0\0\0\5\0\0\0\0\366\37\221\333\365V\4\223\10\1\30\0 > \v", 38}], 4 <unfinished ...> > 30049 writev(22, [{"\200\0\0\200", 4}, > {"\0\0\0_\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\20\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0", > 68}, > {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\32\0\0\0\1\0\0\0\v\0\0\0\2batch-fsync\0001\0\0\0", > 60}], 3) = 132 > 30049 writev(22, [{"\200\0\0\240", 4}, > {"\0\0\0`\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10D\21\206QP\177\0\0\0\0\0\0\0\0\0\0", > 68}, > {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\0\0\0\1V\0\0\0\0\0\0\0&\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0", > 92}], 3) = 164 > 30049 writev(22, [{"\200\0\0\314", 4}, > {"\0\0\0a\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10D\21\206QP\177\0\0\0\0\0\0\0\0\0\0", > 68}, > {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", > 136}], 3) = 208 > > _______________________________________________ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users