Hi Pranith,
it's really a very interesting and hard to find bug, but I'm wondering
how we can prevent this to happen in the general case. There might be
other operations based on pointers to thread local storage that will
suffer this problem. Probably 'errno' is one of the most dangerous, but
TLS is also used to resolve 'THIS' in a very similar way to errno, and
there are temporary uuid and lkowner values also stored into TLS. More
things could appear in the future. These are also potential cases to
have problems with syncops.
An access to THIS before and after a syncop might trigger this bug, right ?
I think it's very difficult to track all these cases and handle them
correctly. Another solution could be to tell the compiler to forget
previous pointer values when a syncop is called, but I don't see any way
to do so. Since it's storing pointers to values, any barrier or volatile
seems useless.
Xavi
El 11/12/13 09:51, Pranith Kumar Karampuri ha escrit:
hi,
We found a day-1 bug when syncop_xxx() infra is used inside a synctask
with compilation optimization (CFLAGS -O2). This bug has been dormant for at
least 2 years.
There are around ~400(rebalance, replace-brick, bd, self-heal-daemon, quota,
fuse lock/fd migration) places where syncop is used in the code base all of
which are potential candidates which can take the hit.
I sent first round of patch at http://review.gluster.com/6475 to catch
regressions upstream.
These are the files that are affected by the changes I introduced to fix this:
api/src/glfs-fops.c | 36
++++++++++++++++++++++++++++++++++
api/src/glfs-handleops.c | 15 ++++++++++++++
api/src/glfs-internal.h | 7 +++++++
api/src/glfs-resolve.c | 10 ++++++++++
libglusterfs/src/syncop.c | 117
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------------
xlators/cluster/afr/src/afr-self-heald.c | 45
+++++++++++++++++++++++++++++-------------
xlators/cluster/afr/src/pump.c | 12 ++++++++++--
xlators/cluster/dht/src/dht-helper.c | 24 +++++++++++++++--------
xlators/cluster/dht/src/dht-rebalance.c | 168
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------------------------
xlators/cluster/dht/src/dht-selfheal.c | 6 ++++--
xlators/features/locks/src/posix.c | 3 ++-
xlators/features/qemu-block/src/bdrv-xlator.c | 15 ++++----------
xlators/features/qemu-block/src/qb-coroutines.c | 14 ++++++++++----
xlators/mount/fuse/src/fuse-bridge.c | 16 ++++++++++-----
Please review your respective component for these changes in gerrit.
Thanks
Pranith.
Detailed explanation of the Root cause:
We found the bug in 'gf_defrag_migrate_data' in rebalance operation:
Lets look at interesting parts of the function:
int
gf_defrag_migrate_data (xlator_t *this, gf_defrag_info_t *defrag, loc_t *loc,
dict_t *migrate_data)
{
.....
code section - [ Loop ]
while ((ret = syncop_readdirp (this, fd, 131072, offset, NULL,
&entries)) != 0) {
.....
code section - [ ERRNO-1 ] (errno of readdirp is stored in readdir_operrno by a
thread)
/* Need to keep track of ENOENT errno, that means, there is no
need to send more readdirp() */
readdir_operrno = errno;
.....
code section - [ SYNCOP-1 ] (syncop_getxattr is called by a thread)
ret = syncop_getxattr (this, &entry_loc, &dict,
GF_XATTR_LINKINFO_KEY);
code section - [ ERRNO-2] (checking for failures of syncop_getxattr(). This
may not always be executed in same thread which executed [SYNCOP-1])
if (ret < 0) {
if (errno != ENODATA) {
loglevel = GF_LOG_ERROR;
defrag->total_failures += 1;
.....
}
the function above could be executed by thread(t1) till [SYNCOP-1] and code
from [ERRNO-2] can be executed by a different thread(t2) because of the way
syncop-infra schedules the tasks.
when the code is compiled with -O2 optimization this is the assembly code that
is generated:
[ERRNO-1]
1165 readdir_operrno = errno; <<---- errno gets expanded
as *(__errno_location())
0x00007fd149d48b60 <+496>: callq 0x7fd149d410c0
<__errno_location@plt>
0x00007fd149d48b72 <+514>: mov %rax,0x50(%rsp) <<------ Address
returned by __errno_location() is stored in a special location in stack for later use.
0x00007fd149d48b77 <+519>: mov (%rax),%eax
0x00007fd149d48b79 <+521>: mov %eax,0x78(%rsp)
....
[ERRNO-2]
1281 if (errno != ENODATA) {
0x00007fd149d492ae <+2366>: mov 0x50(%rsp),%rax <<----- Because
it already stored the address returned by __errno_location(), it just dereferences the
address to get the errno value. BUT THIS CODE NEED NOT BE EXECUTED BY SAME THREAD!!!
0x00007fd149d492b3 <+2371>: mov $0x9,%ebp
0x00007fd149d492b8 <+2376>: mov (%rax),%edi
0x00007fd149d492ba <+2378>: cmp $0x3d,%edi
The problem is that __errno_location() value of t1 and t2 are different. So
[ERRNO-2] ends up reading errno of t1 instead of errno of t2 even though t2 is
executing [ERRNO-2] code section.
When code is compiled without any optimization for [ERRNO-2]:
1281 if (errno != ENODATA) {
0x00007fd58e7a326f <+2237>: callq 0x7fd58e797300
<__errno_location@plt><<--- As it is calling __errno_location() again it gets the
location from t2 so it works as intended.
0x00007fd58e7a3274 <+2242>: mov (%rax),%eax
0x00007fd58e7a3276 <+2244>: cmp $0x3d,%eax
0x00007fd58e7a3279 <+2247>: je 0x7fd58e7a32a1
<gf_defrag_migrate_data+2287>
Fix:
We decided to make syncop_xxx() return (-errno) value as the return value in
case of errors and all the functions which make syncop_xxx() will need to use
(-ret) to figure out the reason for failure in case of syncop_xxx() failures.
Pranith
_______________________________________________
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel
_______________________________________________
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel