On Sat, Apr 19, 2014 at 2:18 PM, dormando <[email protected]> wrote:
> On Sat, 19 Apr 2014, Dan McGee wrote:
>
> > On Sat, Apr 19, 2014 at 1:45 PM, dormando <[email protected]> wrote:
> > > On Sat, Apr 19, 2014 at 12:43 PM, dormando <[email protected]> wrote:
> > > Well, that learns me for trying to write software without the 10+ VM
> > > buildbots...
> > >
> > > The i386 one, can you include the output of "stats settings", and also
> > > manually run: "lru_crawler enable" (or start with -o lru_crawler) then run
> > > "stats settings" again please? Really weird that it fails there, but not
> > > the lines before it looking for the "OK" while enabling it.
> > >
> > >
> > > As soon as I type "lru_crawler enable", memcached crashes. I see this in dmesg.
> > >
> > > [189571.108397] traps: memcached-debug[31776] general protection ip:f7749988 sp:f47ff2d8 error:0 in libpthread-2.19.so[f7739000+18000]
> > > [189969.840918] traps: memcached-debug[2600] general protection ip:7f976510a1c8 sp:7f976254aed8 error:0 in libpthread-2.19.so[7f97650f9000+18000]
> > > [195892.554754] traps: memcached-debug[31871] general protection ip:f76f0988 sp:f46ff2d8 error:0 in libpthread-2.19.so[f76e0000+18000]
> > >
> > > Starting with "-o lru_crawler" also crashes.
> > >
> > > [195977.276379] traps: memcached-debug[2182] general protection ip:f7738988 sp:f75782d8 error:0 in libpthread-2.19.so[f7728000+18000]
> > >
> > > This is running both 32 bit and 64 bit executables on the same build box;
> > > note in the above dmesg output that two of them appear to be from 32-bit
> > > processes, and we also see a crash in what looks a lot like a 64 bit
> > > pointer address, if I'm reading this right...
> >
> > Uhh... is your cross compile goofed?
> >
> > Any chance you could start the memcached-debug binary under gdb and then
> > crash it the same way? Get a full stack trace.
> >
> > Thinking if I even have a 32bit host left somewhere to test with... will
> > have to spin up the VM's later, but a stacktrace might be enlightening
> > anyway.
> >
> >
> > Program received signal SIGSEGV, Segmentation fault.
> > [Switching to Thread 0xf7dbfb40 (LWP 7)]
> > 0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
> > (gdb) bt
> > #0 0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
> > #1 0xf7f790e0 in __pthread_mutex_unlock_usercnt () from /usr/lib/libpthread.so.0
> > #2 0xf7f79bff in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
> > #3 0x08061bfe in item_crawler_thread ()
> > #4 0xf7f75f20 in start_thread () from /usr/lib/libpthread.so.0
> > #5 0xf7ead94e in clone () from /usr/lib/libc.so.6
>
> Holy crap lock elision. I have one machine with a haswell chip here, but
> I'll have to USB boot. Is getting an Arch liveimage especially time
> consuming?
>

Not at all; if you download the latest install ISO
(https://www.archlinux.org/download/) it is a live CD and you can boot
straight into an Arch environment. You can do an install if you want, or
just run live and install any necessary packages
(`pacman -S base-devel gdb`) and go from there.

> https://github.com/dormando/memcached/tree/crawler_fix
>
> Can you try this? The lock elision might've made my "undefined behavior"
> mistake of not holding a lock before initially waiting on the condition
> fatal.
>
> A further fix might be required, as it's possible someone could kill the
> do_etc flag before the thread fully starts and it'd drop out with the lock
> held. That would be an incredible feat though.
>

The good news here is now that we found our way to lock elision, both
64-bit and 32-bit builds (including one straight from git and outside the
normal packaging build machinery) blow up in the same place. No segfault
after applying this patch, so we've made progress.

> >
> > Thanks!
> >
> > >
> > > On the 64bit host, can you try increasing the sleep on t/lru-crawler.t:39
> > > from 3 to 8 and try again? I was trying to be clever but that may not be
> > > working out.
> > >
> > >
> > > Didn't change anything, same two failures with the same output listed.
> >
> > I feel like something's a bit different between your two tests. In the
> > first set, it's definitely not crashing for the 64bit test, but not
> > working either. Is something weird going on with the second set of tests?
> > You noted it seems to be running a 32bit binary still.
> >
> > I'm willing to ignore the 64-bit failures for now until we figure out
> > the 32-bit ones.
> >
> > In any case, I wouldn't blame the cross-compile or toolchain, these have
> > all been built in very clean, single architecture systemd-nspawn chroots.
>
> Thanks, I'm just trying to reason why it's failing in two different ways.
> The initial failure of finding 90 items when it expected 60 is a timing
> glitch, the other ones are this thread crashing the daemon.
>

One machine was an i7 with TSX, thus the lock elision segfaults. The other
is a much older Core2 machine. Enough differences there to cause problems,
especially if we are dealing with threading-type things?

On the i7 machine, I think we're still experiencing segfaults. Running just
the LRU test; note the two "undef" values showing up again:

$ prove t/lru-crawler.t
t/lru-crawler.t .. 93/189
# Failed test 'slab1 now has 60 used chunks'
# at t/lru-crawler.t line 57.
# got: '90'
# expected: '60'
# Failed test 'slab1 has 30 reclaims'
# at t/lru-crawler.t line 59.
# got: '0'
# expected: '30'
# Failed test 'disabled lru crawler'
# at t/lru-crawler.t line 69.
# got: undef
# expected: 'OK
# '
# Failed test at t/lru-crawler.t line 72.
# got: undef
# expected: 'no'
# Looks like you failed 4 tests of 189.
t/lru-crawler.t .. Dubious, test returned 4 (wstat 1024, 0x400)
Failed 4/189 subtests

Changing the `sleep 3` to `sleep 8` gives non-deterministic results; two
runs in a row were different.

$ prove t/lru-crawler.t
t/lru-crawler.t .. 93/189
# Failed test 'slab1 now has 60 used chunks'
# at t/lru-crawler.t line 57.
# got: '90'
# expected: '60'
# Failed test 'slab1 has 30 reclaims'
# at t/lru-crawler.t line 59.
# got: '0'
# expected: '30'
# Failed test 'ifoo29 == 'ok''
# at /home/dan/memcached/t/lib/MemcachedTest.pm line 59.
# got: undef
# expected: 'VALUE ifoo29 0 2
# ok
# END
# '
t/lru-crawler.t .. Failed 10/189 subtests

Test Summary Report
-------------------
t/lru-crawler.t (Wstat: 13 Tests: 182 Failed: 3)
  Failed tests: 96-97, 182
  Non-zero wait status: 13
  Parse errors: Bad plan. You planned 189 tests but ran 182.
Files=1, Tests=182, 8 wallclock secs ( 0.03 usr 0.00 sys + 0.04 cusr 0.00 csys = 0.07 CPU)
Result: FAIL

$ prove t/lru-crawler.t
t/lru-crawler.t .. 93/189
# Failed test 'slab1 now has 60 used chunks'
# at t/lru-crawler.t line 57.
# got: '90'
# expected: '60'
# Failed test 'slab1 has 30 reclaims'
# at t/lru-crawler.t line 59.
# got: '0'
# expected: '30'
# Failed test 'sfoo28 == <undef>'
# at /home/dan/memcached/t/lib/MemcachedTest.pm line 53.
# got: undef
# expected: 'END
# '
t/lru-crawler.t .. Failed 11/189 subtests

Test Summary Report
-------------------
t/lru-crawler.t (Wstat: 13 Tests: 181 Failed: 3)
  Failed tests: 96-97, 181
  Non-zero wait status: 13
  Parse errors: Bad plan. You planned 189 tests but ran 181.
Files=1, Tests=181, 8 wallclock secs ( 0.02 usr 0.00 sys + 0.03 cusr 0.00 csys = 0.05 CPU)
Result: FAIL

-Dan
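
For reference, here is a minimal sketch of the locking rule discussed earlier in the thread: the mutex passed to pthread_cond_wait() has to be locked by the calling thread before the first wait, otherwise the behavior is undefined for a default mutex. That would be consistent with the crash in __lll_unlock_elision in the gdb trace, though I have not verified the exact mechanism. This is not memcached's code and not the contents of the crawler_fix branch; every name here (demo_lock, demo_cond, demo_work_ready, demo_worker, run_demo) is made up for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical names throughout; this is an illustration, not memcached code. */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  demo_cond = PTHREAD_COND_INITIALIZER;
static bool demo_work_ready = false;  /* predicate, guarded by demo_lock */
static int  demo_wakeups = 0;         /* guarded by demo_lock */

static void *demo_worker(void *arg) {
    (void)arg;
    /* Take the mutex BEFORE the first wait. pthread_cond_wait() unlocks it
     * while sleeping and re-locks it before returning. */
    pthread_mutex_lock(&demo_lock);
    /* Loop on the predicate: this handles spurious wakeups and the case
     * where the signal was sent before this thread started waiting. */
    while (!demo_work_ready) {
        pthread_cond_wait(&demo_cond, &demo_lock);
    }
    demo_wakeups++;
    pthread_mutex_unlock(&demo_lock);
    return NULL;
}

/* Starts one worker, signals it once, and returns how many times the
 * worker woke up and saw the predicate set (expected: 1), or -1 on error. */
int run_demo(void) {
    pthread_t tid;

    /* Reset shared state so the demo can be called more than once. */
    pthread_mutex_lock(&demo_lock);
    demo_work_ready = false;
    demo_wakeups = 0;
    pthread_mutex_unlock(&demo_lock);

    if (pthread_create(&tid, NULL, demo_worker, NULL) != 0) {
        return -1;
    }

    /* Change the predicate and signal while holding the same mutex. */
    pthread_mutex_lock(&demo_lock);
    demo_work_ready = true;
    pthread_cond_signal(&demo_cond);
    pthread_mutex_unlock(&demo_lock);

    if (pthread_join(tid, NULL) != 0) {
        return -1;
    }
    /* Safe to read without the lock: the worker has been joined. */
    assert(demo_wakeups == 1);
    return demo_wakeups;
}
```

Build with -pthread and call run_demo() from a main() of your own. Because the predicate is checked under the lock, the worker exits cleanly whether the signal arrives before or after it starts waiting.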
