>       > Program received signal SIGSEGV, Segmentation fault.
>       > [Switching to Thread 0xf7dbfb40 (LWP 7)]
>       > 0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
>       > (gdb) bt
>       > #0  0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
>       > #1  0xf7f790e0 in __pthread_mutex_unlock_usercnt () from /usr/lib/libpthread.so.0
>       > #2  0xf7f79bff in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
>       > #3  0x08061bfe in item_crawler_thread ()
>       > #4  0xf7f75f20 in start_thread () from /usr/lib/libpthread.so.0
>       > #5  0xf7ead94e in clone () from /usr/lib/libc.so.6
>
> Holy crap, lock elision. I have one machine with a Haswell chip here, but
> I'll have to USB boot. Is getting an Arch live image especially time
> consuming?
>
>
> Not at all; if you download the latest install ISO
> (https://www.archlinux.org/download/) it is a live CD and you can boot
> straight into an Arch environment. You can do an install if you want, or just
> run live and install any necessary packages (`pacman -S base-devel gdb`) and
> go from there.

Okay, seems like I'll have to give it a shot since this still isn't
working well.
  
>
>       https://github.com/dormando/memcached/tree/crawler_fix
>
>       Can you try this? The lock elision might've made my "undefined behavior"
>       mistake of not holding a lock before initially waiting on the condition
>       fatal.
>
>       A further fix might be required, as it's possible someone could kill the
>       do_etc flag before the thread fully starts and it'd drop out with the lock
>       held. That would be an incredible feat though.
>
>
> The good news here is that, now that we've found our way to lock elision, both
> 64-bit and 32-bit builds (including one straight from git and outside the
> normal packaging build machinery) blow up in the same place. No segfault
> after applying this patch, so we've made progress.

I love progress.
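
For the record, the pattern the fix moves toward is the textbook one:
pthread_cond_wait() wants its mutex held going in, since it atomically
releases the lock while sleeping and re-acquires it on wakeup. Waiting
without holding the lock is undefined behavior, and with TSX lock elision
the eventual "unlock" of a mutex that was never really acquired can blow up
inside __lll_unlock_elision(), which matches the backtrace up top. Roughly
(a sketch with made-up names, not the actual item_crawler_thread code):

    /* Sketch only: hypothetical names, not the real crawler code. */
    #include <pthread.h>

    static pthread_mutex_t crawler_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  crawler_cond = PTHREAD_COND_INITIALIZER;
    static int keep_running = 1;             /* stand-in for the run flag */

    static void *crawler_thread(void *arg) {
        (void)arg;
        pthread_mutex_lock(&crawler_lock);   /* take the lock *before* waiting */
        while (keep_running) {
            /* atomically drops crawler_lock while asleep, re-locks on wakeup */
            pthread_cond_wait(&crawler_cond, &crawler_lock);
            /* ... do a crawl pass here, lock held ... */
        }
        pthread_mutex_unlock(&crawler_lock); /* only unlock what we actually hold */
        return NULL;
    }

The bad case is the old code effectively doing that unlock (via the cond
wait) on a mutex this thread never locked.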

>       >
>       >       Thanks!
>       >
>       >       >
>       >       >       On the 64bit host, can you try increasing the sleep on
>       >       >       t/lru-crawler.t:39 from 3 to 8 and try again? I was trying
>       >       >       to be clever but that may not be working out.
>       >       >
>       >       >
>       >       > Didn't change anything, same two failures with the same
>       >       > output listed.
>       >
>       > I feel like something's a bit different between your two tests. In the
>       > first set, it's definitely not crashing for the 64bit test, but not
>       > working either. Is something weird going on with the second set of
>       > tests? You noted it seems to be running a 32bit binary still.
>       >
>       > I'm willing to ignore the 64-bit failures for now until we figure out
>       > the 32-bit ones.
>       >
>       > In any case, I wouldn't blame the cross-compile or toolchain; these
>       > have all been built in very clean, single-architecture systemd-nspawn
>       > chroots.
>
> Thanks, I'm just trying to reason about why it's failing in two different ways.
> The initial failure of finding 90 items when it expected 60 is a timing
> glitch; the other ones are this thread crashing the daemon.
>
>
> One machine was an i7 with TSX, thus the lock elision segfaults. The other is
> a much older Core2 machine. Enough differences there to cause problems,
> especially if we are dealing with threading-type things?

Can you give me a summary of what the Core2 machine gave you? I've built
on a Core2 Duo and a Nehalem i7 and they both work fine. I've also
torture-tested it on a brand new 16-core (2x8) Xeon.

> On the i7 machine, I think we're still experiencing segfaults. Running just 
> the LRU test; note the two "undef" values showing up again:
>
> $ prove t/lru-crawler.t
> t/lru-crawler.t .. 93/189
> #   Failed test 'slab1 now has 60 used chunks'
> #   at t/lru-crawler.t line 57.
> #          got: '90'
> #     expected: '60'
>
> #   Failed test 'slab1 has 30 reclaims'
> #   at t/lru-crawler.t line 59.
> #          got: '0'
> #     expected: '30'
>
> #   Failed test 'disabled lru crawler'
> #   at t/lru-crawler.t line 69.
> #          got: undef
> #     expected: 'OK
> # '
>
> #   Failed test at t/lru-crawler.t line 72.
> #          got: undef
> #     expected: 'no'
> # Looks like you failed 4 tests of 189.
> t/lru-crawler.t .. Dubious, test returned 4 (wstat 1024, 0x400)
> Failed 4/189 subtests
>
>
> Changing the `sleep 3` to `sleep 8` gives non-deterministic results; two runs 
> in a row were different.
>
> $ prove t/lru-crawler.t
> t/lru-crawler.t .. 93/189
> #   Failed test 'slab1 now has 60 used chunks'
> #   at t/lru-crawler.t line 57.
> #          got: '90'
> #     expected: '60'
>
> #   Failed test 'slab1 has 30 reclaims'
> #   at t/lru-crawler.t line 59.
> #          got: '0'
> #     expected: '30'
>
> #   Failed test 'ifoo29 == 'ok''
> #   at /home/dan/memcached/t/lib/MemcachedTest.pm line 59.
> #          got: undef
> #     expected: 'VALUE ifoo29 0 2
> # ok
> # END
> # '
> t/lru-crawler.t .. Failed 10/189 subtests
>
> Test Summary Report
> -------------------
> t/lru-crawler.t (Wstat: 13 Tests: 182 Failed: 3)
>   Failed tests:  96-97, 182
>   Non-zero wait status: 13
>   Parse errors: Bad plan.  You planned 189 tests but ran 182.
> Files=1, Tests=182,  8 wallclock secs ( 0.03 usr  0.00 sys +  0.04 cusr  0.00 csys =  0.07 CPU)
> Result: FAIL
>
>
> $ prove t/lru-crawler.t
> t/lru-crawler.t .. 93/189
> #   Failed test 'slab1 now has 60 used chunks'
> #   at t/lru-crawler.t line 57.
> #          got: '90'
> #     expected: '60'
>
> #   Failed test 'slab1 has 30 reclaims'
> #   at t/lru-crawler.t line 59.
> #          got: '0'
> #     expected: '30'
>
> #   Failed test 'sfoo28 == <undef>'
> #   at /home/dan/memcached/t/lib/MemcachedTest.pm line 53.
> #          got: undef
> #     expected: 'END
> # '
> t/lru-crawler.t .. Failed 11/189 subtests
>
> Test Summary Report
> -------------------
> t/lru-crawler.t (Wstat: 13 Tests: 181 Failed: 3)
>   Failed tests:  96-97, 181
>   Non-zero wait status: 13
>   Parse errors: Bad plan.  You planned 189 tests but ran 181.
> Files=1, Tests=181,  8 wallclock secs ( 0.02 usr  0.00 sys +  0.03 cusr  0.00 csys =  0.05 CPU)
> Result: FAIL

Ok. I might still be goofing the lock somewhere. Can you see if memcached
is crashing at all during these tests? Inside the test script you can see
it's just a few raw commands to copy/paste and try yourself.
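
From memory, that boils down to the lru_crawler protocol commands: set a
handful of items with short TTLs, wait for them to expire, send
`lru_crawler crawl 1`, check `stats items` for the reclaim counts, then
`lru_crawler disable` at the end. Double-check the test itself for the
exact sequence though, I'm going from memory here.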

You can also use an environment variable to point the tests at a memcached
you've started externally (e.g. under a debugger):
    if ($ENV{T_MEMD_USE_DAEMON}) {
        my ($host, $port) = ($ENV{T_MEMD_USE_DAEMON} =~ m/^([^:]+):(\d+)$/);

T_MEMD_USE_DAEMON="127.0.0.1:11211" or something, I think. I haven't used
that in a while.
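
So something along these lines (double-check me on it): run memcached yourself
under gdb, e.g. `gdb --args ./memcached -p 11211`, then
`T_MEMD_USE_DAEMON=127.0.0.1:11211 prove t/lru-crawler.t` should make the test
talk to that instance instead of spawning its own.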

Thanks!
