On Sat, Apr 19, 2014 at 2:18 PM, dormando <[email protected]> wrote:

> On Sat, 19 Apr 2014, Dan McGee wrote:
>
> > On Sat, Apr 19, 2014 at 1:45 PM, dormando <[email protected]> wrote:
> >       > On Sat, Apr 19, 2014 at 12:43 PM, dormando <[email protected]> wrote:
> >       >       Well, that learns me for trying to write software without the 10+ VM
> >       >       buildbots...
> >       >
> >       >       The i386 one, can you include the output of "stats settings", and also
> >       >       manually run: "lru_crawler enable" (or start with -o lru_crawler) then run
> >       >       "stats settings" again please? Really weird that it fails there, but not
> >       >       the lines before it looking for the "OK" while enabling it.
> >       >
> >       >
> >       > As soon as I type "lru_crawler enable", memcached crashes. I see this in dmesg.
> >       >
> >       > [189571.108397] traps: memcached-debug[31776] general protection ip:f7749988 sp:f47ff2d8 error:0 in libpthread-2.19.so[f7739000+18000]
> >       > [189969.840918] traps: memcached-debug[2600] general protection ip:7f976510a1c8 sp:7f976254aed8 error:0 in libpthread-2.19.so[7f97650f9000+18000]
> >       > [195892.554754] traps: memcached-debug[31871] general protection ip:f76f0988 sp:f46ff2d8 error:0 in libpthread-2.19.so[f76e0000+18000]
> >       >
> >       > Starting with "-o lru_crawler" also crashes.
> >       >
> >       > [195977.276379] traps: memcached-debug[2182] general protection ip:f7738988 sp:f75782d8 error:0 in libpthread-2.19.so[f7728000+18000]
> >       >
> >       > This is running both 32 bit and 64 bit executables on the same build box; note in the
> >       > above dmesg output that two of them appear to be from 32-bit processes, and we also see
> >       > a crash in what looks a lot like a 64 bit pointer address, if I'm reading this right...
> >
> > Uhh... is your cross compile goofed?
> >
> > Any chance you could start the memcached-debug binary under gdb and then
> > crash it the same way? Get a full stack trace.
> >
> > Thinking if I even have a 32bit host left somewhere to test with... will
> > have to spin up the VM's later, but a stacktrace might be enlightening
> > anyway.
> >
> >
> > Program received signal SIGSEGV, Segmentation fault.
> > [Switching to Thread 0xf7dbfb40 (LWP 7)]
> > 0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
> > (gdb) bt
> > #0  0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
> > #1  0xf7f790e0 in __pthread_mutex_unlock_usercnt () from /usr/lib/libpthread.so.0
> > #2  0xf7f79bff in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
> > #3  0x08061bfe in item_crawler_thread ()
> > #4  0xf7f75f20 in start_thread () from /usr/lib/libpthread.so.0
> > #5  0xf7ead94e in clone () from /usr/lib/libc.so.6
>
> Holy crap lock elision. I have one machine with a haswell chip here, but
> I'll have to USB boot. Is getting an Arch liveimage especially time
> consuming?
>

Not at all. The latest install ISO (https://www.archlinux.org/download/)
is a live CD, so you can boot straight into an Arch environment. You can
do an install if you want, or just run live, install any necessary
packages (`pacman -S base-devel gdb`), and go from there.


>
> https://github.com/dormando/memcached/tree/crawler_fix
>
> Can you try this? The lock elision might've made my "undefined behavior"
> mistake of not holding a lock before initially waiting on the condition
> fatal.
>
> A further fix might be required, as it's possible someone could kill the
> do_etc flag before the thread fully starts and it'd drop out with the lock
> held. That would be an incredible feat though.
>

The good news is that, now that we have found our way to lock elision, both
the 64-bit and 32-bit builds (including one built straight from git, outside
the normal packaging build machinery) blow up in the same place. With this
patch applied there is no segfault, so we've made progress.
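
For anyone following along, here is a minimal sketch of the rule the patch
seems to be restoring: the waiting thread takes the mutex before its first
pthread_cond_wait() and keeps holding it whenever it checks the predicate.
This is not the memcached code. The struct, the function names, and the
stop/work_ready flags are all made up for illustration, and I am assuming
the fix on the crawler_fix branch has roughly this shape.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the crawler state; not memcached's real struct. */
struct crawler {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            work_ready; /* predicate, protected by lock */
    bool            stop;       /* exit request, protected by lock */
    int             passes;     /* wakeups that found work */
};

static void *crawler_thread(void *arg)
{
    struct crawler *c = arg;

    /* Take the mutex BEFORE the first wait. Calling pthread_cond_wait()
     * without owning the mutex is undefined behavior. */
    pthread_mutex_lock(&c->lock);
    for (;;) {
        /* Loop on the predicate: wakeups can be spurious. */
        while (!c->work_ready && !c->stop)
            pthread_cond_wait(&c->cond, &c->lock);
        if (!c->work_ready)
            break;              /* stop requested, nothing pending */
        c->work_ready = false;
        c->passes++;
    }
    pthread_mutex_unlock(&c->lock); /* never exit with the lock held */
    return NULL;
}

/* Change the shared flags under the lock, then wake the thread. */
static void crawler_post(struct crawler *c, bool stop)
{
    pthread_mutex_lock(&c->lock);
    if (stop)
        c->stop = true;
    else
        c->work_ready = true;
    pthread_cond_signal(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

/* Start the thread, hand it one unit of work, stop it, report passes. */
int crawler_demo(void)
{
    struct crawler c = { .work_ready = false, .stop = false, .passes = 0 };
    pthread_t tid;
    int rc;

    pthread_mutex_init(&c.lock, NULL);
    pthread_cond_init(&c.cond, NULL);
    rc = pthread_create(&tid, NULL, crawler_thread, &c);
    assert(rc == 0);
    (void)rc;

    crawler_post(&c, false);
    crawler_post(&c, true);
    pthread_join(tid, NULL);

    pthread_cond_destroy(&c.cond);
    pthread_mutex_destroy(&c.lock);
    return c.passes;
}
```

Built with -pthread, crawler_demo() returns 1 regardless of whether the
thread reaches its first wait before or after the two posts, because both
flags are only ever read and written under the lock.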


> >
> >       Thanks!
> >
> >       >
> >       >       On the 64bit host, can you try increasing the sleep on t/lru-crawler.t:39
> >       >       from 3 to 8 and try again? I was trying to be clever but that may not be
> >       >       working out.
> >       >
> >       >
> >       > Didn't change anything, same two failures with the same output listed.
> >
> > I feel like something's a bit different between your two tests. In the
> > first set, it's definitely not crashing for the 64bit test, but not
> > working either. Is something weird going on with the second set of tests?
> > You noted it seems to be running a 32bit binary still.
> >
> > I'm willing to ignore the 64-bit failures for now until we figure out the 32-bit ones.
> >
> > In any case, I wouldn't blame the cross-compile or toolchain, these have all been
> > built in very clean, single architecture systemd-nspawn chroots.
>
> Thanks, I'm just trying to reason why it's failing in two different ways.
> The initial failure of finding 90 items when it expected 60 is a timing
> glitch, the other ones are this thread crashing the daemon.
>

One machine is an i7 with TSX, hence the lock elision segfaults. The other
is a much older Core2 machine. Are those differences enough to cause
problems, especially when we are dealing with threading-type things?
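
In case it helps to tell the two kinds of box apart, here is a rough sketch
of checking whether a CPU advertises TSX. It reads CPUID leaf 7, subleaf 0,
where EBX bit 4 is HLE and bit 11 is RTM. The helper names are made up, and
I am assuming GCC's or clang's <cpuid.h>. A set bit only says the CPU offers
the feature; it does not say whether the installed glibc was built to use
lock elision.

```c
#include <stdbool.h>
#include <stdint.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   /* GCC/clang helper header; assumed available */
#endif

/* CPUID.(EAX=7,ECX=0):EBX feature bits for TSX. */
#define CPUID7_EBX_HLE (1u << 4)
#define CPUID7_EBX_RTM (1u << 11)

/* Hypothetical helpers: decode a leaf-7 EBX value. */
static bool ebx_has_hle(uint32_t ebx) { return (ebx & CPUID7_EBX_HLE) != 0; }
static bool ebx_has_rtm(uint32_t ebx) { return (ebx & CPUID7_EBX_RTM) != 0; }

/* Return leaf-7 EBX, or 0 when the leaf (or the architecture) lacks it. */
static uint32_t read_cpuid7_ebx(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    if (__get_cpuid_max(0, 0) < 7)
        return 0;
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    (void)eax; (void)ecx; (void)edx;
    return ebx;
#else
    return 0;
#endif
}

/* True when the running CPU reports either TSX interface. */
bool cpu_reports_tsx(void)
{
    uint32_t ebx = read_cpuid7_ebx();
    return ebx_has_hle(ebx) || ebx_has_rtm(ebx);
}
```

On Linux the same information shows up as the "hle" and "rtm" flags in
/proc/cpuinfo, which is quicker if you only need to eyeball it.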

On the i7 machine, I think we're still experiencing segfaults. Running just
the LRU test; note the two "undef" values showing up again:

$ prove t/lru-crawler.t
t/lru-crawler.t .. 93/189
#   Failed test 'slab1 now has 60 used chunks'
#   at t/lru-crawler.t line 57.
#          got: '90'
#     expected: '60'

#   Failed test 'slab1 has 30 reclaims'
#   at t/lru-crawler.t line 59.
#          got: '0'
#     expected: '30'

#   Failed test 'disabled lru crawler'
#   at t/lru-crawler.t line 69.
#          got: undef
#     expected: 'OK
# '

#   Failed test at t/lru-crawler.t line 72.
#          got: undef
#     expected: 'no'
# Looks like you failed 4 tests of 189.
t/lru-crawler.t .. Dubious, test returned 4 (wstat 1024, 0x400)
Failed 4/189 subtests


Changing the `sleep 3` to `sleep 8` gives non-deterministic results; two
runs in a row were different.

$ prove t/lru-crawler.t
t/lru-crawler.t .. 93/189
#   Failed test 'slab1 now has 60 used chunks'
#   at t/lru-crawler.t line 57.
#          got: '90'
#     expected: '60'

#   Failed test 'slab1 has 30 reclaims'
#   at t/lru-crawler.t line 59.
#          got: '0'
#     expected: '30'

#   Failed test 'ifoo29 == 'ok''
#   at /home/dan/memcached/t/lib/MemcachedTest.pm line 59.
#          got: undef
#     expected: 'VALUE ifoo29 0 2
# ok
# END
# '
t/lru-crawler.t .. Failed 10/189 subtests

Test Summary Report
-------------------
t/lru-crawler.t (Wstat: 13 Tests: 182 Failed: 3)
  Failed tests:  96-97, 182
  Non-zero wait status: 13
  Parse errors: Bad plan.  You planned 189 tests but ran 182.
Files=1, Tests=182,  8 wallclock secs ( 0.03 usr  0.00 sys +  0.04 cusr  0.00 csys =  0.07 CPU)
Result: FAIL


$ prove t/lru-crawler.t
t/lru-crawler.t .. 93/189
#   Failed test 'slab1 now has 60 used chunks'
#   at t/lru-crawler.t line 57.
#          got: '90'
#     expected: '60'

#   Failed test 'slab1 has 30 reclaims'
#   at t/lru-crawler.t line 59.
#          got: '0'
#     expected: '30'

#   Failed test 'sfoo28 == <undef>'
#   at /home/dan/memcached/t/lib/MemcachedTest.pm line 53.
#          got: undef
#     expected: 'END
# '
t/lru-crawler.t .. Failed 11/189 subtests

Test Summary Report
-------------------
t/lru-crawler.t (Wstat: 13 Tests: 181 Failed: 3)
  Failed tests:  96-97, 181
  Non-zero wait status: 13
  Parse errors: Bad plan.  You planned 189 tests but ran 181.
Files=1, Tests=181,  8 wallclock secs ( 0.02 usr  0.00 sys +  0.03 cusr  0.00 csys =  0.05 CPU)
Result: FAIL

-Dan
