On Sat, Apr 19, 2014 at 5:23 PM, dormando <[email protected]> wrote:

> > On Sat, Apr 19, 2014 at 2:45 PM, dormando <[email protected]> wrote:
> >       > One machine was an i7 with TSX, thus the lock elision segfaults.
> The other is a much older Core2 machine. Enough differences there to
> >       cause
> >       > problems, especially if we are dealing with threading-type
> things?
> >
> > Can you give me a summary of what the core2 machine gave you? I've built
> > on a core2duo and nehalem i7 and they all work fine. I've also torture
> > tested it on a brand new 16 core (2x8) xeon.
> >
> >
> > I ran the test suite on the Core2 a number of times (at least 5).
> Sometimes it completes without failure, other times I still get these two
> > failures. This is with `sleep 3` changed to `sleep 8`.
> >
> > #   Failed test 'slab1 now has 60 used chunks'
> > #   at t/lru-crawler.t line 57.
> > #          got: '90'
> > #     expected: '60'
> >
> > #   Failed test 'slab1 has 30 reclaims'
> > #   at t/lru-crawler.t line 59.
> > #          got: '0'
> > #     expected: '30'
> > # Looks like you failed 2 tests of 189.
> > t/lru-crawler.t ...... Dubious, test returned 2 (wstat 512, 0x200)
> > Failed 2/189 subtests
>
> Makes no goddamn sense. Maybe the fix below will.. fix it.
>

Once I wrapped my head around it, I figured this one out. This cheap patch
"fixes" the test, although I'm not sure it's the best actual solution.
Because the lru_crawler_running flag is set in the LRU thread itself rather
than on the main thread, we have a race condition here: pthread_create()
is by no means required to actually start or schedule the new thread right
away, so the test asks whether the LRU crawler is running before the
auxiliary thread has had time to mark itself as running. The sleep at
least gives that thread time to start.

(Debugged by adding a print-to-STDERR statement in the while(1) loop. The
only time I saw the test actually pass was when that loop caught and
repeated itself for a while; it failed when the loop ran only once, which
makes sense if the thread hadn't actually set the flag yet.)

diff --git a/t/lru-crawler.t b/t/lru-crawler.t
index 8c82623..9b1c7e7 100644
--- a/t/lru-crawler.t
+++ b/t/lru-crawler.t
@@ -47,6 +47,9 @@ is(scalar <$sock>, "OK\r\n", "enabled lru crawler");

 print $sock "lru_crawler crawl 1\r\n";
 is(scalar <$sock>, "OK\r\n", "kicked lru crawler");
+
+sleep 1;
+
 while (1) {
     my $stats = mem_stats($sock);
     last unless $stats->{lru_crawler_running};


>
> >
> >       > On the i7 machine, I think we're still experiencing segfaults.
> Running just the LRU test; note the two "undef" values showing up
> >       again:
> >       >
> >
> > Ok. I might still be goofing the lock somewhere. Can you see if memcached
> > is crashing at all during these tests? Inside the test script you can see
> > it's just a few raw commands to copy/paste and try yourself.
> >
> > You can also use an environment variable to start a memcached external to
> > the tests within a debugger:
> >     if ($ENV{T_MEMD_USE_DAEMON}) {
> >         my ($host, $port) = ($ENV{T_MEMD_USE_DAEMON} =~
> > m/^([^:]+):(\d+)$/);
> >
> > T_MEMD_USE_DAEMON="127.0.0.1:11211" or something, I think. haven't used
> > that in a while.
> >
> >
> > Simple repro, running standalone, no other commands have been issued:
> > $ nc localhost 11211
> > lru_crawler enable
> > OK
> > lru_crawler crawl 1
> > OK
> > lru_crawler disable
> >
> > SIGSEGV happens, here is the backtrace (surprised to see it in
> start_thread...):
> >
> > (gdb) bt
> > #0  0x00007ffff79881c8 in __lll_unlock_elision () from
> /usr/lib/libpthread.so.0
> > #1  0x00007ffff7982fc7 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /usr/lib/libpthread.so.0
> > #2  0x0000000000414f61 in item_crawler_thread (arg=<optimized out>) at
> items.c:771
> > #3  0x00007ffff797f0a2 in start_thread () from /usr/lib/libpthread.so.0
> > #4  0x00007ffff76b4d1d in clone () from /usr/lib/libc.so.6
> >
> > It does NOT segfault if you run enable immediately followed by disable,
> with no `crawl 1` in between.
>
> Good lord I suck at this. I really wish I could make that
> pthread_cond_wait "undefined" behavior actually error out so I don't test
> this on 3+ platforms and then have it error out elsewhere :/
>
> Just force-pushed this:
> https://github.com/dormando/memcached/tree/crawler_fix
>
> At some point I'd refactored it and didn't push the unlock far enough
> south. Now it actually unlocks when it's stopping the thread...
>

This makes things happy on the TSX-using machine. Awesome!


>
> Please try again. Wonder if I can somehow fund getting a haswell NUC
> bought just for my build VM's. Will TSX work within a VM..?
>
I don't know on this one.


> None of the other places I intend to run build VM's have lock elision...
>
> Thanks for your patience on this. It's been a huge help!
>
> --
>
> ---
> You received this message because you are subscribed to a topic in the
> Google Groups "memcached" group.
> To unsubscribe from this topic, visit
> https://groups.google.com/d/topic/memcached/Tw6t_W-a6Xc/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to
> [email protected].
> For more options, visit https://groups.google.com/d/optout.
>
