[re-adding the list for the record]

On 07/04/2014 19:53, Kevin Greenan wrote:
> Hey Loic,
>
> BTW, you can get an illegal instruction fault if you are calling an intrinsic that is not supported on a particular platform. Is the code being compiled on a platform that is different than the machines in your test harness?
>

The plugin is compiled with three kinds of flags:

https://github.com/ceph/ceph/blob/firefly/src/erasure-code/jerasure/Makefile.am#L50

At runtime the appropriate binary is loaded depending on the CPU features

https://github.com/ceph/ceph/blob/firefly/src/erasure-code/jerasure/ErasureCodePluginSelectJerasure.cc#L42

and the logs confirm that jerasure_sse4 is used in this particular case. All tests were run on machines tested to have the required CPU features in

https://github.com/ceph/ceph/blob/firefly/src/arch/intel.c#L10

Do you see something missing?

Cheers

> -kevin
>
>
> On Sun, Apr 6, 2014 at 12:06 PM, Loic Dachary <[email protected]> wrote:
>
>
>
> On 06/04/2014 18:28, Kevin Greenan wrote:
> > Hey Loic,
> >
> > Did this stuff start happening after a specific commit (or commits)? I see this bug was opened 6 days ago and some changes to your fork as of 7 days ago...
> >
> > Or is this the first time you have run these tests with the new Jerasure backend?
>
> It's the first time we run tests with gf-complete / jerasure optimized (i.e. all flags from https://github.com/ceph/ceph/blob/master/m4/ax_intel.m4 are set because the compiler knows how and it's targeting x86_64). Before that and during three or four weeks we ran jerasure / gf-complete without any optimization. Before that we ran the previous jerasure version without gf-complete.
>
> Cheers
>
> >
> > Thanks,
> > -kevin
> >
> >
> > On Apr 6, 2014, at 3:12 AM, Loic Dachary wrote:
> >
> >> Hi,
> >>
> >> An illegal instruction this time http://tracker.ceph.com/issues/7914#note-31 . Since the workload is slightly different, I'm trying to run it 30 times and see if that triggers the problem.
> >>
> >> Cheers
> >>
> >> On 02/04/2014 20:15, Kevin Greenan wrote:
> >>> OK, it looks like this happens when the GF backend is first initialized (unless, like Loic pointed out, something is corrupted).
> >>>
> >>> Is this consistently happening for carry-free multiply and w=32 (i.e. gf_w32_cfm_init)?
> >>>
> >>> Can you send me a core + binary, so I can dig in gdb?
> >>>
> >>> -kevin
> >>>
> >>>
> >>> On Wed, Apr 2, 2014 at 11:01 AM, Sage Weil <[email protected]> wrote:
> >>>
> >>> On Wed, 2 Apr 2014, Loic Dachary wrote:
> >>>>
> >>>>
> >>>> On 02/04/2014 19:44, Kevin Greenan wrote:
> >>>>> Hey Loic,
> >>>>>
> >>>>> Are you ensuring that Jerasure (actually gf-complete) is getting memory buffers aligned on 16-byte boundaries? Without looking too deep, that is the first thing I would check.
> >>>>>
> >>>>
> >>>> Yes
> >>>>
> >>>> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L32
> >>>> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L242
> >>>> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L65
> >>>> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L108
> >>>>
> >>>
> >>> In this case they are 2K aligned:
> >>>
> >>> (gdb) p data_ptrs[0]
> >>> $1 = 0x3e46000 "I'm the", ' ' <repeats 16 times>, "3th object!"
> >>> (gdb) p data_ptrs[1]
> >>> $2 = 0x3e46800 'z' <repeats 200 times>...
> >>> (gdb) p coding_ptrs[0]
> >>> $3 = 0x338e000 "I'm the", ' ' <repeats 16 times>, "3th object!"
> >>>
> >>> sage
> >>>
> >>>> I'll re-read this logic tomorrow just to be sure.
> >>>>
> >>>> Cheers
> >>>>
> >>>>> I can have a deeper look later today or tomorrow.
> >>>>>
> >>>>> -kevin
> >>>>>
> >>>>>
> >>>>> On Wed, Apr 2, 2014 at 10:35 AM, Loic Dachary <[email protected]> wrote:
> >>>>>
> >>>>> Hi Kevin,
> >>>>>
> >>>>> In the context of http://tracker.ceph.com/issues/7914 we're trying to figure out why jerasure dumps core. We don't know how to reproduce it yet (ran dozens of identical tests suites with no such crash in the past few days, which is to be expected for rare bugs because the test suite introduces random errors / failures on purpose).
> >>>>>
> >>>>> The full stack trace is at http://tracker.ceph.com/issues/7914#note-24 but the relevant part is here:
> >>>>>
> >>>>> #0 0x00007f4756779b7b in raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
> >>>>> #1 0x0000000000981b4e in reraise_fatal (signum=11) at global/signal_handler.cc:59
> >>>>> #2 handle_fatal_signal (signum=11) at global/signal_handler.cc:105
> >>>>> #3 <signal handler called>
> >>>>> #4 0x0000000000000000 in ?? ()
> >>>>> #5 0x00007f47385ae6b1 in jerasure_matrix_dotprod (k=2, w=8, matrix_row=0x31513a8, src_ids=0x0, dest_id=<optimized out>, data_ptrs=0x7f4741ec7a00, coding_ptrs=0x7f4741ec7a10, size=2048) at erasure-code/jerasure/jerasure/src/jerasure.c:607
> >>>>> #6 0x00007f47385ae7d6 in jerasure_matrix_encode (k=2, m=1, w=8, matrix=<optimized out>, data_ptrs=0x7f4741ec7a00, coding_ptrs=0x7f4741ec7a10, size=2048) at erasure-code/jerasure/jerasure/src/jerasure.c:310
> >>>>> ...
> >>>>>
> >>>>> Note that this jerasure/gf-complete combination has been compiled with SSE4.1, SSE4.2, PCLMUL, SSSE3, SSE3, SSE2, SSE flags activated. These are jerasure v2 and gf-complete v1, only slightly modified as found in https://github.com/ceph/jerasure/tree/v2-ceph and https://github.com/ceph/gf-complete/tree/v1-ceph (all commits there have a pending pull request under https://bitbucket.org/jimplank/gf-complete https://bitbucket.org/jimplank/jerasure, nothing you've not seen before).
> >>>>>
> >>>>> #5 is https://github.com/ceph/jerasure/blob/v2-ceph/src/jerasure.c#L607
> >>>>>
> >>>>> and then it dives into gf-complete and most probably destroyed part of the stack when corrupting memory. I'll be chasing this tomorrow. If you have a brilliant idea on why that happens, I'll take it ;-)
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> --
> >>>>> Loïc Dachary, Artisan Logiciel Libre
> >>>>>
> >>>>>
> >>>>
> >>>> --
> >>>> Loïc Dachary, Artisan Logiciel Libre
> >>>>
> >>>>
> >>>
> >>>
> >>
> >> --
> >> Loïc Dachary, Artisan Logiciel Libre
> >>
> >
> > --
> > Loïc Dachary, Artisan Logiciel Libre
>
--
Loïc Dachary, Artisan Logiciel Libre
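
PS: for the record, here is a minimal sketch of the kind of runtime selection described at the top of this message. It is not the code in ErasureCodePluginSelectJerasure.cc. The CpuFeatures struct, the select_jerasure_variant function, the names jerasure_sse3 and jerasure_generic, and the exact feature thresholds are assumptions made for illustration. Only jerasure_sse4 and the list of flags come from this thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical summary of what CPU probing reported for this machine.
struct CpuFeatures {
    bool sse2;
    bool sse3;
    bool ssse3;
    bool sse41;
    bool sse42;
    bool pclmul;
};

// Pick the most optimized plugin variant the CPU can run.
// Assumption: three variants exist, ordered from most to least demanding.
// Loading a variant built with flags the CPU lacks is what would produce
// an illegal instruction fault, so every required flag is checked.
inline std::string select_jerasure_variant(const CpuFeatures& cpu) {
    const bool sse3_level = cpu.sse2 && cpu.sse3 && cpu.ssse3;
    const bool sse4_level = sse3_level && cpu.sse41 && cpu.sse42 && cpu.pclmul;
    if (sse4_level) {
        return "jerasure_sse4";
    }
    if (sse3_level) {
        return "jerasure_sse3";     // assumed name
    }
    return "jerasure_generic";      // assumed name
}

// Example use: a CPU reporting every feature gets the sse4 variant.
inline void example() {
    CpuFeatures all{true, true, true, true, true, true};
    assert(select_jerasure_variant(all) == "jerasure_sse4");
}
```

With a selection like this, a missing PCLMUL flag would make the loader fall back to a less optimized variant instead of the one that faults.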
