[re-adding the list for the record]

On 07/04/2014 19:53, Kevin Greenan wrote:
> Hey Loic,
> 
> BTW, you can get an illegal instruction fault if you are calling an intrinsic 
> that is not supported on a particular platform.  Is the code being compiled 
> on a platform that is different than the machines in your test harness?
> 

The plugin is compiled three times, each with a different set of compiler flags:

https://github.com/ceph/ceph/blob/firefly/src/erasure-code/jerasure/Makefile.am#L50

At runtime, the appropriate binary is loaded depending on the CPU features:

https://github.com/ceph/ceph/blob/firefly/src/erasure-code/jerasure/ErasureCodePluginSelectJerasure.cc#L42

The logs confirm that jerasure_sse4 is used in this particular case. All
tests were run on machines that the following code reported as having the
required CPU features:

https://github.com/ceph/ceph/blob/firefly/src/arch/intel.c#L10
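
To make the selection logic described above concrete, here is a rough sketch in C. It is not the actual Ceph code: the struct, the function names and the exact feature combinations are assumptions made for illustration, as are the "jerasure_sse3" and "jerasure_generic" names (only jerasure_sse4 appears in the logs mentioned above). It relies on the GCC/Clang builtins __builtin_cpu_init and __builtin_cpu_supports, which are only available on x86 targets.

```c
#include <stdbool.h>

/* Hypothetical summary of the CPU features that matter for the choice. */
struct cpu_features {
    bool sse3;
    bool ssse3;
    bool sse41;
    bool sse42;
    bool pclmul;
};

/* Probe the running CPU. On non-x86 targets, or with a compiler that lacks
 * the builtins, every feature is reported as absent. */
struct cpu_features detect_cpu_features(void)
{
    struct cpu_features f = { false, false, false, false, false };
#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    f.sse3   = __builtin_cpu_supports("sse3") != 0;
    f.ssse3  = __builtin_cpu_supports("ssse3") != 0;
    f.sse41  = __builtin_cpu_supports("sse4.1") != 0;
    f.sse42  = __builtin_cpu_supports("sse4.2") != 0;
    f.pclmul = __builtin_cpu_supports("pclmul") != 0;
#endif
    return f;
}

/* Pick the most optimized variant whose instructions are all supported.
 * Loading a variant built with flags the CPU lacks is what would lead to
 * an illegal instruction fault. The variant names are assumptions. */
const char *select_jerasure_variant(const struct cpu_features *f)
{
    if (f->sse3 && f->ssse3 && f->sse41 && f->sse42 && f->pclmul)
        return "jerasure_sse4";
    if (f->sse3 && f->ssse3)
        return "jerasure_sse3";
    return "jerasure_generic";
}
```

A caller would run detect_cpu_features() once at startup and pass the result to select_jerasure_variant() before loading the plugin. Keeping detection separate from the decision makes the decision testable on any machine.
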

Do you see something missing?

Cheers

> -kevin
> 
> 
> On Sun, Apr 6, 2014 at 12:06 PM, Loic Dachary wrote:
> 
> 
> 
>     On 06/04/2014 18:28, Kevin Greenan wrote:
>     > Hey Loic,
>     >
>     > Did this stuff start happening after a specific commit (or commits)?
>     > I see this bug was opened 6 days ago and some changes to your fork as
>     > of 7 days ago...
>     >
>     > Or is this the first time you have run these tests with the new
>     > Jerasure backend?
> 
>     It's the first time we have run tests with the optimized gf-complete /
>     jerasure (i.e. all flags from
>     https://github.com/ceph/ceph/blob/master/m4/ax_intel.m4 are set because
>     the compiler knows how and it's targeting x86_64). Before that, for
>     three or four weeks, we ran jerasure / gf-complete without any
>     optimization. Before that we ran the previous jerasure version without
>     gf-complete.
> 
>     Cheers
> 
>     >
>     > Thanks,
>     > -kevin
>     >
>     >
>     > On Apr 6, 2014, at 3:12 AM, Loic Dachary wrote:
>     >
>     >> Hi,
>     >>
>     >> An illegal instruction this time
>     >> (http://tracker.ceph.com/issues/7914#note-31). Since the workload is
>     >> slightly different, I'm trying to run it 30 times and see if that
>     >> triggers the problem.
>     >>
>     >> Cheers
>     >>
>     >> On 02/04/2014 20:15, Kevin Greenan wrote:
>     >>> OK, it looks like this happens when the GF backend is first
>     >>> initialized (unless, like Loic pointed out, something is corrupted).
>     >>>
>     >>> Is this consistently happening for carry-free multiply and w=32
>     >>> (i.e. gf_w32_cfm_init)?
>     >>>
>     >>> Can you send me a core + binary, so I can dig in gdb?
>     >>>
>     >>> -kevin
>     >>>
>     >>>
>     >>> On Wed, Apr 2, 2014 at 11:01 AM, Sage Weil wrote:
>     >>>
>     >>>    On Wed, 2 Apr 2014, Loic Dachary wrote:
>     >>>>
>     >>>>
>     >>>> On 02/04/2014 19:44, Kevin Greenan wrote:
>     >>>>> Hey Loic,
>     >>>>>
>     >>>>> Are you ensuring that Jerasure (actually gf-complete) is getting
>     >>>>> memory buffers aligned on 16-byte boundaries?  Without looking too
>     >>>>> deep, that is the first thing I would check.
>     >>>>>
>     >>>>
>     >>>> Yes
>     >>>>
>     >>>> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L32
>     >>>> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L242
>     >>>> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L65
>     >>>> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L108
>     >>>>
>     >>>
>     >>>    In this case they are 2K aligned:
>     >>>
>     >>>    (gdb) p data_ptrs[0]
>     >>>    $1 = 0x3e46000 "I'm the", ' ' <repeats 16 times>, "3th object!"
>     >>>    (gdb) p data_ptrs[1]
>     >>>    $2 = 0x3e46800 'z' <repeats 200 times>...
>     >>>    (gdb) p coding_ptrs[0]
>     >>>    $3 = 0x338e000 "I'm the", ' ' <repeats 16 times>, "3th object!"
>     >>>
>     >>>    sage
>     >>>
>     >>>> I'll re-read this logic tomorrow just to be sure.
>     >>>>
>     >>>> Cheers
>     >>>>
>     >>>>> I can have a deeper look later today or tomorrow.
>     >>>>>
>     >>>>> -kevin
>     >>>>>
>     >>>>>
>     >>>>> On Wed, Apr 2, 2014 at 10:35 AM, Loic Dachary wrote:
>     >>>>>
>     >>>>>    Hi Kevin,
>     >>>>>
>     >>>>>    In the context of http://tracker.ceph.com/issues/7914 we're
>     >>>>>    trying to figure out why jerasure dumps core. We don't know how
>     >>>>>    to reproduce it yet (we ran dozens of identical test suites with
>     >>>>>    no such crash in the past few days, which is to be expected for
>     >>>>>    rare bugs because the test suite introduces random errors /
>     >>>>>    failures on purpose).
>     >>>>>
>     >>>>>    The full stack trace is at
>     >>>>>    http://tracker.ceph.com/issues/7914#note-24 but the relevant
>     >>>>>    part is here:
>     >>>>>
>     >>>>>    #0  0x00007f4756779b7b in raise (sig=<optimized out>)
>     >>>>>        at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
>     >>>>>    #1  0x0000000000981b4e in reraise_fatal (signum=11)
>     >>>>>        at global/signal_handler.cc:59
>     >>>>>    #2  handle_fatal_signal (signum=11)
>     >>>>>        at global/signal_handler.cc:105
>     >>>>>    #3  <signal handler called>
>     >>>>>    #4  0x0000000000000000 in ?? ()
>     >>>>>    #5  0x00007f47385ae6b1 in jerasure_matrix_dotprod (k=2, w=8,
>     >>>>>        matrix_row=0x31513a8, src_ids=0x0, dest_id=<optimized out>,
>     >>>>>        data_ptrs=0x7f4741ec7a00, coding_ptrs=0x7f4741ec7a10, size=2048)
>     >>>>>        at erasure-code/jerasure/jerasure/src/jerasure.c:607
>     >>>>>    #6  0x00007f47385ae7d6 in jerasure_matrix_encode (k=2, m=1, w=8,
>     >>>>>        matrix=<optimized out>, data_ptrs=0x7f4741ec7a00,
>     >>>>>        coding_ptrs=0x7f4741ec7a10, size=2048)
>     >>>>>        at erasure-code/jerasure/jerasure/src/jerasure.c:310
>     >>>>>    ...
>     >>>>>
>     >>>>>    Note that this jerasure/gf-complete combination has been compiled
>     >>>>>    with SSE4.1, SSE4.2, PCLMUL, SSSE3, SSE3, SSE2, SSE flags
>     >>>>>    activated. These are jerasure v2 and gf-complete v1, only
>     >>>>>    slightly modified as found in
>     >>>>>    https://github.com/ceph/jerasure/tree/v2-ceph and
>     >>>>>    https://github.com/ceph/gf-complete/tree/v1-ceph (all commits
>     >>>>>    there have a pending pull request under
>     >>>>>    https://bitbucket.org/jimplank/gf-complete and
>     >>>>>    https://bitbucket.org/jimplank/jerasure, nothing you've not seen
>     >>>>>    before).
>     >>>>>
>     >>>>>    #5 is https://github.com/ceph/jerasure/blob/v2-ceph/src/jerasure.c#L607
>     >>>>>
>     >>>>>    and then it dives into gf-complete and most probably destroyed
>     >>>>>    part of the stack when corrupting memory. I'll be chasing this
>     >>>>>    tomorrow. If you have a brilliant idea on why that happens, I'll
>     >>>>>    take it ;-)
>     >>>>>
>     >>>>>    Cheers
>     >>>>>
>     >>>>>    --
>     >>>>>    Loïc Dachary, Artisan Logiciel Libre
>     >>>>>
>     >>>>>
>     >>>>
>     >>>> --
>     >>>> Loïc Dachary, Artisan Logiciel Libre
>     >>>>
>     >>>>
>     >>>
>     >>>
>     >>
>     >> --
>     >> Loïc Dachary, Artisan Logiciel Libre
>     >>
>     >
> 
>     --
>     Loïc Dachary, Artisan Logiciel Libre
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
