Hi Kevin,

In galois.c, gfp_array is a global variable. If galois_w16_region_xor is called
from two different threads, there is a race condition.

   http://tracker.ceph.com/issues/7914#note-39

If you agree that it's a plausible explanation for the crashes, I'll start
working on improving jerasure thread safety.
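
To make the suspected race concrete, here is a minimal sketch, not the actual
galois.c code. It assumes the global table is filled lazily on first use, so
two threads can both see an empty slot and both try to initialize it. The
names field_t, field_array, get_field, worker and run_demo are hypothetical
stand-ins (field_array plays the role of gfp_array), and the mutex is only one
possible way to serialize the initialization.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MAX_W 33      /* hypothetical table size, one slot per word size w */
#define NTHREADS 8    /* number of threads used by the demo */

/* Hypothetical stand-in for gf_t; the real type lives in gf-complete. */
typedef struct { int w; } field_t;

static field_t *field_array[MAX_W];   /* plays the role of gfp_array */
static pthread_mutex_t field_lock = PTHREAD_MUTEX_INITIALIZER;
static int init_count;                /* how many times a slot was built */

/* Return the field for w, building it at most once even under contention. */
static field_t *get_field(int w)
{
    field_t *f;

    assert(w >= 0 && w < MAX_W);
    pthread_mutex_lock(&field_lock);
    if (field_array[w] == NULL) {
        /* Without the lock, two threads could both see NULL here. */
        f = malloc(sizeof(*f));
        if (f != NULL) {
            f->w = w;
            field_array[w] = f;
            init_count++;
        }
    }
    f = field_array[w];
    pthread_mutex_unlock(&field_lock);
    return f;
}

/* Thread body: every worker asks for the same w=16 field. */
static void *worker(void *arg)
{
    (void)arg;
    return get_field(16);
}

/* Start NTHREADS workers, wait for them, and return the init count. */
static int run_demo(void)
{
    pthread_t tids[NTHREADS];
    int created[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++)
        created[i] = (pthread_create(&tids[i], NULL, worker, NULL) == 0);
    for (i = 0; i < NTHREADS; i++)
        if (created[i])
            pthread_join(tids[i], NULL);
    return init_count;
}
```

With the lock in place, run_demo() should report a single initialization no
matter how the threads interleave. The lock is taken on every lookup, which is
simple but adds contention; initializing every field before any thread starts
would be another option.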

Cheers

On 07/04/2014 20:29, Loic Dachary wrote:
> [re-adding the list for the record]
> 
> On 07/04/2014 19:53, Kevin Greenan wrote:
>> Hey Loic,
>>
>> BTW, you can get an illegal instruction fault if you are calling an 
>> intrinsic that is not supported on a particular platform.  Is the code being 
>> compiled on a platform that is different than the machines in your test 
>> harness?
>>
> 
> The plugin is compiled with three kinds of flags:
> 
> https://github.com/ceph/ceph/blob/firefly/src/erasure-code/jerasure/Makefile.am#L50
> 
> at runtime the appropriate binary is loaded depending on the CPU features
> 
> https://github.com/ceph/ceph/blob/firefly/src/erasure-code/jerasure/ErasureCodePluginSelectJerasure.cc#L42
> 
> and the logs confirm that jerasure_sse4 is used in this particular case. All 
> tests were run on machines tested to have the required CPU features in
> 
> https://github.com/ceph/ceph/blob/firefly/src/arch/intel.c#L10
> 
> Do you see anything missing?
> 
> Cheers
> 
>> -kevin
>>
>>
>> On Sun, Apr 6, 2014 at 12:06 PM, Loic Dachary wrote:
>>
>>
>>
>>     On 06/04/2014 18:28, Kevin Greenan wrote:
>>     > Hey Loic,
>>     >
>>     > Did this stuff start happening after a specific commit (or commits)?  
>> I see this bug was opened 6 days ago and some changes to your fork as of 7 
>> days ago...
>>     >
>>     > Or is this the first time you have run these tests with the new 
>> Jerasure backend?
>>
>>     It's the first time we run tests with gf-complete / jerasure optimized 
>> (i.e. all flags from https://github.com/ceph/ceph/blob/master/m4/ax_intel.m4 
>> are set because the compiler knows how and it's targeting x86_64). Before 
>> that and during three or four weeks we ran jerasure / gf-complete without 
>> any optimization. Before that we ran the previous jerasure version without 
>> gf-complete.
>>
>>     Cheers
>>
>>     >
>>     > Thanks,
>>     > -kevin
>>     >
>>     >
>>     > On Apr 6, 2014, at 3:12 AM, Loic Dachary wrote:
>>     >
>>     >> Hi,
>>     >>
>>     >> An illegal instruction this time 
>> http://tracker.ceph.com/issues/7914#note-31 . Since the workload is slightly 
>> different, I'm trying to run it 30 times and see if that triggers the 
>> problem.
>>     >>
>>     >> Cheers
>>     >>
>>     >> On 02/04/2014 20:15, Kevin Greenan wrote:
>>     >>> OK, it looks like this happens when the GF backend is first 
>> initialized (unless, like Loic pointed out, something is corrupted).
>>     >>>
>>     >>> Is this consistently happening for carry-free multiply and w=32 
>> (i.e. gf_w32_cfm_init)?
>>     >>>
>>     >>> Can you send me a core + binary, so I can dig in gdb?
>>     >>>
>>     >>> -kevin
>>     >>>
>>     >>>
>>     >>> On Wed, Apr 2, 2014 at 11:01 AM, Sage Weil wrote:
>>     >>>
>>     >>>    On Wed, 2 Apr 2014, Loic Dachary wrote:
>>     >>>>
>>     >>>>
>>     >>>> On 02/04/2014 19:44, Kevin Greenan wrote:
>>     >>>>> Hey Loic,
>>     >>>>>
>>     >>>>> Are you ensuring that Jerasure (actually gf-complete) is getting 
>> memory buffers aligned on 16-byte boundaries?  Without looking too deep, 
>> that is the first thing I would check.
>>     >>>>>
>>     >>>>
>>     >>>> Yes
>>     >>>>
>>     >>>> 
>> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L32
>>     >>>> 
>> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L242
>>     >>>> 
>> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L65
>>     >>>> 
>> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L108
>>     >>>>
>>     >>>
>>     >>>    In this case they are 2K aligned:
>>     >>>
>>     >>>    (gdb) p data_ptrs[0]
>>     >>>    $1 = 0x3e46000 "I'm the", ' ' <repeats 16 times>, "3th object!"
>>     >>>    (gdb) p data_ptrs[1]
>>     >>>    $2 = 0x3e46800 'z' <repeats 200 times>...
>>     >>>    (gdb) p coding_ptrs[0]
>>     >>>    $3 = 0x338e000 "I'm the", ' ' <repeats 16 times>, "3th object!"
>>     >>>
>>     >>>    sage
>>     >>>
>>     >>>> I'll re-read this logic tomorrow just to be sure.
>>     >>>>
>>     >>>> Cheers
>>     >>>>
>>     >>>>> I can have a deeper look later today or tomorrow.
>>     >>>>>
>>     >>>>> -kevin
>>     >>>>>
>>     >>>>>
>>     >>>>> On Wed, Apr 2, 2014 at 10:35 AM, Loic Dachary wrote:
>>     >>>>>
>>     >>>>>    Hi Kevin,
>>     >>>>>
>>     >>>>>    In the context of http://tracker.ceph.com/issues/7914 we're 
>> trying to figure out why jerasure dumps core. We don't know how to reproduce 
>> it yet (ran dozens of identical test suites with no such crash in the past 
>> few days, which is to be expected for rare bugs because the test suite 
>> introduces random errors / failures on purpose).
>>     >>>>>
>>     >>>>>    The full stack trace is at 
>> http://tracker.ceph.com/issues/7914#note-24 but the relevant part is here:
>>     >>>>>
>>     >>>>>    #0  0x00007f4756779b7b in raise (sig=<optimized out>) at 
>> ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
>>     >>>>>    #1  0x0000000000981b4e in reraise_fatal (signum=11) at 
>> global/signal_handler.cc:59
>>     >>>>>    #2  handle_fatal_signal (signum=11) at 
>> global/signal_handler.cc:105
>>     >>>>>    #3  <signal handler called>
>>     >>>>>    #4  0x0000000000000000 in ?? ()
>>     >>>>>    #5  0x00007f47385ae6b1 in jerasure_matrix_dotprod (k=2, w=8, 
>> matrix_row=0x31513a8, src_ids=0x0, dest_id=<optimized out>, 
>> data_ptrs=0x7f4741ec7a00, coding_ptrs=0x7f4741ec7a10,
>>     >>>>>        size=2048) at 
>> erasure-code/jerasure/jerasure/src/jerasure.c:607
>>     >>>>>    #6  0x00007f47385ae7d6 in jerasure_matrix_encode (k=2, m=1, 
>> w=8, matrix=<optimized out>, data_ptrs=0x7f4741ec7a00, 
>> coding_ptrs=0x7f4741ec7a10, size=2048)
>>     >>>>>        at erasure-code/jerasure/jerasure/src/jerasure.c:310
>>     >>>>>    ...
>>     >>>>>
>>     >>>>>    Note that this jerasure/gf-complete combination has been 
>> compiled with SSE4.1, SSE4.2, PCLMUL, SSSE3, SSE3, SSE2, SSE flags 
>> activated. These are jerasure v2 and gf-complete v1, only slightly modified 
>> as found in https://github.com/ceph/jerasure/tree/v2-ceph and 
>> https://github.com/ceph/gf-complete/tree/v1-ceph (all commits there have a 
>> pending pull request under https://bitbucket.org/jimplank/gf-complete 
>> https://bitbucket.org/jimplank/jerasure, nothing you've not seen before).
>>     >>>>>
>>     >>>>>    #5 is 
>> https://github.com/ceph/jerasure/blob/v2-ceph/src/jerasure.c#L607
>>     >>>>>
>>     >>>>>    and then it dives into gf-complete and most probably destroys 
>> part of the stack when corrupting memory. I'll be chasing this tomorrow. If 
>> you have a brilliant idea on why that happens, I'll take it ;-)
>>     >>>>>
>>     >>>>>    Cheers
>>     >>>>>
>>     >>>>>    --
>>     >>>>>    Loïc Dachary, Artisan Logiciel Libre
>>     >>>>>
>>     >>>>>
>>     >>>>
>>     >>>> --
>>     >>>> Loïc Dachary, Artisan Logiciel Libre
>>     >>>>
>>     >>>>
>>     >>>
>>     >>>
>>     >>
>>     >> --
>>     >> Loïc Dachary, Artisan Logiciel Libre
>>     >>
>>     >
>>
>>     --
>>     Loïc Dachary, Artisan Logiciel Libre
>>
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
