Here is the stack trace of a successful run, borrowed from the unit tests, to 
confirm the code path: http://tracker.ceph.com/issues/7914#note-27

On 02/04/2014 19:51, Loic Dachary wrote:
> Given the parameters to jerasure_matrix_dotprod the code path should be:
> 
>    https://github.com/ceph/jerasure/blob/v2-ceph/src/galois.c#L338 (because 
> nbytes == 2048)
>    https://github.com/ceph/jerasure/blob/v2-ceph/src/galois.c#L332 
>    https://github.com/ceph/gf-complete/blob/v1-ceph/src/gf_w32.c#L569 
> (because INTEL_SSE4_PCLMUL has been used at compile time and the CPUID 
> detected at runtime has the required features as selected in 
> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodePluginSelectJerasure.cc#L49
>  )
>    
> What should happen after that? h->prim_poly will select something, but what 
> exactly? Could it be that the lack of stack means 
> https://github.com/ceph/jerasure/blob/v2-ceph/src/galois.c#L332 references a 
> NULL or invalid gfp_array[32]? Or could it be that the src/dest pointers 
> point to invalid memory?
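
To make the NULL hypothesis concrete: frame #4 in the trace sits at address 
0x0, which is what a call through a NULL function pointer would look like. 
Below is a minimal sketch of a guarded dispatch, assuming a per-w table of 
function pointers. The names field_ops, ops_by_w, xor_region and 
checked_region_xor are hypothetical and are not the real jerasure or 
gf-complete structures; the point is only to show where a check could turn 
the crash into a diagnosable error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a per-word-size dispatch entry.
 * This is NOT the real gf_t from gf-complete. */
typedef struct {
    void (*region_xor)(const unsigned char *src, unsigned char *dest,
                       size_t nbytes);
} field_ops;

#define MAX_W 33
/* Hypothetical analogue of a table indexed by w; entries start out NULL. */
static field_ops *ops_by_w[MAX_W];

/* Trivial region operation used only to exercise the dispatch. */
static void xor_region(const unsigned char *src, unsigned char *dest,
                       size_t nbytes)
{
    /* Catches NULL buffers only; dangling pointers cannot be detected here. */
    assert(nbytes == 0 || (src != NULL && dest != NULL));
    for (size_t i = 0; i < nbytes; i++)
        dest[i] ^= src[i];
}

/* Returns 0 on success, -1 if the call would have jumped to address 0
 * (missing table entry or missing function pointer) or w is out of range. */
static int checked_region_xor(int w, const unsigned char *src,
                              unsigned char *dest, size_t nbytes)
{
    if (w < 0 || w >= MAX_W)
        return -1;
    field_ops *ops = ops_by_w[w];
    if (ops == NULL || ops->region_xor == NULL) {
        fprintf(stderr, "w=%d: dispatch entry is NULL, refusing to call\n", w);
        return -1;
    }
    ops->region_xor(src, dest, nbytes);
    return 0;
}
```

With an empty table, checked_region_xor(8, src, dest, 2048) returns -1 
instead of jumping to 0x0; once ops_by_w[8] points at an entry whose 
region_xor is xor_region, the same call returns 0. A check like this would 
not catch a pointer that is non-NULL but corrupted.
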
> 
> Bugs that can't be reproduced are the best ;-)
>    
> On 02/04/2014 19:35, Loic Dachary wrote:
>> Hi Kevin,
>>
>> In the context of http://tracker.ceph.com/issues/7914 we're trying to figure 
>> out why jerasure dumps core. We don't know how to reproduce it yet (we ran 
>> dozens of identical test suites over the past few days without such a crash, 
>> which is to be expected for a rare bug because the test suite introduces 
>> random errors / failures on purpose). 
>>
>> The full stack trace is at http://tracker.ceph.com/issues/7914#note-24 but 
>> the relevant part is here:
>>
>> #0  0x00007f4756779b7b in raise (sig=<optimized out>) at 
>> ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
>> #1  0x0000000000981b4e in reraise_fatal (signum=11) at 
>> global/signal_handler.cc:59
>> #2  handle_fatal_signal (signum=11) at global/signal_handler.cc:105
>> #3  <signal handler called>
>> #4  0x0000000000000000 in ?? ()
>> #5  0x00007f47385ae6b1 in jerasure_matrix_dotprod (k=2, w=8, 
>> matrix_row=0x31513a8, src_ids=0x0, dest_id=<optimized out>, 
>> data_ptrs=0x7f4741ec7a00, coding_ptrs=0x7f4741ec7a10, 
>>     size=2048) at erasure-code/jerasure/jerasure/src/jerasure.c:607
>> #6  0x00007f47385ae7d6 in jerasure_matrix_encode (k=2, m=1, w=8, 
>> matrix=<optimized out>, data_ptrs=0x7f4741ec7a00, 
>> coding_ptrs=0x7f4741ec7a10, size=2048)
>>     at erasure-code/jerasure/jerasure/src/jerasure.c:310
>> ...
>>
>> Note that this jerasure/gf-complete combination has been compiled with 
>> SSE4.1, SSE4.2, PCLMUL, SSSE3, SSE3, SSE2, SSE flags activated. These are 
>> jerasure v2 and gf-complete v1, only slightly modified as found in 
>> https://github.com/ceph/jerasure/tree/v2-ceph and 
>> https://github.com/ceph/gf-complete/tree/v1-ceph (all commits there have a 
>> pending pull request under https://bitbucket.org/jimplank/gf-complete and 
>> https://bitbucket.org/jimplank/jerasure, nothing you have not seen before). 
>>
>> #5 is https://github.com/ceph/jerasure/blob/v2-ceph/src/jerasure.c#L607
>>
>> and then it dives into gf-complete, which most probably destroyed part of 
>> the stack when corrupting memory. I'll be chasing this tomorrow. If you have 
>> a brilliant idea on why that happens, I'll take it ;-) 
>>
>> Cheers
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
