On Wed, 2 Apr 2014, Loic Dachary wrote:
> 
> 
> On 02/04/2014 19:44, Kevin Greenan wrote:
> > Hey Loic,
> > 
> > Are you ensuring that Jerasure (actually gf-complete) is getting memory 
> > buffers aligned on 16-byte boundaries?  Without looking too deep, that is 
> > the first thing I would check.
> > 
> 
> Yes
> 
> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L32
> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L242
> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L65
> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L108
> 

In this case they are 2K aligned:

(gdb) p data_ptrs[0]
$1 = 0x3e46000 "I'm the", ' ' <repeats 16 times>, "3th object!"
(gdb) p data_ptrs[1]
$2 = 0x3e46800 'z' <repeats 200 times>...
(gdb) p coding_ptrs[0]
$3 = 0x338e000 "I'm the", ' ' <repeats 16 times>, "3th object!"
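
The same check can be sketched in C rather than read off the gdb output. This is only an illustration: is_aligned and all_aligned are hypothetical helpers, not part of jerasure or gf-complete, and the 16-byte figure is the boundary Kevin asked about.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if ptr is a multiple of `alignment`,
 * which must be a non-zero power of two (16 for SSE, 2048 for 2K). */
static int is_aligned(const void *ptr, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* Hypothetical helper: returns 1 only if every one of the `count`
 * buffers in ptrs[] (e.g. data_ptrs or coding_ptrs) is aligned. */
static int all_aligned(char **ptrs, int count, uintptr_t alignment)
{
    for (int i = 0; i < count; i++) {
        if (!is_aligned(ptrs[i], alignment))
            return 0;
    }
    return 1;
}
```

Calling all_aligned(data_ptrs, k, 16) and all_aligned(coding_ptrs, m, 16) just before the encode call would catch a misaligned buffer; any address that is 2K aligned, like the three above, also passes the 16-byte check.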

sage

> I'll re-read this logic tomorrow just to be sure.
> 
> Cheers
> 
> > I can have a deeper look later today or tomorrow.
> > 
> > -kevin
> > 
> > 
> > On Wed, Apr 2, 2014 at 10:35 AM, Loic Dachary <[email protected]> wrote:
> > 
> >     Hi Kevin,
> > 
> >     In the context of http://tracker.ceph.com/issues/7914 we're trying to 
> > figure out why jerasure dumps core. We don't know how to reproduce it yet 
> > (ran dozens of identical test suites with no such crash in the past few 
> > days, which is to be expected for rare bugs because the test suite 
> > introduces random errors / failures on purpose).
> > 
> >     The full stack trace is at http://tracker.ceph.com/issues/7914#note-24 
> > but the relevant part is here:
> > 
> >     #0  0x00007f4756779b7b in raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
> >     #1  0x0000000000981b4e in reraise_fatal (signum=11) at global/signal_handler.cc:59
> >     #2  handle_fatal_signal (signum=11) at global/signal_handler.cc:105
> >     #3  <signal handler called>
> >     #4  0x0000000000000000 in ?? ()
> >     #5  0x00007f47385ae6b1 in jerasure_matrix_dotprod (k=2, w=8, matrix_row=0x31513a8, src_ids=0x0, dest_id=<optimized out>, data_ptrs=0x7f4741ec7a00, coding_ptrs=0x7f4741ec7a10, size=2048) at erasure-code/jerasure/jerasure/src/jerasure.c:607
> >     #6  0x00007f47385ae7d6 in jerasure_matrix_encode (k=2, m=1, w=8, matrix=<optimized out>, data_ptrs=0x7f4741ec7a00, coding_ptrs=0x7f4741ec7a10, size=2048) at erasure-code/jerasure/jerasure/src/jerasure.c:310
> >     ...
> > 
> >     Note that this jerasure/gf-complete combination has been compiled with 
> > SSE4.1, SSE4.2, PCLMUL, SSSE3, SSE3, SSE2, SSE flags activated. These are 
> > jerasure v2 and gf-complete v1, only slightly modified as found in 
> > https://github.com/ceph/jerasure/tree/v2-ceph and 
> > https://github.com/ceph/gf-complete/tree/v1-ceph (all commits there have a 
> > pending pull request under https://bitbucket.org/jimplank/gf-complete and 
> > https://bitbucket.org/jimplank/jerasure, nothing you've not seen before).
> > 
> >     #5 is https://github.com/ceph/jerasure/blob/v2-ceph/src/jerasure.c#L607
> > 
> >     and from there it dives into gf-complete, which most probably destroyed 
> > part of the stack by corrupting memory. I'll be chasing this tomorrow. If 
> > you have a brilliant idea on why that happens, I'll take it ;-)
> > 
> >     Cheers
> > 
> >     --
> >     Loïc Dachary, Artisan Logiciel Libre
> > 
> > 
> 
> -- 
> Loïc Dachary, Artisan Logiciel Libre
> 
> 
