Hello fellow Racketeers,

During my research into how Racket can be used as a generic software
rendering platform, I've hit some limits of Racket's (native) thread
handling. Once I started getting SIGSEGVs, I strongly suspected I was
doing too many unsafe operations - and to be honest, that was true.
There was one off-by-one memory access :).

But that was easy to resolve - I just switched to safe/contracted
versions of everything and found and fixed the bug. But I still got
occasional SIGSEGVs. So I dug even deeper than before (during the last
two months I've read most of the JIT inlining code) and noticed that
the crashes disappear when I refrain from calling bytes-set! in
parallel using futures.

So I started creating a minimal crashing example (MCE). At first, I
failed miserably: just filling a byte array over and over again, I was
unable to reproduce the crash. But then I realized that in my
application, threads come into play and that might be the cause. And
suddenly, creating an MCE was really easy:

Create a new eventspace using parameterize/make-eventspace, put the
actual code in an application thread (thread ...) and make the main
thread wait for this application thread using thread-wait. Before
starting the application thread, I create a simple window, a bitmap and
a canvas that I keep redrawing using refresh-now after each iteration.
The funny thing is, now it keeps crashing even without actually
modifying the bitmap in question. All I need to do is to mess with some
byte array in 8 threads. Sometimes it takes a minute on my computer
before it crashes, sometimes it takes longer, but it eventually crashes
pretty consistently.
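
For contrast, the earlier GUI-free attempt (futures filling a byte
array, no eventspace, no extra thread) can be sketched roughly as
follows. This is a reconstruction from the description above, not the
original code: the names fill-range! and fill-parallel!, the buffer
size and the round count are assumptions made for illustration. As
described above, a loop of this kind on its own did not reproduce the
crash.

```racket
#lang racket/base
;; Sketch only: a GUI-free parallel fill, reconstructed from the prose.
;; `fill-range!` and `fill-parallel!` are hypothetical names.
(require racket/future)

(define size (* 800 600 4))
(define buf (make-bytes size 0))

;; Write `value` into every byte of buf in [start, end).
(define (fill-range! start end value)
  (for ([i (in-range start end)])
    (bytes-set! buf i value)))

;; Halve [start, end) `depth` times; at each split one half runs in a
;; future and the other half runs in the current thread.
(define (fill-parallel! start end value depth)
  (cond
    [(> depth 0)
     (define mid (+ start (quotient (- end start) 2)))
     (define f (future (λ () (fill-parallel! start mid value (sub1 depth)))))
     (fill-parallel! mid end value (sub1 depth))
     (touch f)]
    [else (fill-range! start end value)]))

;; Ten rounds with 2^3 = 8 leaf ranges (the real attempt looped forever).
(for ([round (in-range 10)])
  (fill-parallel! 0 size round 3))
```

The leaf ranges are disjoint, so no two futures ever write the same
byte.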

The complete example is just 60 lines of code:

#lang racket/gui

(require racket/future racket/fixnum racket/cmdline)

(define width 800)
(define height 600)

(define framebuffer (make-fxvector (* width height)))
(define pixels (make-bytes (* width height 4)))

(define max-depth 0)

(command-line
 #:once-each
 (("-d" "--depth") d "Futures binary partitioning depth"
  (set! max-depth (string->number d))))

(file-stream-buffer-mode (current-output-port) 'none)

(parameterize ((current-eventspace (make-eventspace)))
  (define win (new frame%
                   (label "test")
                   (width width)
                   (height height)))
  (define bmp (make-bitmap width height))
  (define canvas (new canvas%
                      (parent win)
                      (paint-callback
                       (λ (c dc)
                         (send dc draw-bitmap bmp 0 0)))))

  (define (single-run)
    (define (do-bflip start end (depth 0))
      (cond ((fx< depth max-depth)
             (define cnt (fx- end start))
             (define cnt2 (fxrshift cnt 1))
             (define mid (fx+ start cnt2))
             (let ((f (future
                       (λ ()
                         (do-bflip start mid (fx+ depth 1))))))
               (do-bflip mid end (fx+ depth 1))
               (touch f)))
            (else
             (for ((i (in-range start end)))
               (define c (fxvector-ref framebuffer i))
               (bytes-set! pixels (+ (* i 4) 0) #xff)
               (bytes-set! pixels (+ (* i 4) 1) (fxand (fxrshift c 16) #xff))
               (bytes-set! pixels (+ (* i 4) 2) (fxand (fxrshift c 8) #xff))
               (bytes-set! pixels (+ (* i 4) 3) (fxand c #xff))))))
    (do-bflip 0 (* width height))
    (send canvas refresh-now))
  (send win show #t)

  (define appthread
    (thread
     (λ ()
       (let loop ()
         (single-run)
         (loop)))))
  (thread-wait appthread))

Note: the code is deliberately de-optimized to highlight the problem,
and that is not even mentioning CPU cache coherence here...
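
To make the relation between the -d depth and the amount of parallel
work explicit, here is a small sketch that lists the leaf ranges
produced by the same halving that do-bflip performs. The helper name
leaf-ranges is hypothetical and not part of crash.rkt; it only mirrors
the index arithmetic and creates no futures.

```racket
#lang racket/base
;; Sketch only: mirrors the range splitting of do-bflip without futures.
;; `leaf-ranges` is a hypothetical helper, not part of crash.rkt.

(define (leaf-ranges start end max-depth [depth 0])
  (cond
    [(< depth max-depth)
     ;; Same midpoint as (fx+ start (fxrshift (fx- end start) 1)).
     (define mid (+ start (quotient (- end start) 2)))
     (append (leaf-ranges start mid max-depth (add1 depth))
             (leaf-ranges mid end max-depth (add1 depth)))]
    [else (list (cons start end))]))

(define ranges (leaf-ranges 0 (* 800 600) 3))
(length ranges) ; 8 leaf ranges for -d 3
(car ranges)    ; (0 . 60000)
```

So a depth of d yields 2^d leaf loops over disjoint pixel ranges; with
-d 3 and an 800x600 buffer each leaf covers 60000 pixels.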

Running this from the command line, I can adjust the number of threads
with -d (the partitioning depth; depth d gives 2^d parallel chunks).
Running with 8 threads:

$ time racket crash.rkt -d 3
SIGSEGV MAPERR si_code 1 fault on addr (nil)
Aborted (core dumped)

real    1m18,162s
user    7m11,936s
sys     0m3,832s
$ time racket crash.rkt -d 3
SIGSEGV MAPERR si_code 1 fault on addr (nil)
Aborted (core dumped)

real    3m44,005s
user    20m10,920s
sys     0m11,702s
$ time racket crash.rkt -d 3
SIGSEGV MAPERR si_code 1 fault on addr (nil)
Aborted (core dumped)

real    2m1,650s
user    10m58,392s
sys     0m6,445s
$ time racket crash.rkt -d 3
SIGSEGV MAPERR si_code 1 fault on addr (nil)
Aborted (core dumped)

real    8m8,666s
user    45m52,359s
sys     0m25,184s
$

With 4 threads it didn't crash even after quite some time:

$ time racket crash.rkt -d 2
^Cuser break
  context...:
   "crash.rkt": [running body]
   temp35_0
   for-loop
   run-module-instance!
   perform-require!

real    20m18,706s
user    61m38,546s
sys     0m22,719s
$


I'll re-run the 4-thread test overnight.

What would be the best approach to debugging this issue? I assume I'll
load the racket binary in gdb and look at the stack traces at the
moment of the crash, but that probably won't reveal the source of the
problem (judging by my previous experience of debugging heavily
multi-threaded applications). I also probably need a build with
debugging symbols, which is my plan for this afternoon.

I am running this on:

model name      : Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

HT is enabled.

Although this is just a side project, my work (that is the paid-for
work) relies heavily on futures and GUI, so I would really like to nail
down and fix this problem.

Any suggestions are welcome.


Cheers,
Dominik
