On 12/31 11:44, Jeremie Courreges-Anglas wrote:
> On Sat, Dec 28 2019, Jeremie Courreges-Anglas <[email protected]> wrote:
> > On Sat, Dec 28 2019, Jeremie Courreges-Anglas <[email protected]> wrote:
> >
> > [...]
> >
> >> I removed test_fiber.rb and am now running the tests again, trying to
> >> sort out why I'm hitting resource exhaustion...
> >
> > Build restarted with DEBUG=-g.
> 
> Back to this test_fiber.rb problem.  Looks like sparc register windows
> handling has been fragile for some time in ruby land:
> 
>   https://redmine.ruby-lang.org/issues/5244
> 
> The source code mentions a discussion with David Miller:
> 
>   https://marc.info/?l=linux-sparc&m=131914569320660&w=2
> 
> I have tried to use "ta 0x03" instead of "flushw [....%o7]" in
> coroutine_flush_register_windows(), it didn't help.
> 
> I couldn't find a core dump for test_fiber.rb, pointers for proper
> debugging welcome.  Instead, here's a simple reproducer.

After over 9 months of slacking and hoping someone else would fix it,
I finally took the time to look into this.  Both jca@ and kn@ offered
sparc64 access, and that was instrumental in debugging this issue.
Thanks kn@ for setting up my account and going out of his way so I can
get access from my IPv4-only network.

Between Ruby 2.6 and 2.7, the default coroutine implementation for Ruby
changed. Before Ruby 2.7, the coroutine implementation was pretty much
the same since Ruby 1.9.  In Ruby 2.7, the design is similar and equally
horrific. Fibers still switch by copying the entire stack to the heap and
calling setjmp, and then overwriting the entire stack with the stack of
the fiber you are switching to (which was previously copied to the heap),
then calling longjmp after overwriting the stack.  However, there were
some implementation differences between the implementations causing the
new implementation to not work.

I was able to modify the new implementation enough so that it does work
correctly on sparc64, passing test_fiber.rb, as long as the specific
files related to fibers are compiled without optimization.  With that,
ruby was able to pass the bootstrap tests, but it quickly failed when
getting to test-all with a SIGABRT.  From the core dump, I was able to
tell that was an issue I already fixed upstream earlier this year. With
that patch also backported, now test-all runs for much longer on sparc64
before also dying with a SIGABRT.  Unfortunately, no core dump for that
one, so I'm not sure what might be causing it.  Maybe it was over the
core dump limit and I need some ulimit -c or login.conf coredumpsize
magic?

Anyway, I think this is probably enough of an improvement to make Ruby
2.7 usable on sparc64.  However, since I'm way outside my area of
expertise, I'd like to get an OK on this from someone who uses sparc64
and ideally other non-amd64 non-i386 arches.

Here's a link to the fiber (copy coroutine) fix:
https://github.com/jeremyevans/ruby/commit/7008ca3838d33bd2d594251fc8c7f4ce52dde72d
 

Briefly, the changes made:

1) Don't flush register windows when saving the stack until right before
copying the stack to the heap. This fixes local variables not being
correct after the stack is restored.

2) Add a char[128] buffer to a couple functions. This is necessary (Ruby
crashes without it), though I'm not sure why.

3) Make sure we copy the entire current stack frame before copying the
stack to the heap by calling a function and returning the address of the
local variable inside of it (this causes a compiler warning). This is
necessary to ensure we don't have a partial copy of the current stack
frame.

The Ruby fiber maintainer has briefly reviewed these changes and is OK
with them, it is likely they will be upstreamed if they don't cause
problems on other arches.

It would also be great to get some feedback on this on arches besides
amd64, i386, and sparc64, if only to make sure that it doesn't make
things worse.  I would like to switch the default Ruby version in ports
from 2.6 to 2.7 before the end of the year, but I want to be sure that
won't break any of the arches we currently build packages for.

Thanks,
Jeremy

Index: Makefile
===================================================================
RCS file: /home/cvs/ports/lang/ruby/2.7/Makefile,v
retrieving revision 1.5
diff -u -p -u -p -r1.5 Makefile
--- Makefile    5 Oct 2020 17:13:13 -0000       1.5
+++ Makefile    21 Oct 2020 17:41:15 -0000
@@ -6,6 +6,13 @@ SHARED_LIBS =          ruby27  0.0
 NEXTVER =              2.8
 PKGSPEC-main ?=         ruby->=2.7.0,<${NEXTVER}
 
+REVISION-main =                0
+
+MASTER_SITES0 =                https://github.com/jeremyevans/ruby/commit/
+PATCHFILES =           
ruby-2.7-copy-coro-fix{7008ca3838d33bd2d594251fc8c7f4ce52dde72d}.diff:0 \
+                       
ruby-2.7-ffi-closure-fix{040cfc89b9a110cd0fb2abdcd35e8215b4a71f60}.diff:0
+PATCH_DIST_STRIP =     -p1
+
 PSEUDO_FLAVORS=                no_ri_docs bootstrap
 # Do not build the RI docs on slow arches
 .if ${MACHINE_ARCH:Malpha} || ${MACHINE_ARCH:Marm} || ${MACHINE_ARCH:Mhppa}
Index: distinfo
===================================================================
RCS file: /home/cvs/ports/lang/ruby/2.7/distinfo,v
retrieving revision 1.3
diff -u -p -u -p -r1.3 distinfo
--- distinfo    5 Oct 2020 17:13:13 -0000       1.3
+++ distinfo    21 Oct 2020 17:41:47 -0000
@@ -1,2 +1,6 @@
+SHA256 (ruby-2.7-copy-coro-fix.diff) = 
aZxV06Be48hCTvOY/vqWQJlPRepB/l+2K0GZIfMS9Sg=
+SHA256 (ruby-2.7-ffi-closure-fix.diff) = 
sFczOlxQZTGSDkXpvB0chFg81V4gWJYQavTgt3tCq5s=
 SHA256 (ruby-2.7.2.tar.gz) = blcG0NTuTh4viD2512hYa00GVn3r6jU8eW7EXoMhw9Q=
+SIZE (ruby-2.7-copy-coro-fix.diff) = 2498
+SIZE (ruby-2.7-ffi-closure-fix.diff) = 410
 SIZE (ruby-2.7.2.tar.gz) = 16836767
Index: patches/patch-common_mk
===================================================================
RCS file: /home/cvs/ports/lang/ruby/2.7/patches/patch-common_mk,v
retrieving revision 1.1.1.1
diff -u -p -u -p -r1.1.1.1 patch-common_mk
--- patches/patch-common_mk     2 Jan 2020 21:19:57 -0000       1.1.1.1
+++ patches/patch-common_mk     21 Oct 2020 17:02:22 -0000
@@ -4,6 +4,9 @@ Enable verbose mode when building.
 
 Don't regenerate rdoc documentation during install.
 
+Force building coroutine code without optimization on sparc64
+to avoid crashes.
+
 Index: common.mk
 --- common.mk.orig
 +++ common.mk
@@ -34,3 +37,20 @@ Index: common.mk
  pre-install-doc:: install-prereq
  do-install-doc: $(PROGRAM) pre-install-doc
        $(INSTRUBY) --make="$(MAKE)" $(INSTRUBY_ARGS) --install=rdoc 
$(INSTALL_DOC_OPTS)
+@@ -1522,6 +1522,16 @@ update-man-date: PHONY
+       -e 'BEGIN{@vcs=VCS.detect(ARGV.shift)}' \
+       -e '$$_.sub!(/^(\.Dd ).*/){[email protected](ARGF.path).strftime("%B 
%d, %Y")}' \
+       "$(srcdir)" "$(srcdir)"/man/*.1
++
++.if "sparc64" == ${MACHINE}
++cont.o: cont.c
++      @$(ECHO) compiling $<
++      $(Q) $(CC) $(CFLAGS) $(XCFLAGS) $(CPPFLAGS) -O0 $(COUTFLAG)$@ -c $<
++
++coroutine/copy/Context.o: coroutine/copy/Context.c
++      @$(ECHO) compiling $<
++      $(Q) $(CC) $(CFLAGS) $(XCFLAGS) $(CPPFLAGS) -O0 $(COUTFLAG)$@ -c $<
++.endif
+ 
+ HELP_EXTRA_TASKS = ""
+ 

Reply via email to