On Fri, Aug 9, 2019 at 5:23 PM Waldek Kozaczuk <jwkozac...@gmail.com> wrote:

> Mystery solved!!! One character makes a difference (it is is the 3rd bug I
> have found in last 2 weeks where the fix is a single line of code ;-).
>
> It turns one we have code in OSv that uses sem_post() and sem_wait() (or
> semaphore.cc methods directly) but there only single-use of sem_trywait()
> in ACPI (
> https://github.com/cloudius-systems/osv/blob/master/drivers/acpi.cc#L135)
> which I am sure is even actively used. There are no direct unit tests
> around semaphore.cc or sem_* methods. And Java has just started using
> semaphores (sem_* functions; learned from an email on Hotspot OpenJDK
> mailing list).
>
> So the is this bug in trywait():
>
> diff --git a/core/semaphore.cc b/core/semaphore.cc
>
> index 31fda35e..e2ccc801 100644
>
> --- a/core/semaphore.cc
>
> +++ b/core/semaphore.cc
>
> @@ -62,7 +62,7 @@ bool semaphore::trywait(unsigned units)
>
>  {
>
>      bool ok = false;
>
>      WITH_LOCK(_mtx) {
>
> -        if (_val > units) {
>
> +        if (_val >= units) {
>
             _val -= units;
>
>              ok = true;
>
>          }
>
> With this simple fix Java 12 starts properly.
>

Fantastic. I just committed your patch with this fix. Thanks.


> On Monday, August 5, 2019 at 11:10:32 PM UTC-4, Waldek Kozaczuk wrote:
>>
>> I have not gotten to the bottom of this issue but I think I have narrowed
>> it down and found a workaround.
>>
>> First of all, I discovered a JVM option that limits the usage of signals:
>> `-Xrs' (please see for details this article
>> https://docs.oracle.com/en/java/javase/12/troubleshoot/handle-signals-and-exceptions.html#GUID-57C048F6-0D4B-43BD-B27C-06A613435360
>> ).
>>
>> So after I added -Xrs to the command line I would still see the almost
>> exact same type of 'deadlock-like' threads state except the 'Signal
>> Dispatch' is no longer present. Here is a part of the thread dump:
>> thread 0xffff8000014ae040 /usr/lib/jvm/jd
>>
>> #0  sched::thread::switch_to (this=0xffffa00003b3b6f0, 
>> this@entry=0xffff800001fdd040)
>> at arch/x64/arch-switch.hh:108
>> #1  0x00000000403ff794 in sched::cpu::reschedule_from_interrupt
>> (this=0xffff800000013040, called_from_yield=called_from_yield@entry
>> =false,
>> preempt_after=..., preempt_after@entry=...) at core/sched.cc:339
>> #2  0x00000000403ffc8c in sched::cpu::schedule () at
>> include/osv/sched.hh:1310
>> #3  0x0000000040400692 in sched::thread::wait (this=0xffff8000014ae040)
>> at core/sched.cc:1214
>> #4  sched::thread::do_wait_until<sched::noninterruptible,
>> sched::thread::dummy_lock, sched::thread::join()::<lambda()> >
>> (mtx=<synthetic po
>> inter>..., pred=...) at include/osv/sched.hh:1066
>> #5  sched::thread::wait_until<sched::thread::join()::<lambda()> >
>> (pred=...) at include/osv/sched.hh:1077
>> #6  sched::thread::join (this=<optimized out>) at core/sched.cc:1311
>> #7  0x000000004045f5b6 in pthread_private::pthread::join
>> (retval=0x2000001fbbc8, this=0xffffa00000982d70) at
>> /usr/include/c++/8/bits/unique
>> _ptr.h:342
>> #8  pthread_join (thread=18446638520603258224, retval=0x2000001fbbc8) at
>> libc/pthread.cc:245
>> #9  0x0000100000609efd in ContinueInNewThread0 ()
>> #10 0x0000100000606552 in ContinueInNewThread ()
>> #11 0x0000100000607b68 in JLI_Launch ()
>> #12 0x00000000004009f3 in main ()
>> #13 0x00000000404308fd in osv::application::run_main
>> (this=0xffffa0000115e210) at /usr/include/c++/8/bits/stl_vector.h:805
>> #14 0x0000000040430a4f in __libc_start_main (main=0x4008d0 <main>) at
>> core/app.cc:37
>> #15 0x0000000000400a91 in _start ()
>> #16 0x00002000001fff48 in ?? ()
>> #17 0xfffffffffffff998 in ?? ()
>> #18 0xffffa000012ae4c0 in ?? ()
>> #19 0x000000000000003c in ?? ()
>> #20 0x000000000000003c in ?? ()
>> #21 0x0000000000000040 in ?? ()
>> #22 0x00002000001fffa0 in ?? ()
>> #23 0xffffa0000129d900 in ?? ()
>> #24 0xffffa00000982d50 in ?? ()
>> #25 0xffff800000013040 in ?? ()
>> #26 0x00002000001fffa0 in ?? ()
>> #27 0x0000000040430aa9 in osv::application::<lambda(void*)>::operator()
>> (__closure=0x0, app=<optimized out>) at core/app.cc:233
>> #28 osv::application::<lambda(void*)>::_FUN(void *) () at core/app.cc:235
>> Backtrace stopped: Cannot access memory at address 0x8
>>
>> thread 0xffff800001fdd040 >/usr/lib/jvm/j
>>
>> #0  sched::thread::switch_to (this=0xffffa00003b3b6f0, 
>> this@entry=0xffff800003722040)
>> at arch/x64/arch-switch.hh:108
>> #1  0x00000000403ff794 in sched::cpu::reschedule_from_interrupt
>> (this=0xffff800000013040, called_from_yield=called_from_yield@entry=false,
>> preempt_after=..., preempt_after@entry=...) at core/sched.cc:339
>> #2  0x00000000403ffc8c in sched::cpu::schedule () at
>> include/osv/sched.hh:1310
>> #3  0x0000000040400372 in sched::thread::wait 
>> (this=this@entry=0xffff800001fdd040)
>> at core/sched.cc:1214
>> #4  0x00000000403e38df in
>> sched::thread::do_wait_until<sched::noninterruptible,
>> sched::thread::dummy_lock, waiter::wait(sched::timer*)
>> const::{lambda()#1}>(sched::thread::dummy_lock&,
>> waiter::wait(sched::timer*--Type <RET> for more, q to quit, c to continue
>> without paging--
>> ) const::{lambda()#1}) (pred=..., mtx=<synthetic pointer>...) at
>> include/osv/sched.hh:939
>> #5  sched::thread::wait_until<waiter::wait(sched::timer*)
>> const::{lambda()#1}>(waiter::wait(sched::timer*) const::{lambda()#1})
>> (pred=...) at include/osv/sched.hh:1077
>> #6  waiter::wait (tmr=0x0, this=0x2000002fdb40) at
>> include/osv/wait_record.hh:46
>> #7  condvar::wait (this=0xffffa0000202bb00,
>> user_mutex=0xffffa0000382ca80, tmr=<optimized out>) at core/condvar.cc:43
>> #8  0x0000100001637213 in os::PlatformEvent::park() ()
>> #9  0x00001000015d48e8 in Monitor::IWait(Thread*, long) ()
>> #10 0x00001000015d5aa6 in Monitor::wait(bool, long, bool) ()
>> #11 0x0000100001990237 in VMThread::execute(VM_Operation*) ()
>> #12 0x0000100000fef3ca in CodeCache::flush_dependents_on(InstanceKlass*)
>> ()
>> #13 0x00001000018c8f30 in
>> SystemDictionary::define_instance_class(InstanceKlass*, Thread*) ()
>> #14 0x00001000018c92f2 in
>> SystemDictionary::find_or_define_instance_class(Symbol*, Handle,
>> InstanceKlass*, Thread*) ()
>> #15 0x00001000018cb2fc in SystemDictionary::load_instance_class(Symbol*,
>> Handle, Thread*) ()
>> #16 0x00001000018ca1c2 in
>> SystemDictionary::resolve_instance_class_or_null(Symbol*, Handle, Handle,
>> Thread*) ()
>> #17 0x00001000018cc9e4 in SystemDictionary::resolve_or_fail(Symbol*,
>> Handle, Handle, bool, Thread*) ()
>> #18 0x0000100001077a8d in ConstantPool::klass_at_impl(constantPoolHandle
>> const&, int, bool, Thread*) ()
>> #19 0x00001000012871cc in InterpreterRuntime::_new(JavaThread*,
>> ConstantPool*, int) ()
>> #20 0x0000200001c201c5 in ?? ()
>> #21 0x0000200001c20170 in ?? ()
>> #22 0x00002000002fe408 in ?? ()
>> #23 0x0000200010fec980 in ?? ()
>> #24 0x00002000002fe460 in ?? ()
>> #25 0x0000200010fed4c8 in ?? ()
>> #26 0x0000000000000000 in ?? ()
>>
>> thread 0xffff80000313a040 VM Thread
>>
>> #0  0x00000000403e236c in lockfree::mutex::unlock 
>> (this=this@entry=0xffffa00003553e90)
>> at /usr/include/c++/8/bits/atomic_base.h:512
>> #1  0x00000000403e36bc in std::lock_guard<lockfree::mutex>::~lock_guard
>> (this=<synthetic pointer>, __in_chrg=<optimized out>) at
>> /usr/include/c++/8/bits/std_mutex.h:167
>> #2  lock_guard_for_with_lock<lockfree::mutex>::~lock_guard_for_with_lock
>> (this=<synthetic pointer>, __in_chrg=<optimized out>) at
>> include/osv/mutex.h:88
>> #3  semaphore::trywait (this=0xffffa00003553e80, units=units@entry=1) at
>> core/semaphore.cc:64
>> *#4  0x00000000404656c1 in sem_trywait (s=<optimized out>) at
>> /usr/include/c++/8/bits/unique_ptr.h:342*
>> #5  0x0000100001702bf2 in PosixSemaphore::trywait() ()
>> #6  0x000010000121aab4 in VM_HandshakeAllThreads::doit() ()
>> #7  0x000010000198a8d7 in VM_Operation::evaluate() ()
>> #8  0x0000100001990706 in VMThread::evaluate_operation(VM_Operation*)
>> [clone .constprop.64] ()
>> #9  0x0000100001990e7a in VMThread::loop() ()
>> #10 0x0000100001991073 in VMThread::run() ()
>> #11 0x00001000019186ff in Thread::call_run() ()
>> #12 0x000010000162e1be in thread_native_entry(Thread*) ()
>> #13 0x000000004045edb6 in
>> pthread_private::pthread::<lambda()>::operator()
>> (__closure=0xffffa00002c3df00) at libc/pthread.cc:114
>> #14 std::_Function_handler<void(),
>> pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const
>> pthread_private::thread_attr*)::<lambda()> >::_M_invoke(const
>> std::_Any_data &) (__functor=...)
>>     at /usr/include/c++/8/bits/std_function.h:297
>> #15 0x0000000040401117 in sched::thread_main_c (t=0xffff80000313a040) at
>> arch/x64/arch-switch.hh:271
>> #16 0x00000000403a71d3 in thread_main () at arch/x64/entry.S:113
>>
>> thread 0xffff800003462040 Reference Handl
>>
>> #0  sched::thread::switch_to (this=0xffffa00003b3b6f0, 
>> this@entry=0xffff800001fdd040)
>> at arch/x64/arch-switch.hh:108
>> #1  0x00000000403ff794 in sched::cpu::reschedule_from_interrupt
>> (this=0xffff800000013040, called_from_yield=called_from_yield@entry=false,
>> preempt_after=..., preempt_after@entry=...) at core/sched.cc:339
>> #2  0x00000000403ffc8c in sched::cpu::schedule () at
>> include/osv/sched.hh:1310
>> #3  0x0000000040400372 in sched::thread::wait 
>> (this=this@entry=0xffff800003462040)
>> at core/sched.cc:1214
>> #4  0x00000000403e38df in
>> sched::thread::do_wait_until<sched::noninterruptible,
>> sched::thread::dummy_lock, waiter::wait(sched::timer*)
>> const::{lambda()#1}>(sched::thread::dummy_lock&,
>> waiter::wait(sched::timer*) const::{lambda()#1}) (pred=..., mtx=<synthetic
>> pointer>...) at include/osv/sched.hh:939
>> #5  sched::thread::wait_until<waiter::wait(sched::timer*)
>> const::{lambda()#1}>(waiter::wait(sched::timer*) const::{lambda()#1})
>> (pred=...) at include/osv/sched.hh:1077
>> #6  waiter::wait (tmr=0x0, this=0x200011800870) at
>> include/osv/wait_record.hh:46
>> --Type <RET> for more, q to quit, c to continue without paging--
>> #7  condvar::wait (this=0xffffa00003046680,
>> user_mutex=0xffffa000030f9a40, tmr=<optimized out>) at core/condvar.cc:43
>> #8  0x0000100001637213 in os::PlatformEvent::park() ()
>> #9  0x00001000015d48e8 in Monitor::IWait(Thread*, long) ()
>> #10 0x00001000015d5aa6 in Monitor::wait(bool, long, bool) ()
>> #11 0x000010000134cd7e in JVM_WaitForReferencePendingList ()
>> #12 0x0000200001c0ffb7 in ?? ()
>> #13 0x00002000118009f8 in ?? ()
>> #14 0x00002000118009a8 in ?? ()
>> #15 0x0000000000000000 in ?? ()
>>
>> thread 0xffff800003583040 Finalizer
>>
>> #0  sched::thread::switch_to (this=0xffffa00003b3b6f0, 
>> this@entry=0xffff80000442d040)
>> at arch/x64/arch-switch.hh:108
>> #1  0x00000000403ff794 in sched::cpu::reschedule_from_interrupt
>> (this=0xffff800000013040, called_from_yield=called_from_yield@entry=false,
>> preempt_after=..., preempt_after@entry=...) at core/sched.cc:339
>> #2  0x00000000403ffc8c in sched::cpu::schedule () at
>> include/osv/sched.hh:1310
>> #3  0x0000000040400372 in sched::thread::wait 
>> (this=this@entry=0xffff800003583040)
>> at core/sched.cc:1214
>> #4  0x00000000403e38df in
>> sched::thread::do_wait_until<sched::noninterruptible,
>> sched::thread::dummy_lock, waiter::wait(sched::timer*)
>> const::{lambda()#1}>(sched::thread::dummy_lock&,
>> waiter::wait(sched::timer*) const::{lambda()#1}) (pred=..., mtx=<synthetic
>> pointer>...) at include/osv/sched.hh:939
>> #5  sched::thread::wait_until<waiter::wait(sched::timer*)
>> const::{lambda()#1}>(waiter::wait(sched::timer*) const::{lambda()#1})
>> (pred=...) at include/osv/sched.hh:1077
>> #6  waiter::wait (tmr=0x0, this=0x2000119006c0) at
>> include/osv/wait_record.hh:46
>> #7  condvar::wait (this=0xffffa00003046b00,
>> user_mutex=0xffffa000030f9d00, tmr=<optimized out>) at core/condvar.cc:43
>> #8  0x0000100001637213 in os::PlatformEvent::park() ()
>> #9  0x00001000016002d5 in ObjectMonitor::wait(long, bool, Thread*) ()
>> #10 0x00001000018c1132 in ObjectSynchronizer::wait(Handle, long, Thread*)
>> ()
>> #11 0x0000100001345778 in JVM_MonitorWait ()
>> #12 0x0000200001c0ffb7 in ?? ()
>> #13 0x0000200011900968 in ?? ()
>> #14 0xffff900003240000 in ?? ()
>> #15 0x0000200011900910 in ?? ()
>> #16 0x00002000119008e8 in ?? ()
>> #17 0x0000000000000000 in ?? ()
>>
>> There is still running thread VM Thread calling sem_trywait.
>>
>> After more research, I have found about this JEP -
>> http://openjdk.java.net/jeps/312 - which was gradually added to Java 10
>> and probably enhanced in next versions. As I understand in essence it
>> replaces so called 'safepoints (global lock for "stop the world events")
>> with a more granular mechanism - "thread local handshakes". Here is the
>> fragment from the doc:
>>
>> "A handshake operation is a callback that is executed for each
>> JavaThread while that thread is in a safepoint safe state. The callback is
>> executed either by the thread itself or by the VM thread while keeping the
>> thread in a blocked state. The big difference between safepointing and
>> handshaking is that the per thread operation will be performed on all
>> threads as soon as possible and they will continue to execute as soon as
>> it’s own operation is completed. If a JavaThread is known to be running,
>> then a handshake can be performed with that single JavaThread as well."
>>
>> I think the core of the logic is in
>> https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/handshake.cpp
>>  and
>> one of the methods is in this thread stacktrace:
>> #0  0x00000000403e236c in lockfree::mutex::unlock 
>> (this=this@entry=0xffffa00003553e90)
>> at /usr/include/c++/8/bits/atomic_base.h:512
>> #1  0x00000000403e36bc in std::lock_guard<lockfree::mutex>::~lock_guard
>> (this=<synthetic pointer>, __in_chrg=<optimized out>) at
>> /usr/include/c++/8/bits/std_mutex.h:167
>> #2  lock_guard_for_with_lock<lockfree::mutex>::~lock_guard_for_with_lock
>> (this=<synthetic pointer>, __in_chrg=<optimized out>) at
>> include/osv/mutex.h:88
>> #3  semaphore::trywait (this=0xffffa00003553e80, units=units@entry=1) at
>> core/semaphore.cc:64
>> #4  0x00000000404656c1 in sem_trywait (s=<optimized out>) at
>> /usr/include/c++/8/bits/unique_ptr.h:342
>> #5  0x0000100001702bf2 in PosixSemaphore::trywait() ()
>> *#6  0x000010000121aab4 in VM_HandshakeAllThreads::doit() ()*
>> #7  0x000010000198a8d7 in VM_Operation::evaluate() ()
>> #8  0x0000100001990706 in VMThread::evaluate_operation(VM_Operation*)
>> [clone .constprop.64] ()
>> #9  0x0000100001990e7a in VMThread::loop() ()
>> #10 0x0000100001991073 in VMThread::run() ()
>> #11 0x00001000019186ff in Thread::call_run() ()
>> #12 0x000010000162e1be in thread_native_entry(Thread*) ()
>> #13 0x000000004045edb6 in
>> pthread_private::pthread::<lambda()>::operator()
>> (__closure=0xffffa00002c3df00) at libc/pthread.cc:114
>> #14 std::_Function_handler<void(),
>> pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const
>> pthread_private::thread_attr*)::<lambda()> >::_M_invoke(const
>> std::_Any_data &) (__functor=...)
>>     at /usr/include/c++/8/bits/std_function.h:297
>> #15 0x0000000040401117 in sched::thread_main_c (t=0xffff80000313a040) at
>> arch/x64/arch-switch.hh:271
>> #16 0x00000000403a71d3 in thread_main () at arch/x64/entry.S:113
>>
>> Now, there is a way to switch off this new mechanism (which clearly is a
>> default in Java 12) and get back to the "safepoints" one by adding this
>> option: -XX:-ThreadLocalHandshakes
>>
>> When I add this option the jetty app that used to hang, boots just fine.
>>
>> So there is either a bug on Java side (but why it works on Linux?) or
>> there is some bug on OSv side possibly due to some difference with Linux
>> how some of the related calls are implemented.
>>
>> Lastly, I have found this code in Java which seems, even more, to confirm
>> that signal-related difference in sem_trywait implementation does not play
>> a role here:
>>
>> bool PosixSemaphore::trywait() {
>>   int ret;
>>
>>   do {
>>     ret = sem_trywait(&_semaphore);
>>   } while (ret != 0 && errno == EINTR);
>>
>>   assert_with_errno(ret == 0 || errno == EAGAIN, "trywait failed");
>>
>>   return ret == 0;
>> }
>>
>> On Sunday, July 21, 2019 at 5:58:41 AM UTC-4, Nadav Har'El wrote:
>>>
>>>
>>> On Sat, Jul 20, 2019 at 8:41 AM Waldek Kozaczuk <jwkoz...@gmail.com>
>>> wrote:
>>>
>>>> Are we properly handling the EINTR scenario, where thread blocked on
>>>> sem_wait() or sem_timed_wait() should get interrupted if signal sent to it?
>>>>
>>>
>>> No, I believe we do not support this correctly. Please see my other mail
>>> on some pointers on what we need to change to properly support EINTR for
>>> sem_wait() and random signals (except SIGALRM).
>>> But please try to understand whether it is possible that this code
>>> changed in Java 12, or why Java hangs waiting for it. If not, then maybe
>>> it's a problem that always existed but never caused any problems?
>>>
>>> --
> You received this message because you are subscribed to the Google Groups
> "OSv Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to osv-dev+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/osv-dev/d9098e04-ceca-4752-a476-6a31cdb0c5c8%40googlegroups.com
> <https://groups.google.com/d/msgid/osv-dev/d9098e04-ceca-4752-a476-6a31cdb0c5c8%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to osv-dev+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/osv-dev/CANEVyjvLZUk5OUwPGSxdbDVPz01s%3D0%3Db58ygQ5tqdPmTLh59uQ%40mail.gmail.com.

Reply via email to