Re: Next steps after 2.2.1 release

Peter Firmstone Thu, 11 Apr 2013 04:15:17 -0700

Gregg,

Thanks again for your support.


I refactored LookupDiscovery and tidied up LookupLocatorDiscovery.

If you get some time, I could use a hand with other classes you'vealready fixed.

I'm working on MailboxImpl presently, there's some very dubious code,Threads being started from inside their constructors, called from staticinit methods from within MailboxImpl's constructor.

I know some would prefer me to prove something is broken before fixingit, providing tests that prove the failure, but this isn't an enterpriseproject and I lack the resources for such things, there's always theoption of running a 2.2 maintenance branch for those who'd like to waitlonger before upgrading.


Regards,

Peter.

On 11/04/2013 2:00 AM, Gregg Wonderly wrote:

I just want to extend this conversation a bit by saying that nearly everything about River is
"concurrently accessed". There are, of course several places, where work is done by one
thread, at a time, but new threads are created to do that work, and that means that
"visibility" has to be considered.

I won't say that every single field in every class in River needs to be final or volatile, but that should not be considered an extreme.
Specifically, you might see code execute just fine without appropriate concurrency design, and then it will suddenly break when a new optimization
appears on the scene, reordering something under the covers and creating an intangible behavior. Some "visibility bugs" might not ever
manifest because of other "happens before" and "cache line sync" activities that happen implicitly based on the "current
design" or "thread model". We can "be happy" with "it ain't broke, so don't fix it", but I don't think that's
very productive.

I personally, have been beating on various parts of Jini in my "fork" because of
completely unpredictable results in discovery and discovery management. I've written, rewritten,
debugged and stared at that code till I was blue in the face, because my ServiceUI desktop
application just doesn't behave like it should. Some of it is missing lifecycle management that
was not in the original services, because System.registerShutdownHook() hasn't been used. But
other parts are these race conditions and thread scheduling overlaps (or underlaps) which keep
discovery and notification from happening reliably. There are lots of different reasons why
people might not be "complaining" about this stuff, but I would contend that the fact
that there are many examples of people forking and extending Jini, which to me, reflects the fact
that there are things that aren't correct, or functional in the wild, and this causes them to jump
over the cliff and never look back.

We are at that point today, and Peter's continued slogging through the motions
to track down and discover where the issues actually are, is an astronomical
effort! I have been very involved in several different, new work opportunities
that have kept me from jumping in to participate in dealing with all of these
issues, as I have really wanted to.

Gregg Wonderly

On Apr 8, 2013, at 3:19 PM, Peter<[email protected]> wrote:

Thanks Gregg,

You've hit the nail on the head, this is exactly the issue I'm having.

So I've been fixing safe publication in constructors by making fields final or volatile 
and ensuring "this" doesn't escape, fixing synchronisation on collections etc 
during method calls.

To fix deadlock, I investigate immutable non blocking data structures with 
volatile publication, if future state doesn't depend on previous state, if it 
does a CAS atomic reference can be used instead of volatile.

Often i find synchronization is quite acceptable if it is limited in scope, if 
synchronized or holding a lock while a thread is executing outside your objects 
scope of control, that's when deadlock is more likely to occur.

The polciy providers were deadlock prone, which is why they're mostly immutable 
non blocking now, any synchronization or locking is limited.

I basically follow Doug Lea's concurrency in practise guidelines.

For debugging I follow Cliff Click's reccommendations.

Unfortunately fixing concurrency bugs means finding a trace of execution, 
identifying all classes and inspecting the code visually.  Findbugs identifies 
cases of inadequate sychronization using static analysis.

Regards,

Peter.

----- Original message -----

On 4/7/2013 7:03 PM, Greg Trasuk wrote:

I'm honestly and truly not passing judgement on the quality of the code. I
honestly don't know if it's good or bad. I have to confess that, given that
Jini was written as a top-level project at Sun, sponsored by Bill Joy, when
Sun was at the top of its game, and the Jini project team was a "who's-who" of
distributed computing pioneers, the idea that it's riddled with concurrency
bugs surprises me. But mainly, I'm still trying to answer that question - "How
do I know if it's good?" Here's what I'm doing: - I'm attempting to run the
tests from "tags/2.2.0" against the "2.2" branch. When I have confidence in
the "2.2" branch, I'll publish the results, ask anyone else who's interested
to test it, and then call for a release on "2.2.1" - After that, the
developers need to reach consensus about how to move forward. Cheers, Greg.

This is an important issue to address.  I know a lot of people here probably
don't participate on the Concurrency-interest mailing list that has a wide range
of discussion about the JLS vs the JMM and what the JIT compilers actually do to
code these days.

The number one issue that you need to understand, is that the optimizer is
working against you more and more these days if you don't have JMM details
exactly write.  Statements are being reordered more and more, including actual
"assignments" which can expose uninitialized data items in "racy" concurrent
code.  The latest example is the  Thread.setName()/Thread.getName() pair.  They
are most likely always to be accessed by "other threads", yet there is no
synchronization on them, including no "visibility" control with volatile even.
What this means, is that if setName() and getName() are being called in a racy
environment, the setName, will assign the array that is created to copy the
characters into, before the arraycopy of the data occurs, potentially exposing
an uninitialized name to getName().

There are literally hundreds of places in the JDK that still have these kinds of
races going on, and no one at Oracle, based on how people are acting, appears to
be responsible for dealing with it. The Jini code, has many many of the same
issues that just randomly appear in stress cases on "slower" or "faster"
hardware, depending on the issue.

When you haven't got sharing and visibility covered correctly, the JIT code
rewrites can make execution order play a big part in conflating what you "see"
happening verses what the "code" says, to you, should happen.

There are some very simple things to get the JIT out of the picture.  One of
these, is to actually open the source up in an IDE and declare every field
final.  If that doesn't work due to 'mutation' of values, change those fields to
'volatile' so that it will compile again.    Then run your tests and you will 
now
greatly diminish reordering and visibility issues so that you can just get to
the simple "was it set correctly, before it was read" and "did we provide the
correct atomicity for that update" kinds of questions that will help you
understand things better when code is misbehaving.

This is the kind of thing that Peter has been working through because the usage
of the code in real life has not continued in the same way that it did when the
code was written, and the JMM in JDK5 has literally broken so much software, all
over the planet, that used to work quite well, because there wasn't a formal
definition of "happens before".    Now that there is, the compiler optimizations
are against you if you don't get it right.  The behaviors you will experience,
because of reorderings that are targeted at all out performance (minimize
traffic in and out of the CPU through memory subsystems), can create completely
unexpected results.  Intra-thread semantics are kept correct, but inter-thread
execution will just seem intangible because stuff will not be happening in the
order the "code" says it should.

Gregg Wonderly

Re: Next steps after 2.2.1 release

Reply via email to