I haven't given up on "hard tail calls" aka "proper implementation of tail 
calls".  Other things are more pressing—as always, it seems.  Here's a quick 
update, so we can mull this over as we look for an opportune time to warm up 
the project.

Michael Haupt passed me this useful link to Guy Steele's 2011 blog entry about 
tail calls:
  http://www.eighty-twenty.org/index.cgi/tech/oo-tail-calls-20111001.html

The comments are very interesting, and contain other good links, including 
Matthias Felleisen's ECOOP 2004 keynote.

As I recall, Doug Lea noted at the 2010 JVM Language Summit that tail calls 
would seem to allow work-stealing algorithms to be implemented somewhat more 
cleanly or efficiently.  (How's that for tentative?)  A worker thread goes from 
task to task in a data-driven way, similar to the examples given (e.g., in 
Guy's article) of linked list traversal.  Currently you need an outer loop to 
"pump" the task sequence, and each task must return to the loop to allow the 
next task to run.  If you had tail calls, you would not need to force the task 
sequence into a form that the outer loop can understand, which would allow more 
flexibility in organizing the task sequence.  More freedom in shaping the code 
can translate into better-tuned algorithms.
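To make the "pump" concrete, here is a minimal Java sketch (the names Task, 
Worker, and countdown are hypothetical, not Doug's actual code): each task 
returns its successor to an outer driver loop, which is exactly the loop that 
hard tail calls would let us delete.

```java
// Sketch of the "pump" pattern: each task must return its successor to the
// outer driver loop.  With hard tail calls, run() could transfer control
// directly to the next task and the loop in drive() would not be needed.
interface Task {
    Task run();  // perform this task; return the next one, or null when done
}

class Worker {
    // The outer driver loop that pumps the task sequence.
    static int drive(Task t) {
        int steps = 0;
        while (t != null) {
            t = t.run();   // control must come back here between tasks
            steps++;
        }
        return steps;
    }

    // A data-driven chain of tasks, akin to linked-list traversal.
    static Task countdown(int n) {
        return n == 0 ? null : () -> countdown(n - 1);
    }
}
```

Note that the task sequence must be contorted into whatever shape drive() 
understands; with tail calls, each task could hand off control in any shape 
the algorithm prefers.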

However, this effect may in practice be swamped by other performance issues, 
such as synchronization.  The main claim here is that there can come a point, 
when other overheads are minimized, that the lack of tail calls blocks the next 
optimization.

The point is analogous to the discussion about driver loops versus tail-calling 
iteration patterns.  The multiple threads add a dimension.  Tail calling gives 
a thread (more properly, a stack frame within a thread) a way to transition to 
the next task without passing through a trampoline (aka, a driver loop).  It 
also allows modest recursion (through a data structure of modest depth), 
without the overhead of creating multiple copies of the driver loop.  This 
latter point is probably the most interesting, since work-stealing task 
structure is likely to be "lumpy" (somewhat nested) and not always a clean 
linear list.

Here is a wiki page on tail calls that we collected at the 2010 Summit:
  http://wiki.jvmlangsummit.com/Why_Tailcalls

I have been looking for a "killer application" for tail calls on the JVM.  
Doug's example may turn out to be one.  Here's another candidate:  Overcoming 
class file size limits.

Suppose you are compiling your favorite high-level language to the JVM, and you 
start running into the various size limits in class files.  (Usually they come 
from the 16-bit indexes in various places.)  What do you do, dear?  (H/T to 
Maurice Sendak, http://www.amazon.com/What-Do-You-Dear/dp/0064431134 .)

You could growl a curse and try to emit smaller methods, by changing earlier 
parts of your compilation stack.  Or you could bail out and use an AST 
interpreter.  In fact, these are your only options at present.

(To defend the Status Quo:  JVMs are tuned to work well on small methods, and 
so they have a synergistic bias toward the 16-bit limit.  Gargantuan methods 
make JITs unhappy, so maybe turning the problem back on language implementors 
is just a way of driving them away from the edge of the NP-hard canyon.)

It would be great if there were an ASM module which could automatically render 
an over-sized "mega-method" into some JVM-compliant form.  What would this 
take?  A number of things:

1. The proposed mega-method would have to be broken into a control flow graph 
of basic blocks.

2. The basic blocks would have to be packed into one or more 
automatically-generated normal "mini-methods" (of JVM-compliant sizes).

3. Control transitions between basic blocks would have to be re-coded, if they 
cross from one mini-method to another.
3a. These transitions would not be allowed to blow the stack.
3b. The live data in local variables (as of the given transition) would have to 
be transmitted somehow.

4. Something clever would have to be done to make the runtime stack walker see 
the originally named mega-method.

Step 3b is non-trivial, and probably requires boxing on the heap, in general.  
(Remember that you can have up to 2^16-1 locals, but only 2^8-1 parameters.)  
In the case where there are only a few locals, they can just be passed in 
registers as parameters.
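A boxed frame for step 3b might look like the following sketch (the Frame 
class and its fields are hypothetical, just to show the shape): when a 
transition carries more live locals than the 255-parameter limit allows, pack 
them into one heap object and pass a single reference.

```java
// Hypothetical carrier object for a cross-mini-method transition: the
// target basic block's selector plus the live locals, boxed on the heap
// so the transition needs only one reference parameter.
final class Frame {
    final int block;        // selector of the target basic block
    final Object[] locals;  // live locals as of the transition

    Frame(int block, Object... locals) {
        this.block = block;
        this.locals = locals;
    }
}
```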

Step 3a would seem to require tail calls.  Specifically, the act of leaving one 
mini-method M1 to go to a block in another mini-method M2 requires a hard tail 
call from M1 to M2, passing any live data values that the M2 block requires.
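Until hard tail calls exist, that M1-to-M2 transition has to be faked.  Here 
is a toy sketch of steps 1 through 3 (all names hypothetical), with a 
trampoline standing in for the missing tail call:

```java
// A summing loop, imagined as an over-sized mega-method, split into one
// mini-method per basic block.  Lacking hard tail calls, cross-method
// transitions are trampolined: each block returns its successor's id
// instead of tail-calling it.
class SplitMethod {
    static final int EXIT = -1;

    // Block 0: loop test.  locals[0] = i, locals[1] = sum.
    static int block0(int[] locals) {
        return locals[0] < 10 ? 1 : EXIT;
    }

    // Block 1: loop body, then back-edge to block 0.
    static int block1(int[] locals) {
        locals[1] += locals[0];
        locals[0]++;
        return 0;
    }

    // The trampoline standing in for the hard tail calls we don't have.
    static int run() {
        int[] locals = {0, 0};
        int block = 0;
        while (block != EXIT)
            block = (block == 0) ? block0(locals) : block1(locals);
        return locals[1];  // sum of 0..9
    }
}
```

With hard tail calls, block1 could end with a tail call to block0 (step 3a, 
no stack growth), passing the live locals directly (step 3b), and run() would 
shrink to a single call to block0.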

The corresponding act of entering a given target block in M2 also requires some 
thought.  Unless you are lucky or clever, or unless you render one basic block 
to one mini-method, M2 will (logically) have multiple entry points, one for 
each basic block that is a successor of a basic block in some other mini-method 
M1.  Representing such multiple entry points is tricky.  Here are three ways to 
do it:

A. Add an integer parameter (basic block selector index) to M2, and structure 
M2 as a Big Switch.  (This is a common technique; Rhino does it, for example.)  
Then M1 passes the selector as an extra parameter.

B. Add multiple entry points to the class file format.  In particular, allow M2 
to declare multiple entry points, perhaps with an entry point instruction which 
looks like a Big Switch.  Compile the rest of M2 as in A.  Link calls into M2 
using compound symbolic references such as "M2.42", where 42 is the switch 
selector.  (Note that dot '.' is illegal in method names.)

C. Structure M2 as in A, but run the call from M1 through invokedynamic, 
without the extra parameter.  When linking the call from M1, bind the extra 
parameter into the method handle for M2.  Ask your friendly neighborhood JVM 
implementor to optimize this method handle, at link time, into a form which is 
equivalent to the extra entry points in B.
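Options A and C can be sketched with today's APIs (class and method names 
below are hypothetical): m2 is a mini-method structured as a Big Switch over 
a block selector, and entryPoint() binds the selector into a method handle at 
link time, the way an invokedynamic bootstrap could under option C.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class MiniMethod {
    // Option A: the extra leading parameter selects the basic block.
    static int m2(int block, int x) {
        switch (block) {
            case 0:  return x + 1;   // basic block 0
            case 1:  return x * 2;   // basic block 1
            default: throw new IllegalArgumentException("no block " + block);
        }
    }

    // Option C: a link-time-bound entry point, playing the role of "M2.1".
    static MethodHandle entryPoint(int block) {
        try {
            MethodHandle mh = MethodHandles.lookup().findStatic(
                MiniMethod.class, "m2",
                MethodType.methodType(int.class, int.class, int.class));
            // Bind the selector in; callers see a plain (int)->int handle.
            return MethodHandles.insertArguments(mh, 0, block);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }

    // Helper to invoke the bound (int)->int handle without checked Throwable.
    static int call(MethodHandle mh, int x) {
        try {
            return (int) mh.invokeExact(x);
        } catch (Throwable t) {
            throw new AssertionError(t);
        }
    }
}
```

The JVM optimization asked for in C would amount to constant-folding the 
bound selector through the switch, leaving a direct jump to the block.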

BTW, an early proposal for JVM support for closures had a similar flavor:  We 
might have created an instruction for building a closure which would (a) 
mention a BCI in the same method that provides the closure body, and (b) 
specify which locals are to be bound into the closure.  The attractive thing 
about such a design is that it does not require the overhead of building a fake 
method body (really, a mini-method) for each lambda.  The unattractive thing is 
that it requires lots of other new structure in the JVM spec.  I like where we 
fell out with using invokedynamic to render closures.  It might be nice, 
though, to allow some kind of light-weight private/anonymous method as a 
shorthand in class files.  I don't know what that would look like.  Maybe it 
would be a CONSTANT_NameAndTypeAndBCI.

— John

P.S.  The distinction between "soft" and "hard" tail calls is described here:
  https://blogs.oracle.com/jrose/entry/tail_calls_in_the_vm

I don't care as much about soft tail calls (an elective optimization) as about 
hard tail calls (a proper implementation, which makes a contract not to blow 
the application stack under defined conditions).
_______________________________________________
mlvm-dev mailing list
mlvm-dev@openjdk.java.net
http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev