Lucifers,
We have discussed Clownfish interfaces already, but here's a quick recap. In
order to support callbacks from Clownfish into the host language, we currently
rely on class-based inheritance. In the host language, a Clownfish class is
subclassed, adding method implementations written in the host language. The
Clownfish compiler creates wrappers that call into the host language for each
such method. This works for Perl, and should work for other dynamic languages
like Python or Ruby as well.
This approach obviously doesn't work for host languages that don't support
class-based inheritance, like Go or Rust. One possible solution is to add
support for OOP interfaces which are the typically used by these languages to
provide dynamic dispatch. Interfaces are also useful on the Clownfish side alone.
I researched how several languages implement interfaces to come up with a
performant solution suitable for the Clownfish object system. Basic operations
on interfaces include:
- Static casts from objects to interface types. (The cast is known
to succeed at compile time.)
- Static casts from interface objects to the root object type.
- Dynamic casts from objects to interface types. (The object may
not implement the target interface and the cast may fail.)
- Invocation of interface methods.
For some of my earlier thoughts, see CLOWNFISH-12:
https://issues.apache.org/jira/browse/CLOWNFISH-12
Embedding multiple vtables (C++)
--------------------------------
C++ doesn't know interfaces as a distinct language concept, but they can be
easily emulated with multiple inheritance. An interface is simply a class
without member variables. How multiple inheritance is implemented in C depends
on the compiler, but the typical approach is described in Stroustrup's paper
"Multiple Inheritance for C++" (1999):
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.4735
The basic idea is to embed multiple vtables into each object and to implement
up- and downcasting by adjusting the pointer to an object by fixed offsets.
(The complications resulting from the diamond problem don't apply to
interfaces since they don't have member variables.)
We could use the same idea for Clownfish. Static casts only require to apply
an offset to the object pointer. Interface methods are invoked just like
normal methods. Dynamic casts are more costly. They're rather slow in some C++
implementations where the whole inheritance tree is searched. Another problem
is how to initialize the extra vtable pointers. Currently, we have a single
method Class_Make_Obj to initialize objects. With multiple vtable pointers, we
would need separate code paths for object initialization, or initialize
interface vtables lazily.
Fat pointers (Go, Rust)
-----------------------
The approach chosen by Go and Rust is to implement interface objects as "fat
pointers", that is a pair consisting of the pointer to the actual object and a
pointer to the interface vtable (itable) for dynamic dispatch. When casting an
object to interface type dynamically, Go requires a somewhat expensive hash
table lookup to find the itable. Rust doesn't support dynamic casts at all.
Static casts and method invocations are simple and fast.
The major downside is that fat pointers require double the storage space. This
is especially inconvenient for containers like arrays of interface objects.
Direct lookup (Java)
--------------------
Like with C++, the implementation of interfaces is up to the JVM, but I found
the paper "Efficient Implementation of Java Interfaces: Invokeinterface
Considered Harmless" (2001) to be enlightening:
http://dl.acm.org/citation.cfm?id=504291
http://www.research.ibm.com/people/d/dgrove/papers/oopsla01.pdf
Under this approach, there's no special representation for interface objects.
They're simply object pointers and casts are a no-op. When an interface method
is invoked, the method is looked up dynamically by name in a hash table. The
key to make this operation fast is to employ a hash table of function
pointers, pointing to small generated code stubs that invoke the actual
method, or, in the case of hash collisions, execute a short if-else-sequence.
I have no idea whether this is still the approach of choice in modern JVMs.
I'd be curious if anyone has additional pointers.
My proposal
-----------
I really like the simplicity of the direct lookup approach from the IBM paper
that makes no distinction between normal objects and interface objects and
avoids the overhead of fat pointers. But stubs that directly invoke interface
methods are impossible to implement in C without relying on
implementation-defined behavior or even assembler. It's also impossible to add
a default method to an interface, and invoke it on an object of a class from
an existing binary without recompilation. One of the important reasons for
default methods is to allow this kind of interface evolution.
But hash tables of function pointers with hard-coded collison resolution can
still be used to lookup itables efficiently. Here's how it works:
- In every Class struct there's a small hash table with maybe 4-8
slots. When invoking an interface method, an index into this table
is computed from the full name of the interface like
"Lucy::Analysis::Analyzer" and the table size. The important
observation is that this index is known at compile time. This means
that the hash function and table size must always be the same for a
certain Clownfish version, of course.
- The function pointer in the hash table is called. It will point to
generated code that looks like
lucy_Analyzer_ITABLE*
MYPARCEL_MyAnalyzer_ITABLE_GETTER_3(cfish_Obj *obj,
cfish_Interface *iface) {
if (iface == LUCY_ANALYZER) {
return MYPARCEL_MyAnalyzer_Analyzer_ITABLE;
}
else if (...) {
// Possibly other interfaces implemented by MyAnalyzer
// that happen to hash to the same slot. But typically,
// there will be only a single interface.
}
else {
// Other ways to handle errors are possible.
CFISH_THROW(CFISH_ERR,
"Class %o doesn't implement interface %o",
CFISH_Obj_Get_Class_Name(obj);
CFISH_Interface_Get_Name(iface));
}
}
- MYPARCEL_MyAnalyzer_Analyzer_ITABLE is populated dynamically during
Clownfish bootstrap, handling default methods and allowing
interface evolution.
- The interface method is looked up in the returned itable, using
a global offset variable, similar to normal Clownfish method
dispatch.
Invoking interface methods this way should result in about twice the overhead
of normal method invocation. It's also possible to combine this with the fat
pointer approach, avoiding the itable lookup for subsequent interface method
calls.
Host language callbacks
-----------------------
I can see two approaches. The first one autogenerates a "host" class for every
interface. These classes contain a handle pointing to the host language object
(SV* for Perl, registry index for Go) and have autogenerated method
implementations that always call into the host language. Converting a host
language object to Clownfish requires an initial interface to be specified.
It's not possible to dynamically cast the resulting Clownfish object to other
interface types the host language object may implement.
In the second approach, there's only a single Clownfish class for host
objects. A host object can also be directly converted to Obj without
specifying an interface type. When casting the host object to an interface
type dynamically, the itable lookup is modified to check whether the host
object actually implements the interface, then returns the appropriate itable.
The problem with this approach is that it doesn't map to languages like Rust
(or C++ without RTTI) that don't support dynamic casts.
In both approaches, converting a host object to Clownfish always creates a new
Clownfish object. This is more expensive than the current Perl implementation
which caches the Clownfish object, but allows standard Perl objects containing
a blessed hash. Converting back to the host language simply extracts the handle.
To make default methods work, there's generated code on the host language side
of an interface that calls the Clownfish implementation directly. This means
that calling a default method from Clownfish results in a
Clownfish-Host-Clownfish roundtrip but this should be acceptable. This
overhead could be avoided for host languages that allow introspection by
creating a custom itable for every host-class/interface combination, similar
to the way we currently register Perl classes in the Clownfish registry. But I
don't think it's worth the effort.
Concluding remarks
------------------
If we decide to switch to interfaces for host language callbacks, I intend to
completely remove the ability to subclass Clownfish classes from languages
like Perl. This means to remove some really well-thought-out code, but if we
want to expand our scope to languages without class-based inheritance, we
shouldn't support features that only work for some host languages. It also
means that the Perl callback mechanism will become slower, though it probably
won't be noticable given how slow Perl method calls are. The fact that users
can write normal Perl classes without the need for inside-out member variables
weighs up for that.
I think I can implement the approach described above in a few months for a 0.7
release, including Perl and Go support. Unless there are major objections to
my plan, I'll just start on a separate branch. So if there are unforeseen
obstacles, we don't have to revert commits on the master branch.
We will have to redesign all Lucy classes that are meant to be overridable
from the host language. For that, I'll need feedback from the rest of the
community, especially Marvin.
I don't plan to support subinterfaces (interfaces implementing other
interface) in the first iteration, but this should be straightforward to add.
I also may omit features from the initial implementation that aren't necessary
for Lucy callbacks.
Nick