I don't know if this topic has already been beaten to death, or is
otherwise not very interesting or relevant here, but alas...
the question, though, is what the "ideal" level of abstraction (and
"generality") in a VM is.
for example, "LLVM" is fairly low level (using a statically-typed
SSA-form as an IR, and IIRC a partially-decomposed type-system).
the JVM is a little higher level, being a statically-typed stack machine
(using "primitive" types for stack elements and operations), with an
abstracted notion of in-memory class layout;
MSIL/CIL is a little higher still, abstracting the types out of the
stack elements (all operations work against inferred types, and unlike
the JVM there is no notion of "long and double take 2 stack slots", ...).
both the JVM and MSIL tend to declare types "from the POV of their point
of use", rather than from their point of declaration. hence, the "load"
or "call" operations directly reference a location giving the type of
the variable.
similarly, things like loads / stores / method-calls/dispatching / ...
are resolved prior to emitting the bytecode.
in my VMs, I have tended to leave the types "at the point of
declaration", hence all the general load/store/call operations "merely"
link to a symbolic-reference.
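for illustration, a minimal sketch (in C, all names hypothetical) of
what such a symbolic reference might look like: the load op carries
only a name, and the VM resolves (and caches) the declaration lazily
at its first use:

#include <stdio.h>
#include <string.h>

/* a declaration, kept at its point of declaration */
typedef struct Decl { const char *name, *type; int value; } Decl;

/* a symbolic reference, as embedded in a load/store/call op */
typedef struct SymRef {
    const char *name;
    Decl *resolved;              /* cached after the first lookup */
} SymRef;

static Decl decls[] = { { "a", "int", 2 }, { "b", "int", 3 } };

static Decl *resolve(SymRef *ref) {
    size_t i;
    if (!ref->resolved)          /* lazy: first use links the ref */
        for (i = 0; i < sizeof decls / sizeof decls[0]; i++)
            if (!strcmp(decls[i].name, ref->name))
                ref->resolved = &decls[i];
    return ref->resolved;
}

int main(void) {
    SymRef ra = { "a", NULL }, rb = { "b", NULL };
    printf("%d\n", resolve(&ra)->value + resolve(&rb)->value); /* 5 */
    return 0;
}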
one of my attempts (a VM which never got fully implemented) would have
pre-resolved all scoping (as in the JVM or .NET), but it ran into
problems WRT a complex scoping model; I have not generally done this.
my current VM only does so for the lexical scope, which is treated
conceptually as a stack:
all variable declarations are "pushed" to the lexical environment, and
"popped" when a given frame exits;
technically, function arguments are pushed in left-to-right order,
meaning that (counter-intuitively) their index numbers are the reverse
of their argument position;
unlike in JBC or MSIL, the index does not directly reference a
variable's declaration, merely its relative stack position; hence the
declaration also has to be "inferred" from the index;
note that it being (conceptually) a stack also does not imply it is
physically also represented as a stack.
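a minimal sketch (hypothetical names) of such a lexical-environment
stack, showing the reversed argument indices:

#include <stdio.h>
#include <string.h>

#define ENV_MAX 256

static const char *env[ENV_MAX];   /* declared names, as a stack */
static int env_top = 0;

static void env_push(const char *name) { env[env_top++] = name; }
static void env_pop(int n)             { env_top -= n; }

/* resolve a name to its relative stack index (0 = most recent
 * declaration); the declaration itself must be inferred from it */
static int env_index(const char *name) {
    int i;
    for (i = env_top - 1; i >= 0; i--)
        if (!strcmp(env[i], name))
            return env_top - 1 - i;
    return -1;
}

int main(void) {
    /* entering a call to f(x, y, z): args pushed left-to-right */
    env_push("x"); env_push("y"); env_push("z");
    printf("x=%d y=%d z=%d\n",
           env_index("x"), env_index("y"), env_index("z"));
    /* prints "x=2 y=1 z=0": indices are the reverse of position */
    env_pop(3);
    return 0;
}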
hence, in the above case, the bytecode is not too far removed from the
source code.
I guess one can argue that, as one moves up the abstraction ladder, the
amount of work needed to build the VM becomes larger (it deals with far
more semantic issues, and is arguably more specific to the particular
languages in use, ...).
I suspect it is much less clear-cut than this though; for example,
targeting a dynamic language (such as Scheme or JavaScript) to a VM
such as LLVM or JBC (pre JDK7) essentially requires implementing much
of the VM within the VM, and may ultimately reduce how effectively the
host VM can optimize the code (rather than merely dealing with a
construct directly, the host VM now also has to deal with how the guest
VM's constructs were implemented on top of it).
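to illustrate (a rough sketch, not any specific implementation): a
dynamic language compiled to a statically-typed VM typically ends up
re-introducing its own tagged value representation on top of the host's
types, turning even "a+b" into a runtime dispatch the host can no
longer see through:

#include <stdio.h>

typedef enum { T_FIXNUM, T_FLONUM, T_STRING } Tag;

/* the guest VM's "everything is a tagged box" representation */
typedef struct Value {
    Tag tag;
    union { long fix; double flo; const char *str; } u;
} Value;

static Value dyn_add(Value a, Value b) {
    Value r;
    if (a.tag == T_FIXNUM && b.tag == T_FIXNUM) {
        r.tag = T_FIXNUM; r.u.fix = a.u.fix + b.u.fix; return r;
    }
    if (a.tag == T_FLONUM && b.tag == T_FLONUM) {
        r.tag = T_FLONUM; r.u.flo = a.u.flo + b.u.flo; return r;
    }
    /* ... mixed-type and string cases elided ... */
    r.tag = T_FIXNUM; r.u.fix = 0; return r;
}

int main(void) {
    Value a = { T_FIXNUM, { 2 } }, b = { T_FIXNUM, { 3 } };
    printf("%ld\n", dyn_add(a, b).u.fix);   /* prints 5 */
    return 0;
}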
a secondary issue is when the restrictions of such a VM (particularly
the JVM) impede what can be effectively expressed within the VM, running
counter to the notion that higher abstraction necessarily equates to
greater semantic restrictions.
the few cases I can think of where the argument does make a difference
include:
the behavior of variable scoping (mostly moot for the JVM, which pretty
much hard-codes this);
the effects of declaration modifiers (moot regarding the JVM and .NET,
which manage modifiers internally);
the "shape" of the type-system and numeric tower (likewise mostly moot:
although neither the JVM nor .NET "enforces" a particular type-system,
neither gives much room for doing it much differently; likewise, in
LLVM and ASM one is confined to whatever the HW provides);
the behavior of specific operators as applied to specific types. this
is arguably a merit of the JVM and .NET vs my own VMs: since both VMs
only perform operations directly against primitive types, the behavior
of mixed-type cases is de-facto left to the language and compiler. this
may ultimately be a moot point, as manual type-coercion or
scope-qualified operator overloading could achieve the same ends.
similarly, a high-level VM could also (simply) discard the notion of
built-in/hard-coded operator+type semantics, and instead expect the
compiled code to either overload operators or import a namespace
containing the desired semantics (say, built-in or library-supplied
overloaded operators). moreover, unlike the JVM and .NET strategies,
this does not mandate static typing (prior to emitting bytecode) in
order to achieve language-specific type-semantics.
in the above case (operators being a result of an implicit import), if
Language-A disallows "string+int", Language-B interprets it as "append
the string(a) with int::toString(b)", and Language-C as "offset the
string by int chars", well then, the languages can each do so without
interfering with the others.
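a rough sketch (in C, all names hypothetical) of such per-namespace
operator handlers: the "+" opcode consults whatever handlers the
current namespace has imported, so each language's choice stays local
to it:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct Value { const char *type; void *data; } Value;
typedef Value (*OpFn)(Value a, Value b);

typedef struct OpEntry {
    const char *op, *lhs, *rhs;
    OpFn fn;
    struct OpEntry *next;
} OpEntry;

/* each language (or namespace) carries its own handler chain */
typedef struct Namespace { OpEntry *ops; } Namespace;

static void ns_bind(Namespace *ns, const char *op,
                    const char *lhs, const char *rhs, OpFn fn) {
    OpEntry *e = malloc(sizeof *e);
    e->op = op; e->lhs = lhs; e->rhs = rhs; e->fn = fn;
    e->next = ns->ops; ns->ops = e;
}

static Value vm_binop(Namespace *ns, const char *op, Value a, Value b) {
    OpEntry *e;
    for (e = ns->ops; e; e = e->next)
        if (!strcmp(e->op, op) && !strcmp(e->lhs, a.type) &&
            !strcmp(e->rhs, b.type))
            return e->fn(a, b);
    fprintf(stderr, "no handler for %s %s %s\n", a.type, op, b.type);
    exit(1);
}

/* Language-B's "string+int": append int::toString(b) to the string
 * (Language-A would simply never bind this handler) */
static Value strcat_int(Value a, Value b) {
    char *buf = malloc(64);
    snprintf(buf, 64, "%s%d", (const char *)a.data, *(int *)b.data);
    return (Value){ "string", buf };
}

int main(void) {
    Namespace lang_b = { 0 };
    int three = 3;
    Value a = { "string", (void *)"str" }, b = { "int", &three }, r;
    ns_bind(&lang_b, "+", "string", "int", strcat_int);
    r = vm_binop(&lang_b, "+", a, b);
    printf("%s\n", (char *)r.data);   /* prints "str3" */
    return 0;
}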
...
or, in effect, I see fairly little compelling reason (apart from
simplicity of the VM runtime) why a lower-level VM representation would
necessarily be preferable to a higher-level one.
one may almost as well just make a VM represent a distilled-down
version of C++-style semantics (probably with some extensions and
omissions, and represented as bytecode), and make the requirement for
other languages be "figure out how to compile your code to work with
C++-like semantics..."; this is in contrast to the alternative trend,
which is to make VMs which look (more or less) like brain-damaged
versions of assembler.
so, probable features of such a VM architecture:
probably bytecode, and probably a stack machine (any "good" reason to do
otherwise?);
probably type-inferred values;
types are not declared at their point of use (they are declared at their
point of declaration);
avoidance of explicitly generated type coercions (this will be left to
the VM);
avoidance of manual handling of method dispatch (again, left to the VM;
the bytecode merely gives the VM its argument list on the stack, likely
using a "mark", as in the sketch after this list);
operators are mapped fairly directly;
built-in notions for: properties, operator overloading, typedefs,
delegates/function-pointers, scope-delegation (can be used for
Prototype-OO, implementing namespaces/import, ...), ...
...
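a minimal sketch (hypothetical opcodes) of the "mark" convention
mentioned above: the caller pushes a mark and then the arguments, and
the VM recovers the argument list by scanning down to the mark, so the
bytecode never spells out arity or dispatch details:

#include <stdio.h>

#define MARK ((void *)0)   /* assume NULL never appears as a value */

static void *stack[64];
static int sp = 0;

static void op_mark(void)    { stack[sp++] = MARK; }
static void op_push(void *v) { stack[sp++] = v; }

/* VM-side dispatch: pop back to the mark to recover the arg list */
static void op_call(const char *method) {
    int base = sp;
    while (stack[base - 1] != MARK) base--;
    printf("dispatch %s with %d args\n", method, sp - base);
    sp = base - 1;           /* pop the args and the mark */
}

int main(void) {
    int x = 1, y = 2;
    op_mark(); op_push(&x); op_push(&y);
    op_call("foo");          /* prints "dispatch foo with 2 args" */
    return 0;
}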
so, in such a case, "a+b" always compiles to, say, "load a; load b;
add;", regardless of the types in use, and probably even of whether or
not the operator takes its arguments by-reference (although this would
imply that all "loads" are "load-by-reference" rather than the more
conventional "load-by-value").
note: one could canonize the use of "load-by-reference" semantics as
well, namely that the stack is not a stack of values, but rather a stack
of value-references. say, when an addition operator is called on a pair
of references, it results in a 3rd reference (to "somewhere") which in
turn holds the value (a real VM would likely, however, optimize away
most such needless references).
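a minimal sketch (hypothetical opcodes, ints only for brevity) of such
reference-stack semantics: "load" pushes a reference to the variable,
and "add" produces a fresh reference to a result cell:

#include <stdio.h>
#include <stdlib.h>

typedef struct Value { int i; } Value;

static Value *stack[64];   /* a stack of value-references, not values */
static int sp = 0;

static void op_load(Value *var) { stack[sp++] = var; }

static void op_add(void) {
    Value *b = stack[--sp], *a = stack[--sp];
    Value *r = malloc(sizeof *r);   /* a 3rd reference, to "somewhere" */
    r->i = a->i + b->i;
    stack[sp++] = r;
}

int main(void) {
    Value a = { 2 }, b = { 3 };
    op_load(&a); op_load(&b); op_add();   /* "load a; load b; add;" */
    printf("%d\n", stack[sp - 1]->i);     /* prints 5 */
    return 0;
}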
granted, not all languages look like C-like languages:
some don't have explicit operators, handling such operations instead via
function or method calls, ...
(I am thinking here of Scheme and Smalltalk).
but this shouldn't be a huge issue: they will map their constructs
instead to whatever they do use, and probably supply the relevant
namespaces implementing their semantics. if done well, this still
shouldn't result in big ugly language-walls.
another key may be to largely separate "interface from implementation"
for many types, so two languages can see the same type but present
different local interfaces for it (the opposite direction from the
"everything is an object" concept, which seeks to assume that "int" is
an implementation of some "Integer" class). instead, a language could
alias operations to the type (as an extension of good-old operator
overloading), whereby things like method calls on an integer can also
be intercepted/overloaded, essentially allowing the language itself to
implement its mapping between "int" and "some class named Integer".
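a rough sketch (hypothetical, ints only) of such per-language "views":
both languages share the same raw int, but a method call on it
dispatches through whatever interface the calling language aliased onto
the type:

#include <stdio.h>

/* a language's local interface ("view") for the shared int type */
typedef struct IntView {
    void (*to_string)(int self, char *buf, int n);
} IntView;

/* Language-A aliases int.toString() to decimal... */
static void a_to_string(int self, char *buf, int n) {
    snprintf(buf, n, "%d", self);
}

/* ...while Language-B views the very same int as, say, hex */
static void b_to_string(int self, char *buf, int n) {
    snprintf(buf, n, "0x%x", self);
}

static const IntView lang_a = { a_to_string };
static const IntView lang_b = { b_to_string };

/* the VM dispatches a method call on an int via the caller's view */
static void call_to_string(const IntView *view, int x) {
    char buf[32];
    view->to_string(x, buf, sizeof buf);
    printf("%s\n", buf);
}

int main(void) {
    call_to_string(&lang_a, 255);   /* prints "255"  */
    call_to_string(&lang_b, 255);   /* prints "0xff" */
    return 0;
}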
another possible formalization would be, of course, that there are two
such "classes", and all operations are implicitly method calls into one
of them (the language's "view" of the type), with the second class
representing "the type as seen aliased to a class". of course, for this
to work would require a notion of "class" somewhat different from the
"traditional" OO notion (a single-point definition which may only be
extended via overloading), instead conceiving of it as "the aggregation
of all operations visible for the class type" (essentially, the
"methods" would be conceptually closer to "named overloaded operators"
than to traditional "virtual class methods").
I think it would be easier just to have free-standing "named overloaded
operators accepting any number of arguments", which mostly boils down
to "overloaded functions" (operator overloading is, in effect, merely a
special case of a plain overloaded function). (the "add" opcode in such
a case could then be seen as shorthand for a call to such a function.)
admittedly, this is basically how I had implemented operator
overloading anyways (partly; my VM is not nearly so generic, so many
operators are hard-coded, and most operator handlers are "global").
(I could write more, but I am getting burnt out on writing this at the
moment...).
or such...