I don't know if this topic has already been beaten to death, or is
otherwise not very interesting or relevant here, but alas...
the question, though, is what the "ideal" level of abstraction (and
"generality") in a VM is.
for example, "LLVM" is fairly low level (using a statically-typed
SSA-form as an IR, and IIRC a partially-decomposed type-system).
the JVM is a little higher level, being a statically-typed stack machine
(using "primitive" types for stack elements and operations), with an
abstracted notion of in-memory class layout;
MSIL/CIL is a little higher still, abstracting the types out of the
stack elements (all operations work against inferred types, and unlike
the JVM there is no notion of "long and double take 2 stack slots", ...).
both the JVM and MSIL tend to declare types "from the POV of their point
of use", rather than from their point of declaration. hence, the "load"
or "call" operations directly reference a location giving the type of
the variable.
similarly, things like loads / stores / method-calls/dispatching / ...
are resolved prior to emitting the bytecode.
in my VMs, I have tended to leave the types "at the point of
declaration", hence all the general load/store/call operations "merely"
link to a symbolic-reference.
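for illustration, a minimal sketch (in C, all names hypothetical) of
what such a symbolic reference might look like: the load op carries
only a name, and the VM resolves (and caches) the declaration lazily
at its first use:

#include <stdio.h>
#include <string.h>

/* a declaration, kept at its point of declaration */
typedef struct Decl { const char *name, *type; int value; } Decl;

/* a symbolic reference, as embedded in a load/store/call op */
typedef struct SymRef {
    const char *name;
    Decl *resolved;              /* cached after the first lookup */
} SymRef;

static Decl decls[] = { { "a", "int", 2 }, { "b", "int", 3 } };

static Decl *resolve(SymRef *ref) {
    size_t i;
    if (!ref->resolved)          /* lazy: first use links the ref */
        for (i = 0; i < sizeof decls / sizeof decls[0]; i++)
            if (!strcmp(decls[i].name, ref->name))
                ref->resolved = &decls[i];
    return ref->resolved;
}

int main(void) {
    SymRef ra = { "a", NULL }, rb = { "b", NULL };
    printf("%d\n", resolve(&ra)->value + resolve(&rb)->value); /* 5 */
    return 0;
}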
one of my attempts (a VM which never got fully implemented) would have
pre-resolved all scoping (as in the JVM or .NET), but it ran into
problems WRT a complex scoping model; I have not generally done this.
my current VM only does so for the lexical scope, which is treated
conceptually as a stack:
all variable declarations are "pushed" to the lexical environment, and
"popped" when a given frame exits;
technically, function arguments are pushed in left-to-right order,
meaning that (counter-intuitively) their index numbers are the reverse
of their argument position;
unlike in JBC or MSIL, the index does not directly reference a
variable's declaration, merely its relative stack position; hence the
declaration also has to be "inferred" from the index;
note that it being (conceptually) a stack also does not imply it is
physically also represented as a stack.
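a minimal sketch (hypothetical names) of such a lexical-environment
stack, showing the reversed argument indices:

#include <stdio.h>
#include <string.h>

#define ENV_MAX 256

static const char *env[ENV_MAX];   /* declared names, as a stack */
static int env_top = 0;

static void env_push(const char *name) { env[env_top++] = name; }
static void env_pop(int n)             { env_top -= n; }

/* resolve a name to its relative stack index (0 = most recent
 * declaration); the declaration itself must be inferred from it */
static int env_index(const char *name) {
    int i;
    for (i = env_top - 1; i >= 0; i--)
        if (!strcmp(env[i], name))
            return env_top - 1 - i;
    return -1;
}

int main(void) {
    /* entering a call to f(x, y, z): args pushed left-to-right */
    env_push("x"); env_push("y"); env_push("z");
    printf("x=%d y=%d z=%d\n",
           env_index("x"), env_index("y"), env_index("z"));
    /* prints "x=2 y=1 z=0": indices are the reverse of position */
    env_pop(3);
    return 0;
}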
hence, in the above case, the bytecode is not too far removed from the
source code.
I guess one can argue that, as one moves up the abstraction ladder, the
amount of work needed to build the VM becomes larger (it deals with far
more semantic issues, and is arguably more specific to the particular
languages in use, ...).
I suspect it is much less clear-cut than this though; for example,
targeting a dynamic language (such as Scheme or JavaScript) to a VM
such as LLVM or JBC (pre JDK7) essentially requires implementing much
of the VM within the VM, and may ultimately reduce how effectively the
host VM can optimize the code (rather than merely dealing with a
construct directly, the host VM now also has to deal with how the guest
VM's constructs were implemented on top of it).
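to illustrate (a rough sketch, not any specific implementation): a
dynamic language compiled to a statically-typed VM typically ends up
re-introducing its own tagged value representation on top of the host's
types, turning even "a+b" into a runtime dispatch the host can no
longer see through:

#include <stdio.h>

typedef enum { T_FIXNUM, T_FLONUM, T_STRING } Tag;

/* the guest VM's "everything is a tagged box" representation */
typedef struct Value {
    Tag tag;
    union { long fix; double flo; const char *str; } u;
} Value;

static Value dyn_add(Value a, Value b) {
    Value r;
    if (a.tag == T_FIXNUM && b.tag == T_FIXNUM) {
        r.tag = T_FIXNUM; r.u.fix = a.u.fix + b.u.fix; return r;
    }
    if (a.tag == T_FLONUM && b.tag == T_FLONUM) {
        r.tag = T_FLONUM; r.u.flo = a.u.flo + b.u.flo; return r;
    }
    /* ... mixed-type and string cases elided ... */
    r.tag = T_FIXNUM; r.u.fix = 0; return r;
}

int main(void) {
    Value a = { T_FIXNUM, { 2 } }, b = { T_FIXNUM, { 3 } };
    printf("%ld\n", dyn_add(a, b).u.fix);   /* prints 5 */
    return 0;
}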
a secondary issue is when the restrictions of such a VM (particularly
the JVM) impede what can be effectively expressed within the VM, running
counter to the notion that higher abstraction necessarily equates to
greater semantic restrictions.
the few cases I can think of where the argument does make a difference
include:
the behavior of variable scoping (mostly moot for the JVM, which pretty
much hard-codes this);
the effects of declaration modifiers (moot regarding the JVM and .NET,
which manage modifiers internally);
the "shape" of the type-system and numeric tower (likewise mostly moot:
although neither the JVM nor .NET "enforces" a particular type-system,
neither gives much room for doing it much differently; likewise, in
LLVM and ASM one is confined to whatever the HW provides);
the behavior of specific operators as applied to specific types. this
is arguably a merit of the JVM and .NET vs my own VMs: since both VMs
only perform operations directly against primitive types, the behavior
of mixed-type cases is de-facto left to the language and compiler. this
may ultimately be a moot point, as manual type-coercion or
scope-qualified operator overloading could achieve the same ends.
similarly, a high-level VM could also (simply) discard the notion of
built-in/hard-coded operator+type semantics, and instead expect the
compiled code to either overload operators or import a namespace
containing the desired semantics (say, built-in or library-supplied
overloaded operators). moreover, unlike the JVM and .NET strategies,
this does not mandate static typing (prior to emitting bytecode) in
order to achieve language-specific type-semantics.
in the above case (operators being a result of an implicit import), if
Language-A disallows "string+int", Language-B interprets it as "append
the string(a) with int::toString(b)", and Language-C as "offset the
string by int chars", well then, the languages can each do so without
interfering with the others.
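a rough sketch (in C, all names hypothetical) of such per-namespace
operator handlers: the "+" opcode consults whatever handlers the
current namespace has imported, so each language's choice stays local
to it:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct Value { const char *type; void *data; } Value;
typedef Value (*OpFn)(Value a, Value b);

typedef struct OpEntry {
    const char *op, *lhs, *rhs;
    OpFn fn;
    struct OpEntry *next;
} OpEntry;

/* each language (or namespace) carries its own handler chain */
typedef struct Namespace { OpEntry *ops; } Namespace;

static void ns_bind(Namespace *ns, const char *op,
                    const char *lhs, const char *rhs, OpFn fn) {
    OpEntry *e = malloc(sizeof *e);
    e->op = op; e->lhs = lhs; e->rhs = rhs; e->fn = fn;
    e->next = ns->ops; ns->ops = e;
}

static Value vm_binop(Namespace *ns, const char *op, Value a, Value b) {
    OpEntry *e;
    for (e = ns->ops; e; e = e->next)
        if (!strcmp(e->op, op) && !strcmp(e->lhs, a.type) &&
            !strcmp(e->rhs, b.type))
            return e->fn(a, b);
    fprintf(stderr, "no handler for %s %s %s\n", a.type, op, b.type);
    exit(1);
}

/* Language-B's "string+int": append int::toString(b) to the string
 * (Language-A would simply never bind this handler) */
static Value strcat_int(Value a, Value b) {
    char *buf = malloc(64);
    snprintf(buf, 64, "%s%d", (const char *)a.data, *(int *)b.data);
    return (Value){ "string", buf };
}

int main(void) {
    Namespace lang_b = { 0 };
    int three = 3;
    Value a = { "string", (void *)"str" }, b = { "int", &three }, r;
    ns_bind(&lang_b, "+", "string", "int", strcat_int);
    r = vm_binop(&lang_b, "+", a, b);
    printf("%s\n", (char *)r.data);   /* prints "str3" */
    return 0;
}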
...
or, in effect, I see fairly little compelling reason (apart from
simplicity of the VM runtime) why a lower-level VM representation would
necessarily be preferable to a higher-level one.
one may almost as well just make a VM represent a distilled-down
version of C++-style semantics (probably with some extensions and
omissions, and represented as bytecode), and make the requirement for
other languages be "figure out how to compile your code to work with
C++-like semantics..."; this is in contrast to the alternative trend,
which is to make VMs which look (more or less) like brain-damaged
versions of assembler.
so, probable features of such a VM architecture:
probably bytecode, and probably a stack machine (any "good" reason to do
otherwise?);
probably type-inferred values;
types are not declared at their point of use (they are declared at their
point of declaration);
avoidance of explicitly generated type coercions (this will be left to
the VM);
avoidance of manual handling of method dispatch (again, left to the VM;
the bytecode merely gives the VM its argument list on the stack, likely
using a "mark", as in the sketch after this list);
operators are mapped fairly directly;
built-in notions for: properties, operator overloading, typedefs,
delegates/function-pointers, scope-delegation (can be used for
Prototype-OO, implementing namespaces/import, ...), ...
...
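a minimal sketch (hypothetical opcodes) of the "mark" convention
mentioned above: the caller pushes a mark and then the arguments, and
the VM recovers the argument list by scanning down to the mark, so the
bytecode never spells out arity or dispatch details:

#include <stdio.h>

#define MARK ((void *)0)   /* assume NULL never appears as a value */

static void *stack[64];
static int sp = 0;

static void op_mark(void)    { stack[sp++] = MARK; }
static void op_push(void *v) { stack[sp++] = v; }

/* VM-side dispatch: pop back to the mark to recover the arg list */
static void op_call(const char *method) {
    int base = sp;
    while (stack[base - 1] != MARK) base--;
    printf("dispatch %s with %d args\n", method, sp - base);
    sp = base - 1;           /* pop the args and the mark */
}

int main(void) {
    int x = 1, y = 2;
    op_mark(); op_push(&x); op_push(&y);
    op_call("foo");          /* prints "dispatch foo with 2 args" */
    return 0;
}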
so, in such a case, "a+b" always compiles to, say, "load a; load b;
add;", regardless of the types in use, and probably even of whether or
not the operator takes its arguments by-reference (although this would
imply that all "loads" are "load-by-reference" rather than the more
conventional "load-by-value").
note: one could canonize the use of "load-by-reference" semantics as
well, namely that the stack is not a stack of values, but rather a stack
of value-references. say, when an addition operator is called on a pair
of references, it results in a 3rd reference (to "somewhere") which in
turn holds the value (a real VM would likely, however, optimize away
most such needless references).
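a minimal sketch (hypothetical opcodes, ints only for brevity) of such
reference-stack semantics: "load" pushes a reference to the variable,
and "add" produces a fresh reference to a result cell:

#include <stdio.h>
#include <stdlib.h>

typedef struct Value { int i; } Value;

static Value *stack[64];   /* a stack of value-references, not values */
static int sp = 0;

static void op_load(Value *var) { stack[sp++] = var; }

static void op_add(void) {
    Value *b = stack[--sp], *a = stack[--sp];
    Value *r = malloc(sizeof *r);   /* a 3rd reference, to "somewhere" */
    r->i = a->i + b->i;
    stack[sp++] = r;
}

int main(void) {
    Value a = { 2 }, b = { 3 };
    op_load(&a); op_load(&b); op_add();   /* "load a; load b; add;" */
    printf("%d\n", stack[sp - 1]->i);     /* prints 5 */
    return 0;
}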
granted, not all languages look like C-like languages:
some don't have explicit operators, handling such operations instead via
function or method calls, ...
(I am thinking here of Scheme and Smalltalk).
but this shouldn't be a huge issue: they will map their constructs
instead to whatever they do use, and probably supply the relevant
namespaces implementing their semantics. if done well, this still
shouldn't result in big ugly language-walls.
another key may be to largely separate "interface from implementation"
for many types, so two languages can see the same type but present
different local interfaces for it (the opposite direction from the
"everything is an object" concept, which seeks to assume that "int" is
an implementation of some "Integer" class). instead, a language could
alias operations to the type (as an extension of good-old operator
overloading), whereby things like method calls on an integer can also
be intercepted/overloaded, essentially allowing the language itself to
implement its mapping between "int" and "some class named Integer".
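a rough sketch (hypothetical, ints only) of such per-language "views":
both languages share the same raw int, but a method call on it
dispatches through whatever interface the calling language aliased onto
the type:

#include <stdio.h>

/* a language's local interface ("view") for the shared int type */
typedef struct IntView {
    void (*to_string)(int self, char *buf, int n);
} IntView;

/* Language-A aliases int.toString() to decimal... */
static void a_to_string(int self, char *buf, int n) {
    snprintf(buf, n, "%d", self);
}

/* ...while Language-B views the very same int as, say, hex */
static void b_to_string(int self, char *buf, int n) {
    snprintf(buf, n, "0x%x", self);
}

static const IntView lang_a = { a_to_string };
static const IntView lang_b = { b_to_string };

/* the VM dispatches a method call on an int via the caller's view */
static void call_to_string(const IntView *view, int x) {
    char buf[32];
    view->to_string(x, buf, sizeof buf);
    printf("%s\n", buf);
}

int main(void) {
    call_to_string(&lang_a, 255);   /* prints "255"  */
    call_to_string(&lang_b, 255);   /* prints "0xff" */
    return 0;
}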
another possible formalization would be, of course, that there are two
such "classes", and all operations are implicitly method calls into one
of them (the language's "view" of the type), with the second class
representing "the type as seen aliased to a class". of course, for this
to work would require a notion of "class" somewhat different from the
"traditional" OO notion (a single-point definition which may only be
extended via overloading), instead conceiving of it as "the aggregation
of all operations visible for the class type" (essentially, the
"methods" would be conceptually closer to "named overloaded operators"
than to traditional "virtual class methods").
I think it would be easier just to have free-standing "named overloaded
operators accepting any number of arguments", which mostly boils down
to "overloaded functions" (operator overloading is, in effect, merely a
special case of a plain overloaded function). (the "add" opcode in such
a case could then be seen as shorthand for a call to such a function.)
admittedly, this is basically how I had implemented operator
overloading anyways (partly; my VM is not nearly so generic, so many
operators are hard-coded, and most operator handlers are "global").
(I could write more, but I am getting burnt out on writing this at the
moment...).
or such...