[fonc] ok, thoughts: VM design...

BGB Tue, 02 Dec 2008 23:19:30 -0800

well, in writing this I will at least try to organize my thoughts, and speak more of "general ideas" than "implementation details" (granted, I am far more inclined towards the latter than the former).

I will allow that probably many of those here will strongly disagree with all I have to say, but this is acceptable to me.

but, yes, I have before made this observation about the design of many things, and this is especially true of many programming languages and VMs: the design is usually very centralized, and often to the exclusion of "everything else".



The Issue of Languages

so, a person will design a language, and they will apparently think to themselves, "well, ok, since I am designing this language, I will throw away all of these 'arcane' details common in other languages".

now, of course, no one can really agree what they want to throw away first. some will throw away the syntax ("don't need any of that C style crap"), others will throw away assignment, and yet others will throw away the ability to do anything with said language.

so, they seek out some clean and elegant design, and produce something that usually does not expand too far outside of some niche.

now, granted, this is not always the case. for example, many people adopted Java and the JVM, even though it has many of the traits I will be objecting to. however, I will also note that modern Java and the modern JVM have essentially gone in a direction away from all this, even though many of these aspects have been retained in the core design.

it can also be noted that, as far as Java goes, it was one of the more conservative of these languages (it had a relatively conventional syntax and featureset, ...), and the VM had an overall fairly practical design (AKA: has a much lower performance and memory overhead than many VMs).



Syntax and Generalities

so, in terms of language design, I will advocate the goal of being as conservative as is reasonable, and also appealing to as wide a community as is reasonable.

so, for an end-user language, this almost demands an imperative OO language with C-style syntax (AKA: a typical user of C/C++/Java/C#/JS/..., if presented with some code, should not be left asking themselves "ok, now just what the hell am I looking at?...").

and, should they bother to learn to use the language, they should not end up "walking on broken glass", as they discover bizarre behaviors and restrictions which the designer had found "elegant".

now, of course, this does not mean I oppose adding "new" functionality to a language, only that it should be done in a way hopefully causing minimal impact to more established conventions and practices.

so, for example, a C-family language can safely add many Functional features, such as closures, tail-call optimization, tail-expression return values, ... provided they don't screw up one's ability to do "ordinary" things.



as an example, consider if it were acceptable to write:

function fib(x) if(x>2)fib(x-1)+fib(x-2) else 1;

but, at the same time, one could also write:
function fib(x)
{
   if(x<=2)return 1;
   return fib(x-1)+fib(x-2);
}

now, this does not mean every language has to use C style syntax, but the big question is one of what the language is to do. for example, most people will neither notice nor care if the compiler IR or some specialized micro-language has some obscure syntax and semantics filled with lots of sharp edges, only that one in this case can't expect the "community at large" to accept this language.

ok, and it is also worth noting that as complicated as the syntax of most C-family languages can be to parse, this is by far one of the simplest parts of the whole process.

granted, yes, a trivial dynamically-typed bytecode interpreter is fairly easy to write vs a C-style parser, but for a statically-typed compiler and native-code output, the process is by far more complicated.

consider, one writes, say, 5 kloc for the parser, but then ends up writing another 20 or 30 for the lower compiler stages and native codegen...



Dynamics and the Type-System

now, one can argue, "well then, why not just use dynamic types and bytecode interpreters?..."

however, this misses that there are some rather notable good points to static typing, which are usually considered to outweigh the costs of implementation:

static verification, more room for optimization, ...

now, this is not meant to say that I think everyone should give up dynamic languages in favor of good old statically-typed batch-compiled languages, only that the advantages of static typing (even if used in the context of optional annotations for an otherwise dynamic scripting language) will tend to outweigh the overt complexity added to the implementation.

I guess the point to make here is that many of us who "seriously" write code are more concerned with things working well than about minimizing the total lines of code that need to be written or the theoretical complexity of the entire system.

not that any of us like wasting bunches of extra effort on things that won't pay off, but it is usually more useful IME to have an effective component, rather than a simple or elegant component.

and, often, we like it more when our compilers will catch and report some obvious and stupid error (such as messing up the number of arguments, accidentally passing the wrong value, ...) than having to wait until some later time when such errors might cause periodic runtime typecheck failures ("BARF: foo/bar.scr:69: typecheck failed, method draw/2 not understood for type CONS").

this is especially true considering that many times, any exceptions are not caught, and usually the VM kills the entire app.

with full static verification, even if many parts of the app are dynamic, many possible bugs and potentially serious errors can be detected and reported up front, rather than going unnoticed as they would if these sorts of checks were not done, in the long run saving the programmer time, and greatly improving the app's quality and end-user experience (hint: messages like the above tend to be a near constant annoyance to users of apps written in a certain popular language, many of which could likely have been avoided had the compiler been able to do many obvious checks).

this does not mean, however, that I feel dynamic types are useless. rather, there are many cases where this kind of flexibility can be damn useful, only that, in general, I would far rather see wide use of dynamic capabilities in an otherwise static language, than the use of type-inference in a dynamic language (type inference, though often doing a good job at optimizing, does a poor job at catching programmer errors).
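to make "dynamic capabilities in an otherwise static language" a bit more concrete, here is a minimal sketch in C. the names (`dyn_value`, `dyn_to_double`, `scale`) are made up purely for illustration and are not taken from any particular VM. the idea is that only the one dynamic value needs a runtime check, while everything around it stays statically checked.

```c
#include <assert.h>
#include <stddef.h>

/* hypothetical tagged value: the only dynamically-typed thing in this sketch */
typedef enum { DYN_INT, DYN_REAL, DYN_STRING } dyn_tag;

typedef struct {
    dyn_tag tag;
    union {
        long        i;
        double      r;
        const char *s;
    } u;
} dyn_value;

/* constructors for the three supported dynamic types */
dyn_value dyn_int(long i)
{
    dyn_value v;
    v.tag = DYN_INT;
    v.u.i = i;
    return v;
}

dyn_value dyn_real(double r)
{
    dyn_value v;
    v.tag = DYN_REAL;
    v.u.r = r;
    return v;
}

dyn_value dyn_string(const char *s)
{
    dyn_value v;
    assert(s != NULL);
    v.tag = DYN_STRING;
    v.u.s = s;
    return v;
}

/* runtime-checked conversion: returns 1 and stores the number on success,
   returns 0 (leaving *out untouched) if the value is not numeric */
int dyn_to_double(dyn_value v, double *out)
{
    assert(out != NULL);
    switch (v.tag) {
    case DYN_INT:  *out = (double)v.u.i; return 1;
    case DYN_REAL: *out = v.u.r;         return 1;
    default:       return 0;
    }
}

/* ordinary statically-typed code: a call with the wrong number of
   arguments, or with a string in place of a double, is rejected by
   the compiler rather than at runtime */
double scale(double x, double factor)
{
    return x * factor;
}
```

a caller would write something like `if (dyn_to_double(v, &d)) result = scale(d, 2.0);` and handle the failure branch explicitly, so the only place a typecheck can fail at runtime is the one spot where a dynamic value crosses into static code.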

more so, if such a language does offer dynamic types, often the compiler proves more capable of figuring out what the types are (since they are usually much more firmly anchored to established static types), and so can more capably report on probable violations of type safety.

this is in part due to the fact that, most of the time, what is really needed is static typing, and dynamic typing is needed in a sufficiently minor number of cases that I don't really believe it is justified as the primary type model of a language.

yes, many of us really are apes on the keyboard, typing periodic and obviously stupid crap that we need the compiler's help to notice.


so, C, C++, Java, and C# all do a reasonably good job here IMO.


Prototypes and Class/Instance

by a similar token, I will extend this argument to Class/Instance and Prototype OO, where even if Class/Instance is sometimes awkward and can't do some things Prototype OO does easily, both the performance and verifiability make C/I, in general, the more appropriate option.

granted, yes, the capabilities of prototypes are not something to be discounted, but personally I would rather see these exist in terms of the good old statically-typed class/instance model, than have to forsake both for these capabilities.

as a simple example, in the object system under development for my VM, C/I is the primary model, however, both the dynamic addition of slots, and also object delegation, are being made available as optional features (granted that, yes, the way I am implementing things is a little weird in some cases).
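for readers who want a picture of what such a hybrid could look like, here is a rough sketch in C. it is not the object system described above (whose details are not given here), and every name in it (`class_info`, `object`, `object_get`, `object_set`, the slot limits) is an assumption made for illustration. class-defined slots live in a fixed array, slots added at runtime live in a small side table, and lookups that miss fall through to an optional delegate.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FIXED_SLOTS 4   /* illustrative limits only */
#define MAX_DYN_SLOTS   8

/* class part: slot names and count are fixed when the class is defined */
typedef struct {
    const char        *name;
    size_t             n_fixed;       /* must be <= MAX_FIXED_SLOTS */
    const char *const *fixed_names;
} class_info;

/* a slot added to one particular object at runtime */
typedef struct {
    const char *name;
    int         value;
} dyn_slot;

typedef struct object object;
struct object {
    const class_info *cls;
    int               fixed[MAX_FIXED_SLOTS];  /* class-defined slots */
    dyn_slot          dyn[MAX_DYN_SLOTS];      /* dynamically added slots */
    size_t            n_dyn;
    object           *delegate;                /* optional, may be NULL */
};

/* look up a slot: fixed slots first, then dynamic slots, then the
   delegate chain; returns 1 and stores the value if found, else 0 */
int object_get(const object *obj, const char *name, int *out)
{
    size_t i;
    assert(name != NULL && out != NULL);
    for (; obj != NULL; obj = obj->delegate) {
        for (i = 0; i < obj->cls->n_fixed; i++) {
            if (strcmp(obj->cls->fixed_names[i], name) == 0) {
                *out = obj->fixed[i];
                return 1;
            }
        }
        for (i = 0; i < obj->n_dyn; i++) {
            if (strcmp(obj->dyn[i].name, name) == 0) {
                *out = obj->dyn[i].value;
                return 1;
            }
        }
    }
    return 0;
}

/* assign a slot on this object only (never on the delegate); a name the
   class does not define becomes a dynamic slot; returns 0 if the
   dynamic table is full */
int object_set(object *obj, const char *name, int value)
{
    size_t i;
    assert(obj != NULL && name != NULL);
    for (i = 0; i < obj->cls->n_fixed; i++) {
        if (strcmp(obj->cls->fixed_names[i], name) == 0) {
            obj->fixed[i] = value;
            return 1;
        }
    }
    for (i = 0; i < obj->n_dyn; i++) {
        if (strcmp(obj->dyn[i].name, name) == 0) {
            obj->dyn[i].value = value;
            return 1;
        }
    }
    if (obj->n_dyn == MAX_DYN_SLOTS)
        return 0;
    obj->dyn[obj->n_dyn].name = name;
    obj->dyn[obj->n_dyn].value = value;
    obj->n_dyn++;
    return 1;
}
```

in this arrangement a compiler that knows the class can still resolve fixed slots to plain array offsets, and only the dynamic and delegated cases pay for a name lookup, which is roughly the trade-off argued for above.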

I could attempt to elaborate more on the design and semantics of my hybrid model if anyone cares.



Language World

a second area of concern, and also something I feel is actually a bigger issue than that of the exact design of the language itself, is how easily it integrates with the outside world.

it is very common in my experience for languages and VMs to be designed with the assumption that the "entire world", as far as the developer is concerned, is located inside the language and VM.

as a result, many of these languages are designed such that they don't actually import or interface with the outside world in anything like a direct manner.

instead, the idea is that everything to be accessed from the language/VM either has to be written in the language in question, or be systematically wrapped in order to be visible to the VM.

another consequence is that mixed-language codebases are usually exceedingly awkward (very often requiring large amounts of horrible looking and brittle boilerplate), and many such VMs are, by definition, often rather limited in terms of the available capabilities (notable by just how often and dramatically many new VM efforts announce the newly-gained ability to access OpenGL or GTK, and very often with an interface which is horribly strained or mutilated due to the language's inability to faithfully match the way the outside "C world" does things...).
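as an illustration of the kind of boilerplate meant here, the following is a small sketch in C of hand-written wrapper glue. the VM interface shown (`vm_value`, `vm_native_fn`, `vm_binding`) is hypothetical and invented for this example; real VMs differ in the details, but the general shape (unpack arguments, check types, call the C function, repack the result, register the wrapper by name) is typical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* hypothetical boxed value used by an imaginary VM */
typedef enum { VM_INT, VM_STR } vm_tag;

typedef struct {
    vm_tag tag;
    union {
        long        i;
        const char *s;
    } u;
} vm_value;

/* the single calling convention this imaginary VM understands for
   native code: boxed arguments in, boxed result out, 0 on success */
typedef int (*vm_native_fn)(const vm_value *args, int nargs, vm_value *result);

/* hand-written glue needed just to expose strlen() to scripts */
int wrap_strlen(const vm_value *args, int nargs, vm_value *result)
{
    assert(args != NULL && result != NULL);
    if (nargs != 1 || args[0].tag != VM_STR)
        return -1;                        /* wrong arity or wrong type */
    result->tag = VM_INT;
    result->u.i = (long)strlen(args[0].u.s);
    return 0;
}

/* registration table: one entry, and one wrapper like the above,
   for every C function that scripts should be able to see */
typedef struct {
    const char  *name;
    vm_native_fn fn;
} vm_binding;

const vm_binding vm_bindings[] = {
    { "strlen", wrap_strlen },
};
```

one C function costs about ten lines of glue here. multiply that by the size of something like OpenGL, add structs, callbacks and pointers, and the appeal of a VM that can read the C declarations directly becomes fairly clear.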

the result is that many such VMs become a kind of "ivory tower", each representing its own entire world and landscape, both structurally and syntactically isolated from the outside world, and where the choice of language also becomes one of associated code-world.

much effort is wasted, as each major VM/language implementor feels the need to rebuild the world from scratch in their own "new and innovative" language.

often, in an extreme form of this idea, the developers feel compelled to make an implementation of the VM and compiler itself from within this language, most often with this amounting to little more than an academic curiosity (the version of the VM that everyone continues using typically having been written in C or C++).

of all the major VMs, nearly every one I am aware of has these problems to a greater or lesser extent.

yes, all this is typically inherent with implementing a VM, but I personally believe that this is one of those things that CAN and SHOULD be battled within the implementation of a VM framework.

we don't need to make a few of the VM's capabilities available by writing the VM in itself. rather, I will assert, it is better to make the VM sufficiently capable of interfacing with the outside world that it is also capable of interfacing with its own implementation (even if it so happens that the whole thing was written in, GASP, C or C++...).

from what I can tell, .NET is one of the few VMs which has addressed these issues in a half-decent manner (though, IMO, .NET is still a long ways from perfect in these regards...).



My Effort

this is actually one of the core areas I battle with in my effort, where I am not so much trying to make a new world on top of the old one, so much as I am trying to leverage the existing world as much as possible, while still pushing forwards with high-level abstractions and dynamic capabilities.

this is also a major reason why I had chosen as my main language, first and foremost, to implement a subset/superset of C99 (a few features were left out for various reasons, but the language is sufficiently implemented that most C code should work without problems).

I also added a few compiler extensions for features which seemed like they could be rather nifty to have, but in general (due to concern over maintaining freedom to compile code with either my compiler or gcc), I have not made as much use of these features as I would have thought. since then, more effort has been put into APIs (with a possible optimization route via intrinsics), than into syntactic extensions (note that these extensions have usually been added within the conformance guidelines set forth in the standards).

as-is, the inner-world of dynamically-compiled C interfaces relatively seamlessly with the outside world, and so, FWIW, the VM is written in its own language.

as it also so happens, most of the language's runtime features are also readily accessible from good old statically compiled and linked code.


the main limitation of C, however, is with C itself:

compiling C code tends to be neither the fastest nor the most conservative with memory, due in large part to the rather large amount of stuff typically included from various headers.

object caching is a partial fix to this problem, but does not address the whole range of issues, and it would be nice to be able to compile scripts without causing a notable delay and typically the need to run the GC one or more times...

likewise, precompiled headers are unlikely to be an adequate solution, since I am unaware of any real way to do this effectively apart from potentially violating C's semantics (transforming the header-inclusion system more into a kind of module-import system) and also requiring notable changes to how I approach compiling C code (making the middle and lower stages of the compiler aware of such a module system).

eventually though, such changes may be useful and necessary (especially if I were to support such notions as having multiple runnable C-based "applications" on top of the same framework, say with each having its own partially-independent scope and world state, never mind any obvious multiplexing issues and likely introducing many of the same kinds of problems that plagued Windows 9x...).

so, for these reasons, both Java and JavaScript have been considered as languages worth implementing, with the desired goal of being able to implement both while keeping the "wall of impedance" as small as possible (even granted that C, Java, and JS naturally operate in very different worlds), and also I would like to do all this absent having to write lots of ugly boilerplate (JNI or otherwise) to make everything play well together.

recently, this issue has been one of the greater areas of concern, and has led to the development of an object system which I am trying to make readily useful to all 3 languages: in C via a clean and hopefully not-too-awkward API, and in the others by handling the native semantics of the languages, as well as trying to address the issue of making code and data in one at least presumably accessible from the others, without too much additional work.

granted, I have implemented JNI, since it would be a good deal useful for trying to import an existing classpath library (either GNU Classpath or the one from Apache, but I have yet to decide conclusively at this point).

initially in my effort I have focused exclusively on generating native code (actually, this is backwards from most VMs, which generate bytecode first and only later fall back to native code for the sake of optimization), and although native code offers some fairly strong incentives, it is by no means free of issues. namely, it is generally expensive to produce (at least from C source...), and is (in my implementation) impervious to garbage collection (garbage collecting the C toplevel and executable code is a notably more hairy issue than that of the heap).

so, in my implementation, with a few rare exceptions (specially crafted executable thunks located within heap-based memory), dynamically compiled or linked-in code (such as static libraries) will not be garbage collected...

so, for these reasons I had decided (actually a few months ago now, but I don't have much time for coding as of late, so it is slow) to implement a bytecoded VM.

I had decided to make use of Java-ByteCode for the VM for various reasons (namely, that there are existing compilers that target it, and it is a fairly "de-facto" bytecode format), but of course this means implementing much of a JVM.

note that I decided actually to write my own JVM, rather than re-using an existing one, in the hope of not creating yet another JVM which would create its own "Java Island...", and also because I might want to target other things to it (for example, C).

so, the JVM is actually more of a template for a VM than the entire design (my plans are actually to support a modified version of the bytecode, which adds many features I feel necessary to "adequately" support languages such as JS and C).

granted, yes, some C compilers do target the JVM, but they do the even less useful thing of creating a new C island inside the JVM, whereas I would want bytecoded scripts to be able to freely access C-land, which requires more than a few extensions to be practical.

granted... yes... the JVM classpath does have a few classes which have the needed functionality, but to implement C in such a way would lead to pitifully horrible performance.

my goal is, however, to retain backwards compatibility with more conventional versions of the JVM, if anything so that hopefully I can make use of compilers like Eclipse and not have to needlessly write my own Java compiler as well (granted... I am not certain at this point if Eclipse can be used as a dynamic script compiler, having not looked much into it, but it would make sense given the existing JVM supports this capability...).

it also remains as a possibility to include features for running AVM2 bytecode as well (either via recompiling the bytecode or via separate loaders and a separate interpreter loop).

note that the JVM proper (and also likely AVM2 support) would be a small portion of the total project complexity (since, for the most part, everything is being based on already existing functionality from within the framework, and what parts I am implementing for the JVM should also apply without too much effort to the AVM, me having noticed more than a few likely exploitable similarities between the VMs...).

all this is not likely to be too much of a problem, since, after all, unlike most VM frameworks mine is highly decentralized (most of the components exist as separate libraries, and many have little or no direct interaction with the others and are designed to be replaceable with another component so long as it serves a similar role/produces similar output/accepts similar input/...).

there are actually several major components within all this that communicate primarily via opaque representations (such as a specialized textual representation or XML).
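the actual representations used between these components are not described here, so the following is only a toy sketch in C of the general idea, with a made-up one-line text format and hypothetical function names (`emit_op`, `eval_op`). the producer and the consumer share nothing except the text, so either side could be swapped for another component that speaks the same format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* producer side: writes one operation as a line of text, e.g. "add 2 3\n";
   returns 1 on success, 0 if the buffer was too small */
int emit_op(char *buf, size_t size, const char *op, int a, int b)
{
    int n;
    assert(buf != NULL && op != NULL);
    n = snprintf(buf, size, "%s %d %d\n", op, a, b);
    return n >= 0 && (size_t)n < size;
}

/* consumer side: parses one line and evaluates it; knows the text
   format but nothing about who produced it; returns 1 on success,
   0 on a malformed line or an unknown operation */
int eval_op(const char *text, int *out)
{
    char op[16];
    int a, b;
    assert(text != NULL && out != NULL);
    if (sscanf(text, "%15s %d %d", op, &a, &b) != 3)
        return 0;
    if (strcmp(op, "add") == 0) {
        *out = a + b;
        return 1;
    }
    if (strcmp(op, "mul") == 0) {
        *out = a * b;
        return 1;
    }
    return 0;
}
```

the cost of this style is the extra printing and parsing at each boundary; the benefit is that a component can be tested, replaced, or even run in a different process using nothing more than its input and output text.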


or such...



_______________________________________________
fonc mailing list
fonc@vpri.org
http://vpri.org/mailman/listinfo/fonc
