well, in writing this I will at least try to organize my thoughts, and speak more of "general ideas" than "implementation details" (granted, I am far more inclined towards the latter than the former).

I will allow that probably many of those here will strongly disagree with all I have to say, but this is acceptable to me.


but, yes, I have before made this observation about the design of many things, and this is especially true of many programming languages and VMs: the design is usually very centralized, and often to the exclusion of "everything else".


The Issue of Languages

so, a person will design a language, and they will apparently think to themselves, "well, ok, since I am designing this language, I will throw away all of these 'arcane' details common in other languages".

now, of course, no one can really agree what they want to throw away first. some will throw away the syntax ("don't need any of that C style crap"), others will throw away assignment, and yet others will throw away the ability to do anything with said language.

so, they seek out some clean and elegant design, and produce something that usually does not expand too far outside of some niche.


now, granted, this is not always the case. for example, many people adopted Java and the JVM, even though it has many of the traits I will be objecting to. however, I will also note that modern Java and the modern JVM have essentially gone in a direction away from all this, even though many of these aspects have been retained in the core design.

it can also be noted that Java was one of the more conservative of these languages (it had a relatively conventional syntax and feature set, ...), and the VM had an overall fairly practical design (i.e., it has a much lower performance and memory overhead than many VMs).


Syntax and Generalities

so, in terms of language design, I will advocate the goal of being as conservative as is reasonable, and also appealing to the widest community as is reasonable.

so, for an end-user language, this almost demands an imperative OO language with C-style syntax (i.e., a typical user of C/C++/Java/C#/JS/..., if presented with some code, should not be left asking themselves "ok, now just what the hell am I looking at?...").

and, should they bother to learn to use the language, they should not end up "walking on broken glass" as they discover bizarre behaviors and restrictions which the designer had found "elegant".

now, of course, this does not mean I oppose adding "new" functionality to a language, only that it should be done in a way hopefully causing minimal impact to more established conventions and practices.

so, for example, a C-family language can safely add many functional features, such as closures, tail-call optimization, tail-expression return values, ... provided they don't screw up one's ability to do "ordinary" things.


as an example, consider if it were acceptable to write:

function fib(x) if(x>2)fib(x-1)+fib(x-2) else 1;

but, at the same time, one could also write:

function fib(x)
{
   if(x<=2)return 1;
   return fib(x-1)+fib(x-2);
}


now, this does not mean every language has to use C-style syntax; the big question is what the language is meant to do. for example, most people will neither notice nor care if a compiler IR or some specialized micro-language has an obscure syntax and semantics filled with lots of sharp edges. it is only that, in this case, one can't expect the "community at large" to accept the language.


ok, and it is also worth noting that, as complicated as the syntax of most C-family languages can be to parse, parsing is by far one of the simplest parts of the whole process.

granted, yes, a trivial dynamically-typed bytecode interpreter is fairly easy to write compared with a C-style parser, but for a statically-typed compiler with native-code output, the rest of the process is far more complicated than the parser.

consider: one writes, say, 5 kloc for the parser, but then ends up writing another 20 or 30 kloc for the lower compiler stages and native codegen...


Dynamics and the Type-System

now, one can argue, "well then, why not just use dynamic types and bytecode interpreters?..." however, this misses that there are some rather notable good points to static typing, which are usually considered to outweigh the costs of implementation:
static verification, more room for optimization, ...

now, this is not meant to say that I think everyone should give up dynamic languages in favor of good old statically-typed batch-compiled languages, only that the advantages of static typing (even if used in the context of optional annotations for an otherwise dynamic scripting language) will tend to outweigh the complexity added to the implementation.

I guess the point to make here is that many of us who "seriously" write code are more concerned with things working well than with minimizing the total lines of code that need to be written or the theoretical complexity of the entire system.

not that any of us like wasting a bunch of extra effort on things that won't pay off, but it is usually more useful IME to have an effective component than a simple or elegant one.

and, often, we like it more when our compilers catch and report some obvious and stupid error (such as messing up the number of arguments, accidentally passing the wrong value, ...) than having to wait until some later time when such errors cause periodic runtime typecheck failures ("BARF: foo/bar.scr:69: typecheck failed, method draw/2 not understood for type CONS").

this is especially true considering that, many times, such exceptions are not caught, and the VM usually kills the entire app.

with full static verification, even if many parts of the app are dynamic, many possible bugs and potentially serious errors can be detected and reported up front, rather than at runtime as happens when these sorts of checks are not done. in the long run this saves the programmer time, and greatly improves the app's quality and end-user experience (hint: messages like the above tend to be a near constant annoyance to users of apps written in a certain popular language, many of which could likely have been avoided had the compiler been able to do many obvious checks).
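
to make the contrast concrete, here is a minimal sketch in C (everything in it is made up for illustration: the draw function, the method table, and the dyn_call helper are not from any real VM). a direct call like draw(3) with the wrong number of arguments is rejected by the compiler, while the same mistake made through a by-name dynamic call is only noticed when that line actually runs:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* A statically declared function: calling draw(3) with too few arguments
   is a compile-time error, since the prototype is visible. */
int draw(int x, int y) { return x * 100 + y; }

/* Hypothetical dynamic dispatch: methods are found by name and arity, so a
   wrong argument count can only be detected once the call is executed. */
typedef int (*method_fn)(const int *argv);
struct method { const char *name; int arity; method_fn fn; };

static int draw_thunk(const int *argv) { return draw(argv[0], argv[1]); }

static const struct method methods[] = {
    { "draw", 2, draw_thunk },
};

/* Returns 0 on success, or -1 on a runtime "method not understood" failure. */
int dyn_call(const char *name, int argc, const int *argv, int *result)
{
    size_t i;
    assert(name != NULL && result != NULL);
    for (i = 0; i < sizeof methods / sizeof methods[0]; i++) {
        if (strcmp(methods[i].name, name) == 0 && methods[i].arity == argc) {
            *result = methods[i].fn(argv);
            return 0;
        }
    }
    fprintf(stderr, "typecheck failed: method %s/%d not understood\n", name, argc);
    return -1;
}
```

so dyn_call("draw", 1, ...) compiles without complaint and fails only at runtime, which is exactly the kind of error a static check would have reported up front.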


this does not mean, however, that I feel dynamic types are useless. rather, there are many cases where this kind of flexibility can be damn useful. it is only that, in general, I would far rather see wide use of dynamic capabilities in an otherwise static language than the use of type inference in a dynamic language (type inference, though often doing a good job at optimizing, does a poor job at catching programmer errors).

more so, if a static language does offer dynamic types, the compiler often proves more capable of figuring out what the types are (since they are usually much more firmly anchored to established static types), and so can more capably report probable violations of type safety.

this is in part due to the fact that, most of the time, what is really needed is static typing, and dynamic typing is needed in a sufficiently minor number of cases that I don't really believe it is justified as the primary type model of a language.
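
a rough sketch of what "dynamic capabilities in an otherwise static language" can look like, written in plain C (the dyn_value type and its helpers are hypothetical names invented for this example, not part of my framework). the dynamic type lives in a tag, the runtime check happens at one well-defined boundary, and everything past that boundary is ordinary statically-checked code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical variant type: statically it is always a dyn_value, and the
   dynamic type is carried in the tag. */
enum dyn_tag { DYN_INT, DYN_REAL, DYN_STRING };

struct dyn_value {
    enum dyn_tag tag;
    union { long i; double r; const char *s; } as;
};

struct dyn_value dyn_int(long i)
{ struct dyn_value v; v.tag = DYN_INT; v.as.i = i; return v; }

struct dyn_value dyn_real(double r)
{ struct dyn_value v; v.tag = DYN_REAL; v.as.r = r; return v; }

struct dyn_value dyn_string(const char *s)
{ struct dyn_value v; v.tag = DYN_STRING; v.as.s = s; return v; }

/* Ordinary static code: passing the wrong kind of argument here is a
   compile-time diagnostic, not a runtime surprise. */
double scale(double x, double factor) { return x * factor; }

/* The dynamic/static boundary: the only place a runtime type check is
   needed. Returns false if the value is not numeric. */
bool dyn_to_real(struct dyn_value v, double *out)
{
    assert(out != NULL);
    switch (v.tag) {
    case DYN_INT:  *out = (double)v.as.i; return true;
    case DYN_REAL: *out = v.as.r;         return true;
    default:       return false;
    }
}
```

the point of the design is that the dynamic part stays small and explicit, so the compiler can still verify everything that happens after dyn_to_real succeeds.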


yes, many of us really are apes on the keyboard, typing periodic and obviously stupid crap that we need the compiler's help to notice.

so, C, C++, Java, and C# all do a reasonably good job here IMO.


Prototypes and Class/Instance

by a similar token, I will extend this argument to Class/Instance and Prototype OO, where even if Class/Instance is sometimes awkward and can't do some things Prototype OO does easily, both the performance and verifiability make C/I, in general, the more appropriate option.

granted, yes, the capabilities of prototypes are not something to be discounted, but personally I would rather see these exist within the good old statically-typed class/instance model than have to forsake both static typing and class/instance for these capabilities.

as a simple example, in the object system under development for my VM, C/I is the primary model, however, both the dynamic addition of slots, and also object delegation, are being made available as optional features (granted that, yes, the way I am implementing things is a little weird in some cases).
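
to give a rough idea of what such a hybrid could look like, here is a small sketch in C. note that this is NOT the actual design of my object system; the struct layouts, the fixed-size tables, and all of the names (klass, object, obj_get, obj_set) are assumptions made up to keep the example short. class-declared slots come first, slots can be added at runtime, and lookups that fail fall through to an optional delegate:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FIXED 4   /* slots declared by the class */
#define MAX_DYN   4   /* slots added to one object at runtime */

struct klass {
    const char *name;
    int nfixed;
    const char *fixed_names[MAX_FIXED];
};

struct object {
    const struct klass *cls;
    int fixed[MAX_FIXED];            /* class/instance part: layout known statically */
    int ndyn;
    const char *dyn_names[MAX_DYN];  /* prototype-like part: per-object slots */
    int dyn_values[MAX_DYN];
    struct object *delegate;         /* optional: failed lookups are forwarded here */
};

void obj_init(struct object *o, const struct klass *cls)
{
    assert(o != NULL && cls != NULL);
    memset(o, 0, sizeof *o);
    o->cls = cls;
    o->delegate = NULL;
}

/* Looks for a slot in this object only: class slots first, then dynamic ones. */
static int *obj_own_slot(struct object *o, const char *name)
{
    int i;
    for (i = 0; i < o->cls->nfixed; i++)
        if (strcmp(o->cls->fixed_names[i], name) == 0) return &o->fixed[i];
    for (i = 0; i < o->ndyn; i++)
        if (strcmp(o->dyn_names[i], name) == 0) return &o->dyn_values[i];
    return NULL;
}

/* Reads a slot, walking the delegation chain (assumed to be acyclic).
   Returns 0 on success, -1 if no object in the chain has the slot. */
int obj_get(struct object *o, const char *name, int *out)
{
    for (; o != NULL; o = o->delegate) {
        int *slot = obj_own_slot(o, name);
        if (slot != NULL) { *out = *slot; return 0; }
    }
    return -1;
}

/* Writes a slot on this object, adding a dynamic slot if the class did not
   declare one. Returns -1 if the dynamic slot table is full. */
int obj_set(struct object *o, const char *name, int value)
{
    int *slot = obj_own_slot(o, name);
    if (slot == NULL) {
        if (o->ndyn == MAX_DYN) return -1;
        o->dyn_names[o->ndyn] = name;
        slot = &o->dyn_values[o->ndyn++];
    }
    *slot = value;
    return 0;
}
```

in a real implementation the class-declared slots would be accessed by a statically known offset rather than by name, which is where the performance and verifiability advantages of C/I come from; the by-name path would only be taken for the dynamic features.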

I could attempt to elaborate more on the design and semantics of my hybrid model if anyone cares.


Language World

a second area of concern, and also something I feel is actually a bigger issue than that of the exact design of the language itself, is how easily it integrates with the outside world.

it is very common in my experience for languages and VMs to be designed with the assumption that the "entire world", as far as the developer is concerned, is located inside the language and VM.

as a result, many of these languages are designed such that they don't actually import or interface with the outside world in any direct manner.

instead, the idea is that everything to be accessed from the language/VM either has to be written in the language in question, or be systematically wrapped in order to be visible to the VM.

another consequence is that mixed-language codebases are usually exceedingly awkward (very often requiring large amounts of horrible looking and brittle boilerplate), and many such VMs are, by definition, often rather limited in terms of the available capabilities (notable by just how often and dramatically many new VM efforts announce the newly gained ability to access OpenGL or GTK, very often with an interface which is horribly strained or mutilated due to the language's inability to faithfully match the way the outside "C world" does things...).
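
for those who have not had to write it, this is roughly what the wrapping boilerplate looks like. the sketch below is in C and is entirely hypothetical (vm_value, vm_native, and the binding table are invented for illustration and do not match any particular VM's API), but most embedding interfaces have this general shape: every outside function needs a hand-written wrapper that checks the argument count, checks and unboxes each argument, makes the call, and boxes the result:

```c
#include <assert.h>
#include <stddef.h>

/* The existing "outside world" function we would like scripts to call. */
int clamp_int(int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); }

/* Hypothetical boxed VM value and native-function calling convention. */
enum vm_tag { VM_INT, VM_NIL };
struct vm_value { enum vm_tag tag; int i; };
typedef int (*vm_native)(const struct vm_value *args, int nargs,
                         struct vm_value *ret);

/* Hand-written glue for ONE function; returns 0 on success, -1 on a
   runtime argument error. Every wrapped function needs its own copy. */
int wrap_clamp_int(const struct vm_value *args, int nargs, struct vm_value *ret)
{
    int i;
    assert(ret != NULL);
    if (nargs != 3) return -1;
    for (i = 0; i < 3; i++)
        if (args[i].tag != VM_INT) return -1;
    ret->tag = VM_INT;
    ret->i = clamp_int(args[0].i, args[1].i, args[2].i);
    return 0;
}

/* Registration table the VM would search by name: one entry per wrapper. */
struct vm_binding { const char *name; vm_native fn; };
const struct vm_binding vm_bindings[] = {
    { "clamp_int", wrap_clamp_int },
};
```

multiply this by every function, struct, and callback in something like OpenGL or GTK, and the brittleness becomes obvious: the wrapper silently goes stale whenever the underlying prototype changes.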


the result is that many such VMs become a kind of "ivory tower", each representing its own entire world and landscape, both structurally and syntactically isolated from the outside world, and where the choice of language also becomes one of associated code-world.

much effort is wasted, as each major VM/language implementor feels the need to rebuild the world from scratch in their own "new and innovative" language.

often, in an extreme form of this idea, the developers feel compelled to make an implementation of the VM and compiler itself from within this language, most often with this amounting to little more than an academic curiosity (the version of the VM that everyone continues using typically having been written in C or C++).


of all the major VMs, nearly every one I am aware of has these problems to a greater or lesser extent. yes, all this is typically inherent in implementing a VM, but I personally believe that this is one of those things that CAN and SHOULD be battled within the implementation of a VM framework.

we don't need to make a few of the VM's capabilities available by writing the VM in itself. rather, I will assert, it is better to make the VM sufficiently capable of interfacing with the outside world that it is also capable of interfacing with its own implementation (even if it so happens that the whole thing was written in, GASP, C or C++...).


from what I can tell, .NET is one of the few VMs which has addressed these issues in a half-decent manner (though, IMO, .NET is still a long way from perfect in these regards...).


My Effort

this is actually one of the core areas I battle with in my effort, where I am not so much trying to make a new world on top of the old one as trying to leverage the existing world as much as possible, while still pushing forwards with high-level abstractions and dynamic capabilities.

this is also a major reason why I had chosen as my main language, first and foremost, to implement a subset/superset of C99 (a few features were left out for various reasons, but the language is sufficiently implemented that most C code should work without problems).

I also added a few compiler extensions for features which seemed like they could be rather nifty to have, but in general (due to concern over maintaining freedom to compile code with either my compiler or gcc), I have not made as much use of these features as I would have thought. since then, more effort has been put into APIs (with a possible optimization route via intrinsics) than into syntactic extensions (note that these extensions have usually been added within the conformance guidelines set forth in the standards).


as-is, the inner-world of dynamically-compiled C interfaces relatively seamlessly with the outside world, and so, FWIW, the VM is written in its own language.

as it also so happens, most of the language's runtime features are also readily accessible from good old statically compiled and linked code.

the main limitation of C, however, is with C itself:
compiling C code tends to be neither fast nor conservative with memory, due in large part to the rather large amount of stuff typically included from various headers.

object caching is a partial fix to this problem, but does not address the whole range of issues, and it would be nice to be able to compile scripts without causing a notable delay and typically the need to run the GC one or more times...

likewise, precompiled headers are unlikely to be an adequate solution, since I am unaware of any real way to do this effectively apart from potentially violating C's semantics (transforming the header-inclusion system more into a kind of module-import system) and also requiring notable changes to how I approach compiling C code (making the middle and lower stages of the compiler aware of such a module system).

eventually though, such changes may be useful and necessary (especially if I were to support such notions as having multiple runnable C-based "applications" on top of the same framework, say with each having its own partially-independent scope and world state, never mind any obvious multiplexing issues and the likelihood of introducing many of the same kinds of problems that plagued Windows 9x...).


so, for these reasons, both Java and JavaScript have been considered as languages worth implementing, with the goal of being able to implement both while keeping the "wall of impedance" as small as possible (even granted that C, Java, and JS naturally operate in very different worlds). I would also like to do all this without having to write lots of ugly boilerplate (JNI or otherwise) to make everything play well together.

recently, this issue has been one of the greater areas of concern, and has led to the development of an object system which I am trying to make readily useful to all 3 languages: in C via a clean and hopefully not-too-awkward API, and in the others by handling the native semantics of the languages. I am also trying to address the issue of making code and data in one language at least reasonably accessible from the others, without too much additional work.

granted, I have implemented JNI, since it would be a good deal useful for importing an existing classpath library (either GNU Classpath or the one from Apache, but I have yet to decide conclusively at this point).


initially in my effort I have focused exclusively on generating native code (actually, this is backwards from most VMs, which generate bytecode first and only later fall back to native code for the sake of optimization), and although native code offers some fairly strong incentives, it is by no means free of issues: namely, it is generally expensive to produce (at least from C source...), and is (in my implementation) impervious to garbage collection (garbage collecting the C toplevel and executable code is a notably hairier issue than collecting the heap).

so, in my implementation, with a few rare exceptions (specially crafted executable thunks located within heap-based memory) dynamically compiled or linked-in code (such as static libraries) will not be garbage collected...


so, for these reasons I had decided (actually a few months ago now, but I don't have much time for coding as of late, so it is slow) to implement a bytecoded VM.

I had decided to make use of Java bytecode for the VM for various reasons (namely, that there are existing compilers that target it, and it is a fairly "de-facto" bytecode format), but of course this means implementing much of a JVM.

note that I actually decided to write my own JVM, rather than re-using an existing one, in the hope of not ending up with yet another JVM which would create its own "Java Island...", and also because I might want to target other things to it (for example, C).


so, the JVM is actually more of a template for a VM than the entire design (my plans are actually to support a modified version of the bytecode, which adds many features I feel are necessary to "adequately" support languages such as JS and C).

granted, yes, some C compilers do target the JVM, but they do the even less useful thing of creating a new C island inside the JVM, whereas I would want bytecoded scripts to be able to freely access C-land, which requires more than a few extensions to be practical.

granted... yes... the JVM classpath does have a few classes which have the needed functionality, but to implement C in such a way would lead to pitifully horrible performance.

my goal is, however, to retain backwards compatibility with more conventional versions of the JVM, if anything so that hopefully I can make use of compilers like Eclipse and not have to needlessly write my own Java compiler as well (granted... I am not certain at this point if Eclipse can be used as a dynamic script compiler, having not looked much into it, but it would make sense given the existing JVM supports this capability...).


it also remains as a possibility to include features for running AVM2 bytecode as well (either via recompiling the bytecode or via separate loaders and a separate interpreter loop).

note that the JVM proper (and likely AVM2 support as well) would be a small portion of the total project complexity (since, for the most part, everything is being based on already existing functionality from within the framework, and the parts I am implementing for the JVM should also apply without too much effort to the AVM, as I have noticed more than a few likely exploitable similarities between the VMs...).

all this is not likely to be too much of a problem, since, after all, unlike most VM frameworks mine is highly decentralized (most of the components exist as separate libraries, and many have little or no direct interaction with the others and are designed to be replaceable with another component so long as it serves a similar role/produces similar output/accepts similar input/...).


there are actually several major components within all this that communicate primarily via opaque representations (such as a specialized textual representation or via XML).
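
as a toy illustration of components talking only through a textual representation, consider the following C sketch. the "push/add/mul" text format and both functions are invented for this example and are far simpler than anything in my framework. the "front end" only knows how to write text, the "back end" only knows how to read it, and either side could be swapped out so long as the text format stays the same:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* "Front end": emits a textual stack-machine program for (a + b) * c.
   Returns the number of characters written, as reported by snprintf. */
int emit_expr(char *buf, size_t size, int a, int b, int c)
{
    return snprintf(buf, size, "push %d\npush %d\nadd\npush %d\nmul\n", a, b, c);
}

/* "Back end": knows nothing about the front end, only the text format.
   Returns 0 and stores the result on success, -1 on malformed input. */
int run_text(const char *text, int *result)
{
    int stack[32];
    int sp = 0;
    const char *line = text;

    assert(text != NULL && result != NULL);
    while (*line != '\0') {
        int n;
        if (sscanf(line, "push %d", &n) == 1) {
            if (sp == 32) return -1;              /* stack overflow */
            stack[sp++] = n;
        } else if (strncmp(line, "add", 3) == 0 || strncmp(line, "mul", 3) == 0) {
            if (sp < 2) return -1;                /* not enough operands */
            sp--;
            if (line[0] == 'a') stack[sp - 1] += stack[sp];
            else                stack[sp - 1] *= stack[sp];
        } else {
            return -1;                            /* unknown instruction */
        }
        line = strchr(line, '\n');                /* advance to the next line */
        if (line == NULL) break;
        line++;
    }
    if (sp != 1) return -1;
    *result = stack[0];
    return 0;
}
```

the cost is the extra printing and parsing at each boundary; the gain is that neither component needs the other's headers or data structures.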

or such...


