On Thursday 03 January 2002 08:40 pm, Paul Baranowski wrote:
> Hi -
> I love what you guys are doing with Parrot. I was just recently
> wondering if it would be possible to transform a program compiled down
> to machine language into byte-code, thereby automatically porting any
> app to any other machine (at least any statically-compiled app). Does
> anyone see any technical problems in doing this?

Let's tackle the difficulties in order. (I apologize if this seems flippant. Not the intention.)

First, we need to be able to read the program format: ELF, DWARF, .exe, what-have-you. This is relatively clear-cut - a lot of metadata, your different segments, etc. Some of the info may not be pertinent, some may not map well. We'll assume we find a valid use for what we have, and that we can produce anything we need. (After all, it had to have been producible at some point.)

Next, we need to understand the program inside the program - the opcode instructions. This is mostly straightforward. This code is this instruction, these two are its arguments (which are these), so on and so forth. We should be able to figure out what operations are occurring on what data with relative ease.

Now, we have to figure out the semantics of those operations. Some of them are quite obvious - when we divide an integer by an integer, we are going to get an integer back. Some aren't. When we make the system call "fstat", for instance, what exactly does that mean?

Plus, we have to figure out what the data means. Is this a path? Do we need to change the path delimiter? Does the path refer to an absolute location - /home - or some effective location - the home directories? It is largely up to the programmer to provide portable, consistent data, so we shall not address it. Data is *always* the scourge of portability.

Let's assume that we've got some reasonable results from the above steps, and we're ready to port to our byte-code model.

Working in reverse order, we need to make sure that we can reproduce the semantics of the program. Does this mean that we need to call the system call "fstat", and use whatever it decides to return? What if it doesn't exist? What about interpreting the results so that the remaining semantics stay true to the original program's intent?

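To make the "fstat" question concrete, here is a minimal sketch in Python of one possible answer: a bytecode-level stat operation that returns only the fields likely to mean the same thing everywhere, and marks the rest as absent. The names portable_fstat and demo, and the choice of fields, are assumptions made up for illustration; nothing here is Parrot's actual design.

```python
import os
import stat
import tempfile


def portable_fstat(fd):
    """Hypothetical VM-level 'stat' op (illustrative name, not a real API).

    Returns only fields that mean roughly the same thing on every host,
    and uses None for information the host does not provide.
    """
    raw = os.fstat(fd)
    return {
        "size": raw.st_size,
        "mtime": raw.st_mtime,
        "is_dir": stat.S_ISDIR(raw.st_mode),
        "is_regular": stat.S_ISREG(raw.st_mode),
        # Block counts are not reported on every platform;
        # None here means "not applicable on this host".
        "blocks": getattr(raw, "st_blocks", None),
    }


def demo():
    """Write five bytes to a scratch file and stat it through the wrapper."""
    with tempfile.TemporaryFile() as f:
        f.write(b"hello")
        f.flush()
        return portable_fstat(f.fileno())
```

A translated program that only looked at "size" would keep working on any host. One that depended on "blocks" would have to cope with None, which is exactly the kind of decision the original binary never had to make.
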
Does it mean that we need to convert to bytecode that retrieves the stat info, no matter how that may be, so that we may preserve the semantics directly? What of information that may not be applicable from one platform to the next?

It's safe to assume that opcodes are not going to map one-to-one. So we now need to take the opcode stream, with all its arguments and semantics, and map it somehow onto bytecode. This part should be relatively straightforward, assuming that everything is mappable. However, there are bound to be some sticky parts. What about op-and-a-half ops? (Bytecode that does slightly more than one native op, and slightly less than two.) Do argument sizes need to come into play?

Now, you need to reassemble your mapped bytecode into a format the interpreter can read. Maybe some fixup stuff. Maybe just some packing things in there.

Presto. If you're successful, you've a portable program.

But let's think about this a little further. If you're able to deconstruct problems 1, 2, and 3 for a particular platform, should you be able to construct the reverse for the same? It shouldn't be an issue of one-to-many mappings not being invertible, because you don't need to find *the* original program, simply one of the many. The problems are exactly as described for the virtual machine - it is, after all, a machine.

This is basically what machine emulation is - it's what allows me to play my old Commodore 64 favorites on my Linux box. Or Amiga. Or Atari 2600. Or Atari ST. An example of reducing many different frontends to a single backend.

Of course, solving it for bytecode requires a single solution, whereas solving it for multiple platforms requires multiple solutions. But this is exactly what GCC does from its intermediate format to the eventual platform binary that it produces.

The obvious conclusion, then, is why limit yourself to an interpreted bytecode stream, when you can have native speed and still be portable?

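The opcode-mapping step described earlier, including the case where one native op needs more than one bytecode op, might look like this in miniature. Both instruction sets are toys invented for this sketch (they are not x86 and not Parrot bytecode), and translate is a hypothetical name.

```python
def translate(native_ops):
    """Map a toy native instruction list onto a toy three-address bytecode."""
    out = []
    for op, *args in native_ops:
        if op == "mov":
            # One native op maps to exactly one bytecode op.
            dst, src = args
            out.append(("set", dst, src))
        elif op == "add":
            # Two-operand native add becomes a three-operand bytecode add.
            dst, src = args
            out.append(("add", dst, dst, src))
        elif op == "divmod":
            # One native op maps to two bytecode ops.
            # Assumes quot and rem are distinct from a and b; otherwise
            # the first op would clobber an input of the second.
            quot, rem, a, b = args
            out.append(("div", quot, a, b))
            out.append(("mod", rem, a, b))
        else:
            # Anything without a known mapping is a hard stop.
            raise NotImplementedError(f"no bytecode mapping for {op!r}")
    return out


program = [
    ("mov", "r0", 7),
    ("add", "r0", "r1"),
    ("divmod", "r2", "r3", "r0", "r1"),
]
bytecode = translate(program)
```

Even at this size the sticky parts show up: the divmod case only works under a register-aliasing assumption, and an op with no mapping stops the whole translation.
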
That conclusion, in turn, raises the question: why isn't it being done?

The obvious answer is that we do, but that we just take shortcuts. Since steps one through four would be so much simpler if the platforms were similar (or identical), we create our own virtual platform, with its own program format and semantics - source code in a standardized language.

The other obvious answer is that we don't, because it's just too difficult, if not impossible, to do, and do well.

So is it worth the effort to pursue?

--
Bryan C. Warnock
[EMAIL PROTECTED]