On Thursday 03 January 2002 08:40 pm, Paul Baranowski wrote:
> Hi -
> I love what you guys are doing with Parrot.  I was just recently
> wondering if it would be possible to transform a program compiled down
> to machine language into byte-code, thereby automatically porting any
> app to any other machine (at least any statically-compiled app).  Does
> anyone see any technical problems in doing this?

Let's tackle the difficulties in order.  (I apologize if this will seem 
flippant.  Not the intention.)

First, we need to be able to read the program format: ELF, DWARF, .exe, 
what-have-you.  This is relatively clear-cut - a lot of metadata, your 
different segments, etc.  Some of the info may not be pertinent, some may 
not map well.  We'll assume we find a valid use for what we have, and that 
we can produce anything we need.  (After all, it had to have been producible 
at some point.)
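
As a rough sketch of that first step, here is how the leading fields of an 
ELF header could be pulled apart.  The function name read_elf_ident is my 
own invention for illustration, and real headers carry far more than this 
(section tables, entry point, and so on), so treat it as a starting point 
only.

```python
import struct

def read_elf_ident(data: bytes) -> dict:
    """Decode a few leading ELF header fields from raw bytes (illustrative only)."""
    if len(data) < 20 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    bits = {1: 32, 2: 64}.get(data[4])            # EI_CLASS
    byte_order = {1: "<", 2: ">"}.get(data[5])    # EI_DATA
    if bits is None or byte_order is None:
        raise ValueError("unrecognized ELF class or data encoding")
    # e_type and e_machine are two 16-bit fields right after the 16-byte e_ident.
    e_type, e_machine = struct.unpack_from(byte_order + "HH", data, 16)
    return {
        "bits": bits,
        "little_endian": byte_order == "<",
        "type": e_type,
        "machine": e_machine,
    }
```

Everything past these few fields is where the "some may not map well" part 
starts to bite.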

Next, we need to understand the program inside the program - the opcode 
instructions.  This is mostly straightforward.  This code is this 
instruction, these two are its arguments (which are these), so on and so 
forth.  We should be able to figure out what operations are occurring on what 
data with relative ease.

Now, we have to figure out the semantics of those operations.  Some of them 
are quite obvious - when we divide an integer by an integer, we are going to 
get an integer back.  Some aren't.  When we make the system call "fstat", 
for instance, what exactly does that mean?

Plus, we have to figure out what the data means.  Is this a path?  Do we 
need to change the path delimiter?  Does the path refer to an absolute 
location - /home - or some effective location - the home directories?  It is 
largely up to the programmer to provide portable, consistent data, so 
we shall not address it.  Data is *always* the scourge of portability.

Let's assume that we've got some reasonable results from the above steps, 
and we're ready to port to our byte-code model.  Working in reverse order,
we need to make sure that we can reproduce the semantics of the program.  
Does this mean that we need to call the system call "fstat", and use 
whatever it decides to return?  What if it doesn't exist?  What about 
interpreting the results so that the remaining semantics stay true to the 
original program's intent?  Does it mean that we need to convert to bytecode 
that retrieves the stat info, no matter how that may be, so that we may 
preserve the semantics directly?  What of information that may not be 
applicable from one platform to the next?
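
One possible answer to the "fstat" question is to translate the call into 
a request for only the information that means the same thing everywhere.  
A small sketch, where portable_fstat and the choice of fields are my own 
assumptions rather than anything Parrot defines:

```python
import os

# Assumed "portable" subset; fields such as st_uid or st_ino are left out
# because they do not carry the same meaning on every platform.
PORTABLE_FIELDS = ("st_size", "st_mtime", "st_mode")

def portable_fstat(fd: int) -> dict:
    """Return only the stat fields we are willing to promise on every host."""
    info = os.fstat(fd)
    return {name: getattr(info, name) for name in PORTABLE_FIELDS}
```

The catch is obvious: a program that really did depend on one of the 
dropped fields no longer has its semantics preserved.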

It's safe to assume that opcodes are not going to map one-to-one.  So we now 
need to take the opcode stream, with all its arguments and semantics, and 
map them somehow onto bytecode.  This part should be relatively 
straightforward, assuming that everything is mappable.  However, there are 
bound to be some sticky parts.  What about op-and-a-half ops?  (Bytecode 
that does slightly more than one native op, and slightly less than two.) Do 
argument sizes need to come into play?
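
Here is a toy illustration of a mapping that is not one-to-one.  The native 
op names ("mov", "add_mem"), the bytecode op names, and the scratch 
register "t0" are all hypothetical; the point is only that one native 
instruction may expand into several bytecode ops.

```python
def translate(native_ops: list) -> list:
    """Map (name, dst, src) native ops onto a made-up bytecode op list."""
    scratch = "t0"  # assumed free temporary register in the target VM
    bytecode = []
    for name, dst, src in native_ops:
        if name == "mov":
            # Direct one-to-one mapping.
            bytecode.append(("set", dst, src))
        elif name == "add_mem":
            # The native op adds a memory operand directly; the bytecode
            # needs an explicit load first, so one op becomes two.
            bytecode.append(("load", scratch, src))
            bytecode.append(("add", dst, scratch))
        else:
            raise NotImplementedError(f"no mapping for {name!r}")
    return bytecode
```

The NotImplementedError branch is where the sticky parts end up.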

Now, you need to reassemble your mapped bytecode to a format the interpreter 
can read.  Maybe some fixup stuff.  Maybe just some packing things in there.
Presto.  If you're successful, you've a portable program.
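
The packing step might look something like the following.  The "TOYB" 
magic and the record layout are invented for the example and have nothing 
to do with Parrot's actual bytecode file format.

```python
import struct

MAGIC = b"TOYB"      # hypothetical file signature
RECORD = "<BHH"      # opcode byte plus two 16-bit operands, little-endian
RECORD_SIZE = struct.calcsize(RECORD)

def pack_program(ops: list) -> bytes:
    """Serialize (opcode, a, b) tuples behind a magic number and a count."""
    body = b"".join(struct.pack(RECORD, op, a, b) for op, a, b in ops)
    return MAGIC + struct.pack("<I", len(ops)) + body

def unpack_program(blob: bytes) -> list:
    """Inverse of pack_program; rejects blobs with the wrong signature."""
    if blob[:4] != MAGIC:
        raise ValueError("bad magic")
    (count,) = struct.unpack_from("<I", blob, 4)
    return [struct.unpack_from(RECORD, blob, 8 + RECORD_SIZE * i)
            for i in range(count)]
```

Fixing the byte order in the format, rather than using the host's, is what 
lets the same file be read on any machine.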

But let's think about this a little further.  If you're able to deconstruct 
problems 1, 2, and 3 for a particular platform, should you be able to 
construct the reverse for the same?  It shouldn't be an issue of one-to-many 
mappings not being invertible, because you don't need to find *the* original 
program, simply one of the many.  The problems are exactly as described for 
the virtual machine - it is, after all, a machine.  

This is basically what machine emulation is - it's what allows me to play my 
old Commodore 64 favorites on my Linux box.  Or Amiga.  Or Atari 2600.  Or 
Atari ST.  An example of reducing many different frontends to a single 
backend.

Of course, solving it for bytecode requires a single solution, whereas 
solving it for multiple platforms requires multiple solutions.  But this is 
exactly what GCC does from its intermediate format to the eventual platform 
binary that it produces.

The obvious question, then, is why limit yourself to an interpreted 
bytecode stream, when you can have native speed and still be portable?
That, in turn, raises the question: why isn't it being done?

The obvious answer is that we do, but that we just take shortcuts.  Since 
steps one through four would be so much simpler if the platforms were 
similar (or identical), we create our own virtual platform, with its own 
program format and semantics - source code in a standardized language. 

The other obvious answer is that we don't, because it's just too difficult, 
if not impossible, to do, and do well.  So is it worth the effort to pursue?

-- 
Bryan C. Warnock
[EMAIL PROTECTED]