my effort had not gone nearly so high "up" the abstraction tree, but instead
operated in a space more like an abstracted x86 machine.
moving to a much higher level model, such as that of GCC IR or LLVM IR,
would likely be difficult to pull off effectively starting from "real"
machine code, such as x86.
what I had done was essentially just partly inverting several of the
low-level stages in the process:
assembly: since my assembler was mostly data-driven, disassembly is not
difficult using essentially the same data;
partly abstracting over matters of word-size and opcode argument forms;
...
then, it was this partially-abstracted form which was interpreted.
this was at a similar level of abstraction to that in my lower-level codegen
(namely dealing with registers and values as handles), rather than making it
all the way back up to a target-neutral IR.
converting to a higher-level IR would likely require something analogous to
a compiler+optimizer, namely to translate these decoded instructions into
generic IR sequences, and then try to optimize away all the cruft which
doesn't matter (such as all the "eflags" magic for sequences which don't
actually care about eflags).
...
the "eflags" issue arises mostly because, for example, in x86 nearly every
conventional opcode modifies eflags, but in the majority of cases, these
changed flags are irrelevant (however, a forward scan and bit-masking could
likely allow for detecting cases where the modified flags are known to be
irrelevant).
also, x86 includes a small number of "very complex" opcodes, such as
"cpuid", which could be awkward if trying to produce an entirely generic IR
(since cpuid changes its behavior and results depending on the values
contained in certain registers), ...
this level of translation, though, is likely to either rule out or hinder
the use of self-modifying code (SMC), since SMC would essentially invalidate
previously translated sequences.
(in my case I had dealt with SMC simply by flushing the entire opcode cache,
which in this case was essentially just a big hash-table holding "opcode"
structures).
ideally, with a more complex "decode" process, flushing on SMC could be
done cheaply and incrementally, rather than, say, essentially having to
continuously recompile an app in memory simply because it happens to be
self-modifying.
luckily, most executable code is marked as read-only, and SMC cases are
fairly rare, and so attempts at SMC are more often grounds for a simulated
GPF, rather than grounds for flushing the decode cache.
static translation, however, is likely to exclude the possibility of SMC.
or such...
----- Original Message -----
From: "Monty Zukowski" <[email protected]>
To: "Fundamentals of New Computing" <[email protected]>
Sent: Tuesday, June 22, 2010 8:37 AM
Subject: Re: [fonc] Reverse OMeta and Emulation
GNU C was explicitly designed to make its intermediate representation
hard to work with. LLVM is a more practical choice.
Monty
On Mon, Jun 21, 2010 at 6:02 PM, Gerry J <[email protected]> wrote:
You may find the concept of semantic slicing relevant:
http://www.cse.dmu.ac.uk/~mward/martin/papers/csmr2005-t.pdf
There is software at:
http://www.cse.dmu.ac.uk/~mward/fermat.html
One possible path to explore is to take GNU C etc intermediate
representation of source as the "assembly language" of a VM and reverse
from
that to a more portable VM, as in Squeak or Java.
Perhaps Ometa could be combined in some way with FermaT to recognise
patterns and port legacy code to a fonc VM ?
Regards,
Gerry Jensen
02 9713 6004
_______________________________________________
fonc mailing list
[email protected]
http://vpri.org/mailman/listinfo/fonc