(Note: most of this message probably isn't very useful; it's about theoretical software architecture that nobody's going to implement, that I can't prove, and that I'm not really 100% sure about. Still, if you WANT to read it, hey... remember, bad ideas sometimes get corrected by people smart enough to turn them into GOOD ideas.)
Ivan Krstić wrote:
> On Dec 18, 2007, at 12:27 PM, Jameson Chema Quinn wrote:
>> Has anyone looked at Psyco on the XO?
> Psyco improves performance at the cost of memory. On a memory-constrained machine, it's a tradeoff that can only be made in laser-focused, specific cases. We have not done the work -- partly for
It would be wise to throw out the idea of laser-focusing on which engine to use -- think of the memory cost of running multiple versions of Python at once. Then again, what IS wise?
Any such system needs to use memory efficiently. I like the idea of one based on Mono, since its compacting garbage collector (although a cache destroyer by nature) at least shrinks memory usage down. Of course, then you still have Mono on top of it, plus the CIL code that's been generated, reflected, and JIT'd -- which means you (again) have two interpreters in memory (one written in CIL, one being Mono itself), one of them dynamically recompiled (the Python one, in CIL), and all the intermediary CIL code gets kept around for later profiling and optimization...
... didn't I say before that I hate the concept of JIT dynamic compilation? Interpreters just suck by nature, due to dcache problems (code becomes data: your instruction working set stays effectively fixed, so as the program gets bigger the load doesn't spread across both icache and dcache) and due to the fact that you have to do a LOT of work to decode an insn in software (THINK ABOUT QEMU). Interpreters for specific scripting languages like Python and Perl have the advantage of not having to be general CPU emulators, so they can have instructions that are just function calls into native code.
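To make the "decode an insn in software" cost concrete, here's a toy Python sketch of an emulator inner loop. The 16-bit instruction format and the two opcodes are invented for illustration -- the point is just that every instruction pays the shift-and-mask decode tax on every trip through the loop, work a real CPU does in dedicated silicon:

```python
# Hypothetical 16-bit instruction format (made up for this sketch):
#   bits 12-15: opcode, bits 8-11: dest reg, bits 4-7: src reg, bits 0-3: immediate
def decode(insn):
    op  = (insn >> 12) & 0xF
    rd  = (insn >> 8) & 0xF
    rs  = (insn >> 4) & 0xF
    imm = insn & 0xF
    return op, rd, rs, imm

def run(program, regs):
    # Every instruction pays the decode cost, every time through the loop;
    # both the code being emulated and its decode tables live in dcache.
    for insn in program:
        op, rd, rs, imm = decode(insn)
        if op == 0:      # LOADI rd, imm
            regs[rd] = imm
        elif op == 1:    # ADD rd, rs
            regs[rd] += regs[rs]
    return regs

# regs[1] = 5; regs[2] = 3; regs[1] += regs[2]  ->  regs[1] == 8
regs = run([0x0105, 0x0203, 0x1120], [0] * 16)
```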
So, in order of execution time:

Native code (*1) < JIT (*2) < Specific language interpreter (*3) < General bytecode interpreter (*4) < Parser script interpreter (*5)

*1: Native code. C, Obj-C, something compiled. Everything else I could mention is out of date.
*2: Technically, JIT output is native code, but there are extra considerations: memory use goes up, and cache pressure comes into play slightly. Once the ball gets rolling it just eats more memory, but cache behavior and execution speed are fine.
*3: A specific language interpreter might call a native strcpy() function directly, instead of having a CALL insn that goes into a bytecode implementation of strcpy(), or a CALL insn that goes into a bytecode strcpy() stub which just sets up a binding and calls the real native strcpy(). The interpreter heads straight for native land: "function foobar() gets assigned token 0x?? and I'll know what to do when I see it."
*4: A general bytecode interpreter has to be a CPU emulator. Java and Mono count, for the Java and CIL "CPUs". Those CPUs don't physically exist, but the interpreters work that way -- they even have their own assembly.
*5: Some script engines are REALLY FREAKING DUMB and actually send each line through the parser every time they see it, which is megaslow. These usually don't last, or just serve as proofs of concept until a real bytecode translator gets written, turning them into specific language interpreters.
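The *3 idea -- "token 0x?? goes straight to native" -- can be sketched in a few lines of Python. The token numbers and the two-entry table are made up; the point is that dispatch is one table lookup followed by a jump into native code (here, CPython's C-implemented builtins), with no CPU emulation in between:

```python
# A specific-language interpreter doesn't have to emulate a CPU; it can
# bind a token directly to a native routine at translation time.
# Token values here are invented for illustration.
NATIVE = {
    0x01: len,        # CPython's len() is native C code
    0x02: str.upper,  # so is str.upper()
}

def execute(bytecode, stack):
    for token in bytecode:
        fn = NATIVE[token]             # one table lookup...
        stack.append(fn(stack.pop()))  # ...then straight into native land
    return stack

# upper("hello") -> "HELLO", then len("HELLO") -> 5
execute([0x02, 0x01], ["hello"])
```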
Maybe, MAYBE, by twiddling with a JIT, you could convince it to discard generated bytecode. For example, assume we're talking about a Python implementation on top of Mono, and that we can modify Mono any way we want with reasonably little effort:
- Python -> internal tree (let's say Gimple, like gcc)
- Gimple -> optimizer (Python-specific)
- Gimple (opt) -> optimizer (general)
- Gimple (opt) -> CIL data (for reflection)
- FREE: Gimple
- CIL (data) -> Reflection (CIL)
- FREE: CIL data (for reflection)
- CIL -> CIL optimizer
- CIL (opt) -> JIT (x86)
- While (not satisfied):
  - The annoying process of dynamic profiling
  - CIL (opt, profiled) -> JIT (x86)
- FREE: CIL

NOTE: at the "FREE: CIL data" step, we are talking about the Python interpreter freeing the CIL data that *it* holds; Mono has by then loaded its own copy as CIL code, and we don't need to hand it over again, so we're done with it.
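A toy Python model of that pipeline, just to make the free-as-you-go memory discipline concrete. Nothing here is real gcc or Mono machinery -- the stage names are placeholders, and the dict simply tracks which representations are still resident:

```python
# Toy model of the staged pipeline above: each stage produces the next
# representation and the intermediate ones are freed as soon as nothing
# downstream needs them, so only the final JIT output stays resident.
def compile_pipeline(source):
    live = {}                                   # representations held in memory
    live["gimple"] = ("gimple-opt", source)     # tree build + both optimizer passes
    live["cil"] = ("cil-opt", live["gimple"])   # emit CIL, reflect, optimize
    del live["gimple"]                          # FREE: Gimple
    live["native"] = ("x86", live["cil"])       # JIT (profiling loop elided)
    del live["cil"]                             # FREE: CIL (the interpreter's copy)
    return live                                 # only native code remains

# Only the "native" representation survives to the end.
result = compile_pipeline("print(1)")
```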
At this point we should have:

- A CIL program for a Python interpreter
- A CIL interpreter (Mono)
- x86 native code for the program

Further, you should be able to make the Python interpreter do a number of things:
- Translate any Python-written libraries via JIT, on a method-for-method basis
- Translate Python bindings (Python calling C) into active CIL bindings (to avoid calling back into the interpreter)
- Unload most of itself when done (say, when it's been unused for about 5 minutes of execution time), save for a stub that loads the Python interpreter back into memory AGAIN when a new method gets called, so that it can be dynamically compiled

Thus, you should be able to nearly eliminate the CIL Python interpreter from memory, leaving just Mono and the program itself, already JIT'd to native code. You would need a small fragment of the Python interpreter resident to handle any entry back into it, with a single function that loads the engine again; each re-entry point would just load the whole engine and then jump to the actual handler in it. That stub isn't much code (if you need a whole page for it, I'd be surprised).
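The unload-and-reload stub can be sketched in plain Python using module unloading as a stand-in for dropping the engine from memory. `LazyEntry` is a made-up name, and the stdlib `json` module stands in for the heavyweight interpreter; the real thing would be native code, but the shape is the same -- a tiny resident trampoline that re-loads the engine on the next call:

```python
import importlib
import sys

# Hypothetical re-entry stub: the big "engine" module gets dropped from
# memory, leaving only this trampoline resident. The first call back in
# re-imports the engine and forwards the call to the real handler.
class LazyEntry:
    def __init__(self, module_name, func_name):
        self.module_name = module_name
        self.func_name = func_name

    def unload(self):
        # Forget the loaded module so it can be freed / garbage collected.
        sys.modules.pop(self.module_name, None)

    def __call__(self, *args, **kwargs):
        # Re-entry: load the whole engine again, then jump to the handler.
        mod = importlib.import_module(self.module_name)
        return getattr(mod, self.func_name)(*args, **kwargs)

# "json" stands in for the engine; dumps for the handler being re-entered.
entry = LazyEntry("json", "dumps")
entry.unload()
```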
Mind you, there are a number of flaws in this argument; you probably noticed most of them:
- IronPython's license is not entirely acceptable, and nobody is going to write a SECOND Python-to-CIL dynamic compiler.
- You can get an IL stream for any compiled method, so Mono won't free CIL stuff. It may actually be small enough not to care about -- or not; I believe it's actually too big for this to be feasible. MAYBE you could add something to Mono that allows flushing it permanently, on purpose (i.e. at the Python interpreter's request).
- You're still dealing with JIT'd code, which is still not shareable. Mono seems to put it in WX segments, so by counting those (in kilobytes) I can ascertain the exact size of the executable code. Because it's not shared, it doesn't get evicted from memory when unused, the way normal .so files or /bin executables do. I have a memory analysis script that does the trick if I have bash play with its output; here's what Tomboy looks like on x86-64, about 13MB:

$ echo $(( $(~/memuse.sh 19132 | grep "p:wx" | cut -f1 -d' ' | \
tr -d 'K' | tr -d 'B' | xargs | sed -e "s/ / + /g" ) ))
13544
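For what it's worth, the same sum can be computed without the bash arithmetic by reading /proc/<pid>/maps directly (that's presumably where memuse.sh gets its data anyway). This sketch takes the maps text as a string; the permission test assumes Linux's maps format, where private writable+executable mappings show up as "rwxp":

```python
# Sum the sizes (in KB) of private writable+executable mappings from the
# text of /proc/<pid>/maps -- that's where JIT'd code shows up on Linux.
def wx_kilobytes(maps_text):
    total = 0
    for line in maps_text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        addr, perms = parts[0], parts[1]
        # Private (p) mapping that is both writable and executable.
        if "w" in perms and "x" in perms and perms.endswith("p"):
            start, end = (int(x, 16) for x in addr.split("-"))
            total += (end - start) // 1024
    return total

# e.g. wx_kilobytes(open("/proc/19132/maps").read())
```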
I like to think of programs like kernels, or kernels like programs. Either way, I like to treat applications like microkernels. In the embedded scene this may actually be critical; maybe you should think that way, at least in part, for the XO. (Re: the part above about unloading the entire Python interpreter except for a little stub that reloads it when needed...)
-- Ivan Krstić <[EMAIL PROTECTED]> | http://radian.org
-- Bring back the Firefox plushy! http://digg.com/linux_unix/Is_the_Firefox_plush_gone_for_good https://bugzilla.mozilla.org/show_bug.cgi?id=322367
memuse.sh
Description: application/shellscript
_______________________________________________
Devel mailing list
[email protected]
http://lists.laptop.org/listinfo/devel
