A nice mix is to do memory management and control logic (e.g. deciding when to stop iterating) in Smalltalk and to have C-callable "primitives" for the heavy loops. A good way to get the latter is to declare the functions extern "C"; you can then still use C++ features (streams, templates) in the function bodies. IMHO, C++ with some suitable operator overloading does a fairly nice job of formula translation, and it is a good fit for fixed-size arithmetic.
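To make the extern "C" idea concrete, here is a minimal sketch of what such a primitive might look like. The names (prim_dot_f64, prim_axpy_f64, dot_impl) are made up for illustration and are not part of any VM or plugin API; how the symbols get loaded and called from the image (FFI, plugin, etc.) is assumed and not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>

namespace {

// Hypothetical helper: a template, so the same loop body can serve
// double, float, or integer buffers. Not visible outside this file.
template <typename T>
T dot_impl(const T* a, const T* b, std::size_t n) {
    // Precondition: null pointers are only acceptable for empty input.
    assert((a != nullptr && b != nullptr) || n == 0);
    return std::inner_product(a, a + n, b, T{});
}

}  // namespace

// extern "C" turns off C++ name mangling, so each function is exported
// under its plain name and can be called like an ordinary C function.
// Only C-compatible types should appear in the signatures, and no
// exception should be allowed to escape across this boundary.
extern "C" double prim_dot_f64(const double* a, const double* b,
                               std::size_t n) {
    return dot_impl(a, b, n);
}

// In-place y := alpha * x + y over n elements (assumed example of a
// "heavy loop"; the caller owns and manages both buffers).
extern "C" void prim_axpy_f64(double alpha, const double* x, double* y,
                              std::size_t n) {
    assert((x != nullptr && y != nullptr) || n == 0);
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += alpha * x[i];
    }
}
```

The Smalltalk side would keep ownership of the buffers and decide how many times to call in; the C++ side only sees raw pointers and a length, which keeps the boundary simple.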
If Cog can make the above optional, so much the better.

________________________________________
From: [email protected] [[email protected]] On Behalf Of John B Thiel [[email protected]]
Sent: Thursday, February 17, 2011 9:21 AM
To: [email protected]
Subject: [Pharo-project] Cog VM -- Thanks and Performance / Optimization Questions

Cog VM -- Thanks and Performance / Optimization Questions

To Everyone, thanks for your great work on Pharo and Squeak, and to Eliot Miranda, Ian Piumarta, and all VM/JIT gurus, especially thanks for the Squeak VM Cog and its precursors, which I have been keenly anticipating for a decade or so, and which is really hitting its stride with the latest builds.

I like to code with awareness of performance issues. Can you tell or point me to some performance and efficiency tips for Cog and the Squeak compiler -- detail on which methods are inlined, best among alternatives, etc. For example, I understand #to:do: is inlined -- what about #to:by:do: and #timesRepeat: and #repeat ? Basically, I would like to read a full overview of which core methods are specially optimized (or planned).

I know about the list of NoLookup primitives, as per Object class>>howToModifyPrimitives, supposing that is still valid?

What do you think is a reasonable speed factor for number-crunching Squeak code vs C ? I am seeing about 20x slower in the semi-large scale, which surprised me a bit because I got about 10x on smaller tests, and a simple fib: with beautiful Cog is now about 3x (wow!). That range, 3x for a tiny tight loop to 20x for general multi-class computation, seems a bit wide -- is it about expected?

My profiling does not reveal any hotspots, as such -- it's basically 2, 3, 5% scattered around, so I envision this is just the general vm/jit overhead as you scale up -- referencing distant objects, slots, dispatch lookups, more cache misses, etc. But maybe I am generally using some backwater loop/control methods, techniques, etc. that could be tuned up. e.g. I seem to recall a trace at some point showing #timesRepeat: taking 10% of the time (?!). Also, I recall reading about an anomaly with BlockClosures -- something like being rebuilt every time thru the loop -- has that been fixed? Any other gotchas to watch for currently?

(Also, any notoriously slow subsystems? For example, Transcript writing is glacial.)

The Squeak bytecode compiler looks fairly straightforward and non-optimizing -- just statement by statement translation. So it misses e.g. chances to store and reuse, instead of pop, etc. I see lots of redundant sequences emitted. Are those kinds of things now optimized out by Cog, or would tighter bytecode be another potential optimization path? (Is that what the Opal project is targeting?)

-- jbthiel
