On Tue, Oct 13, 2009 at 10:04 PM, Hugh Aguilar <[email protected]> wrote:
> I said in the book that Factor makes an excellent prototyping language for
> Forth. The programmer can get a prototype working in Factor, and only
> concern himself with the integer-overflow issue when he is porting it to
> Forth for the production version. This is better than trying to write a
> program from scratch, which involves a lot of high-level thinking, while
> simultaneously dealing with a lot of low-level issues --- as done in Forth
> right now.
One nice thing is that in Factor you can write high-level arithmetic
code that still performs fast, because the compiler performs type
inference and takes advantage of various arithmetic identities to
eliminate unneeded generality.
For example, suppose you have this piece of code:
255 bitand 300 * 17 +
Naively, it would need to dispatch between bignums and fixnums in
'bitand', '*' and '+', and '*' and '+' would have overflow checks.
However, the compiler knows that the output of '255 bitand' is a
fixnum in the range 0..255, and the rest follows:
( scratchpad ) [ 255 bitand 300 * 17 + ] optimized.
[
255 >R >fixnum R> fixnum-bitand 300 fixnum*fast
17 fixnum+fast
]
Notice that the 'optimized.' word rewrote the high-level Factor code
into something that still looks sort of like Factor, but uses
low-level words whose names suggest they are primitive operations of
some sort.
These low-level words map more or less to machine instructions, but
only after some more optimization passes which use registers instead
of the stack. We can ask the optimizer to go down a level and show us
the code after conversion to register-based form.
( scratchpad ) [ 255 bitand 300 * 17 + ] test-mr.
=== word: ( gensym ), label: ( gensym )
_label 0
_prologue T{ stack-frame { total-size 32 } }
_label 1
##load-immediate RAX 2040
##inc-r 1
##replace RAX R 0
_label 2
##call >fixnum
_label 3
##peek RAX D 0
##peek RCX R 0
##and RAX RAX RCX
##mul-imm RAX RAX 300
##add-imm RAX RAX 136
##inc-r -1
##replace RAX D 0
_label 4
_epilogue T{ stack-frame { total-size 32 } }
##return
_spill-area-size 0
There is a subroutine call to convert the first input to a fixnum (the
##call instruction), but the rest of the operations are done with
machine instructions without even touching the stack, except to load
input values and store the result. Intermediate values are in
registers and the constants are immediate operands.
Those ##instructions are not bytecode for an interpreter or anything;
they map more or less to x86 or PowerPC instructions:
( scratchpad ) [ 255 bitand 300 * 17 + ] disassemble
000000010f0aa8e0: 49b8e0a80a0f01000000 mov r8, 0x10f0aa8e0
000000010f0aa8ea: 6820000000 push dword 0x20
000000010f0aa8ef: 4150 push r8
000000010f0aa8f1: 4883ec08 sub rsp, 0x8
000000010f0aa8f5: 48b8f807000000000000 mov rax, 0x7f8
000000010f0aa8ff: 4983c708 add r15, 0x8
000000010f0aa903: 498907 mov [r15], rax
000000010f0aa906: e8d59cdbfe call 0x10de645e0
000000010f0aa90b: 498b06 mov rax, [r14]
000000010f0aa90e: 498b0f mov rcx, [r15]
000000010f0aa911: 4821c8 and rax, rcx
000000010f0aa914: 4869c02c010000 imul rax, rax, 0x12c
000000010f0aa91b: 4881c088000000 add rax, 0x88
000000010f0aa922: 4983ef08 sub r15, 0x8
000000010f0aa926: 498906 mov [r14], rax
000000010f0aa929: 4883c418 add rsp, 0x18
000000010f0aa92d: c3 ret
You were concerned about the efficiency of 2bi in your book. Compare
this code that uses 2bi:
( scratchpad ) [ [ + ] [ * ] 2bi ] test-mr.
=== word: ( gensym ), label: ( gensym )
_label 0
_prologue T{ stack-frame { total-size 32 } }
_label 1
##peek RAX D 1
##peek RCX D 0
##inc-r 2
##replace RAX R 1
##replace RCX R 0
_label 2
##call +
_label 3
##peek RAX R 1
##peek RCX R 0
##inc-d 2
##inc-r -2
##replace RAX D 1
##replace RCX D 0
_label 4
_epilogue T{ stack-frame { total-size 32 } }
##jump *
_spill-area-size 0
With the same expression written using 2dup/-rot:
( scratchpad ) [ 2dup + -rot * ] test-mr.
=== word: ( gensym ), label: ( gensym )
_label 0
_prologue T{ stack-frame { total-size 32 } }
_label 1
##peek RAX D 1
##peek RCX D 0
##inc-d 2
##replace RAX D 1
##replace RCX D 0
_label 2
##call +
_label 3
##peek RAX D 2
##peek RCX D 1
##peek RDX D 0
##replace RDX D 2
##replace RAX D 1
##replace RCX D 0
_label 4
_epilogue T{ stack-frame { total-size 32 } }
##jump *
_spill-area-size 0
In the next example, there are no stack operations inside the loop at
all, despite the fact that 'times' is a library word implemented in
terms of dozens of quotation-constructing and quotation-consuming
words -- they all get optimized down into a loop by the compiler:
( scratchpad ) [ 1 100000 [ 17 + 29 * ] times >fixnum ] test-mr.
=== word: ( gensym ), label: ( gensym )
_label 0
_label 1
##load-immediate RAX 8
##load-immediate RCX 800000
##load-immediate RDX 0
##inc-d 3
_branch 3
_label 2
##add-imm RAX RAX 136
##mul-imm RAX RAX 29
##add-imm RDX RDX 8
_label 3
_compare-branch 2 RDX RCX cc<
_label 4
##inc-d -2
##replace RAX D 0
_label 5
##return
_spill-area-size 0
Here is the machine code for the above loop. The data stack pointer is
in r14 -- can you spot where it's used?
( scratchpad ) [ 1 100000 [ 17 + 29 * ] times >fixnum ] disassemble
000000010f0172a0: 48b80800000000000000 mov rax, 0x8
000000010f0172aa: 48b900350c0000000000 mov rcx, 0xc3500
000000010f0172b4: 4831d2 xor rdx, rdx
000000010f0172b7: 4983c618 add r14, 0x18
000000010f0172bb: e90f000000 jmp 0x10f0172cf
000000010f0172c0: 4881c088000000 add rax, 0x88
000000010f0172c7: 486bc01d imul rax, rax, 0x1d
000000010f0172cb: 4883c208 add rdx, 0x8
000000010f0172cf: 4839ca cmp rdx, rcx
000000010f0172d2: 0f8ce8ffffff jl dword 0x10f0172c0
000000010f0172d8: 4983ee10 sub r14, 0x10
000000010f0172dc: 498906 mov [r14], rax
000000010f0172df: c3 ret
These optimizations are actually quite basic compared to what Factor
does with floating point and SIMD code.
Factor's static stack checker is what actually enables most of the
interesting optimizations, in addition to making your code more
robust.
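For example, something like this (hypothetical word names, just a
sketch of how the checker behaves): a declared stack effect that
disagrees with what the compiler infers is rejected when the word is
compiled, instead of silently unbalancing the stack at runtime.

: sum-and-product ( a b -- s p ) [ + ] [ * ] 2bi ;  ! inferred effect matches
: broken ( a b -- s ) [ + ] [ * ] 2bi ;  ! leaves two values, declares one; rejected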
In general, I think you need to take a look at modern language
implementation techniques before concluding that high-level
abstractions make code slow. Factor's garbage collection makes
allocating a new object very cheap -- it's just a simple pointer
bump. There is no fragmentation and no searching or maintenance of
free lists because memory is compacted, and pointers are efficiently
updated, on each collection. The machine code for allocating objects
can be inlined in many cases, and it is only 6 instructions. This is
much faster than a call to malloc(), or the ANS Forth equivalent,
ALLOCATE.
Any object system you implement on top of a Forth interpreter will
always be slower than Factor's, which has static compiler
optimizations that inline methods where possible, and polymorphic
inline caches (PICs) for dynamic call sites. PICs generate customized
dispatch stubs for
method call sites which depend on what types show up at runtime. Doing
this without garbage collection of compiled code blocks would be such
a pain. The best you can get with Forth (and C++) is vtable method
dispatch, which has worse locality and uses indirect jumps rather than
direct jumps.
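To make that concrete, here is a sketch of the kind of code I mean
(the tuple classes and the 'area' generic are made up, not from any
library). Where the compiler can prove the type of the receiver, it
can inline the matching method outright; a call site that sees
several types at runtime gets a PIC stub instead of a vtable lookup:

TUPLE: square side ;
TUPLE: rectangle width height ;
GENERIC: area ( shape -- n )
M: square area side>> dup * ;
M: rectangle area [ width>> ] [ height>> ] bi * ;

A quotation like [ 3 square boa area ] should give the compiler
enough type information to inline the square method.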
> a Forth. Even if you like Joy, with its DIP and all that, this is immaterial
> to the main purpose of programming, which is to write hella-fast programs.
> That is what the customer wants.
DIP helps you write fast programs in less time with fewer bugs.
In Factor and Joy,
[ ... ] dip
is equivalent to
>r ... r>
in Forth. The main difference is that with >r/r>, if you don't balance
them properly within a single word, your program is going to crash
spectacularly because you jump to a random integer on the return
stack. In Factor, the worst that will happen is you mess up and forget
the closing ], in which case it's a syntax error.
The Forth programmer has this needless complication and source of
errors that is not present in Factor, because Factor formulates the
problem of 'hiding a value' in terms of a meaningful abstraction (the
'dip' combinator takes a quotation and runs it without the top of the
stack) instead of the implementation detail (the value is actually
moved to an auxiliary holding area that happens to also be used for
return addresses, then moved back before a return address is needed).
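To make the semantics concrete, a trivial sketch:

1 2 [ 10 + ] dip  ! leaves 11 2 -- the quotation runs underneath the 2

computes the same thing as Forth's

1 2 >r 10 + r>

but says 'run this code with the top of the stack hidden' instead of
spelling out the moves to and from the return stack.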
Factor converts a 'dip' into lower-level code under the hood at a very
early stage in the compiler:
( scratchpad ) [ [ 1 + ] dip ] optimized.
[ >R 1 + R> ]
These words cannot be called directly (optimized. doesn't
really output Factor code, but something that looks like a quotation
but is actually a different data structure in the compiler).
In Factor, you don't need to think about balancing two stacks, only
one, because the 'dip' combinator encapsulates the second stack in a
completely safe and general way.
Slava