I've pushed the change that makes `assoc', `assv', and `assq' implemented in Racket, including the new optional argument for `assoc'.
At Fri, 22 Apr 2011 19:52:34 -0400, Eli Barzilay wrote: > 7 minutes ago, Matthew Flatt wrote: > > > > Switching to a Racket implementation costs time, but (with the JIT) > > it's not a huge difference --- similar to the cost of switching the C > > code from 32-bit to 64-bit mode on my machine. The JIT even manages to > > make `assq' slightly faster than C in 64-bit mode. > > Is there some obvious reason for the huge difference in improvement > between the 32 and the 64 bits (Almost twice slower and roughly the > same resp.)? Like Noel, I thought at first it might be memory-bound. Now I think it's just that the compiler on my machine (or the way the compiler is configured) handles x86_64 not as well as x86. > > The cost for a non-JIT platform is why I hadn't changed `assq', > > `assv', and `assoc' before. But a plain-Racket implementation is > > surely better in the long run, and we can keep the C code and bind > > functions conditionally when the JIT is disabled. Maybe we can > > eventually close the gap between the JIT-generated code and C. Also, > > I think the JIT could be smarter about `equal?', which could tilt > > the balance in favor of the JIT for `assoc'. > > (Maybe it'll be feasible to dump jit-generated machine code at some > point...?) Just so you can see the generated code? I think Sam's tool is on the right track. Otherwise, I'm not sure what you're suggesting. > > I'm happy that the compiler is doing its job, but inlining > > optimizations are fragile, so I'm a little reluctant to rely on > > them. Sad but true: the macro approach feels more comfortable for > > now. > > If by "macro approach" you refer to what I was talking about, then > isn't that needed to have versions with inlined comparators? Every time I look into changing `map' or other functions into macros, I conclude that it would be better to implement cross-module inlining. At Fri, 22 Apr 2011 19:27:49 -0500, Robby Findler wrote: > What are the non-JIT platforms nowadays? Sparc and ARM, probably. I expect Sparc to fade away, and I imagine adding ARM support to the JIT, eventually. Currently, the Racket implementation of `assq', etc., is used always. I've thought about ways to fall back to the C implementation when the JIT is unavailable, but I haven't settled on anything yet. Below are revised results for the performance test. The old C-implemented `assq' is now `kernel:assq', and the Racket-implemented `assq2' is now `assq'. The improvements are from adjusting the Racket code slightly and improving the JIT in various small ways. There's more room for improvement, obviously, and I stopped only because I've had enough for now. For those who are *really* interested, I've included annotated, disassembled JIT output at the end of this message. The JIT-generated core loop and the gcc-generated core loop are very similar; there's four times as much JIT-generated code because it unrolls the loop to four copies. (Both versions test two elements of the list in each iteration, because the "turtle" to detect cycles is incremented only with the second test.) ---------------------------------------- 32-bit run: 'kern:assq cpu time: 23 real time: 23 gc time: 0 cpu time: 38 real time: 39 gc time: 0 cpu time: 39 real time: 38 gc time: 0 'assq cpu time: 38 real time: 39 gc time: 0 cpu time: 60 real time: 61 gc time: 0 cpu time: 61 real time: 61 gc time: 0 'kern:assoc cpu time: 60 real time: 60 gc time: 0 cpu time: 563 real time: 567 gc time: 0 cpu time: 554 real time: 555 gc time: 0 'assoc cpu time: 74 real time: 74 gc time: 0 cpu time: 589 real time: 592 gc time: 0 cpu time: 582 real time: 584 gc time: 0 ---------------------------------------- 64-bit run: 'kern:assq cpu time: 36 real time: 36 gc time: 0 cpu time: 117 real time: 117 gc time: 0 cpu time: 118 real time: 118 gc time: 0 'assq cpu time: 39 real time: 39 gc time: 0 cpu time: 66 real time: 65 gc time: 0 cpu time: 65 real time: 65 gc time: 0 'kern:assoc cpu time: 81 real time: 81 gc time: 0 cpu time: 860 real time: 860 gc time: 0 cpu time: 798 real time: 798 gc time: 0 'assoc cpu time: 83 real time: 83 gc time: 0 cpu time: 782 real time: 789 gc time: 0 cpu time: 738 real time: 739 gc time: 0 ---------------------------------------- The main `assq' loop. Arguments: list, turtle; environment: value to find, original list. 0x0676707a: mov 0xc(%ebx),%eax [get list off stack] 0x0676707d: test $0x1,%al 0x06767080: jne 0x676756b [to fail: fixnum instead of list] 0x06767086: movswl (%eax),%ecx 0x06767089: cmp $0x34,%ecx 0x0676708f: jne 0x676756b [to fail: not a pair] 0x06767095: mov 0x4(%eax),%eax [car => item] 0x06767098: mov %eax,-0x4(%ebx) [save item on stack] 0x0676709b: add $0xfffffffc,%ebx [adjust stack pointer] 0x0676709e: test $0x1,%al 0x067670a1: jne 0x676752a [to fail: item is fixnum] 0x067670a7: movswl (%eax),%ecx 0x067670aa: cmp $0x34,%ecx 0x067670b0: jne 0x676752a [to fail: item is not a pair] 0x067670b6: mov 0x4(%eax),%eax [car of item] 0x067670b9: mov %eax,%ecx 0x067670bb: mov 0x8(%ebx),%eax [get key to find off stack] 0x067670be: cmp %ecx,%eax [found key?] 0x067670c0: jne 0x67670cd [jump if not...] 0x067670c6: mov (%ebx),%eax [yes... get item from stack] 0x067670c8: jmp 0x6767525 [and return item] 0x067670cd: mov 0x10(%ebx),%eax [no... get list from stack] 0x067670d0: mov 0x8(%eax),%eax [cdr of list] 0x067670d3: mov %eax,-0x4(%ebx) [save new list to stack] 0x067670d6: add $0xfffffffc,%ebx [adjust stack pointer] 0x067670d9: mov 0x18(%ebx),%ecx [get turtle from stack] 0x067670dc: cmp %ecx,%eax [(eq? turtle list)] 0x067670de: jne 0x6767111 [normally jumps...] 0x067670e4: mov $0x67675c0,%eax [fail: found cycle...] 0x067670e9: mov (%eax),%eax [... call error function ...] 0x067670eb: mov %eax,-0xc(%ebx) 0x067670ee: mov $0x67675c4,%eax 0x067670f3: mov (%eax),%eax 0x067670f5: mov %eax,-0x8(%ebx) 0x067670f8: mov 0x10(%ebx),%eax 0x067670fb: mov %eax,-0x4(%ebx) 0x067670fe: mov $0x34540,%esi 0x06767103: add $0xfffffff4,%ebx 0x06767106: mov %ebx,0x44c(%edi) 0x0676710c: jmp 0x376c90 0x06767111: mov (%ebx),%eax [get list from stack... (redundant!)] 0x06767113: test $0x1,%al [... starts another iteration of above] 0x06767116: jne 0x67674e0 [... and there are 4 copies total] 0x0676711c: movswl (%eax),%ecx 0x0676711f: cmp $0x34,%ecx 0x06767125: jne 0x67674e0 0x0676712b: mov 0x4(%eax),%eax 0x0676712e: mov %eax,-0x4(%ebx) 0x06767131: add $0xfffffffc,%ebx 0x06767134: test $0x1,%al 0x06767137: jne 0x676749f 0x0676713d: movswl (%eax),%ecx 0x06767140: cmp $0x34,%ecx 0x06767146: jne 0x676749f 0x0676714c: mov 0x4(%eax),%eax 0x0676714f: mov %eax,%ecx 0x06767151: mov 0x10(%ebx),%eax 0x06767154: cmp %ecx,%eax 0x06767156: jne 0x6767163 0x0676715c: mov (%ebx),%eax 0x0676715e: jmp 0x676749a 0x06767163: mov 0x1c(%ebx),%eax 0x06767166: mov 0x8(%eax),%eax 0x06767169: mov %eax,-0x4(%ebx) 0x0676716c: mov 0x4(%ebx),%eax 0x0676716f: mov 0x8(%eax),%eax 0x06767172: mov %eax,-0x8(%ebx) 0x06767175: add $0xfffffff8,%ebx 0x06767178: mov 0x4(%ebx),%ecx 0x0676717b: cmp %ecx,%eax 0x0676717d: jne 0x67671b0 0x06767183: mov $0x67675c8,%eax 0x06767188: mov (%eax),%eax 0x0676718a: mov %eax,-0xc(%ebx) 0x0676718d: mov $0x67675cc,%eax 0x06767192: mov (%eax),%eax 0x06767194: mov %eax,-0x8(%ebx) 0x06767197: mov 0x1c(%ebx),%eax 0x0676719a: mov %eax,-0x4(%ebx) 0x0676719d: mov $0x34540,%esi 0x067671a2: add $0xfffffff4,%ebx 0x067671a5: mov %ebx,0x44c(%edi) 0x067671ab: jmp 0x376c90 0x067671b0: mov (%ebx),%eax [get list from stack... (redundant!)] 0x067671b2: test $0x1,%al 0x067671b5: jne 0x6767455 0x067671bb: movswl (%eax),%ecx 0x067671be: cmp $0x34,%ecx 0x067671c4: jne 0x6767455 0x067671ca: mov 0x4(%eax),%eax 0x067671cd: mov %eax,-0x4(%ebx) 0x067671d0: add $0xfffffffc,%ebx 0x067671d3: test $0x1,%al 0x067671d6: jne 0x6767414 0x067671dc: movswl (%eax),%ecx 0x067671df: cmp $0x34,%ecx 0x067671e5: jne 0x6767414 0x067671eb: mov 0x4(%eax),%eax 0x067671ee: mov %eax,%ecx 0x067671f0: mov 0x1c(%ebx),%eax 0x067671f3: cmp %ecx,%eax 0x067671f5: jne 0x6767202 0x067671fb: mov (%ebx),%eax 0x067671fd: jmp 0x676740f 0x06767202: mov 0x4(%ebx),%eax 0x06767205: mov 0x8(%eax),%eax 0x06767208: mov %eax,-0x4(%ebx) 0x0676720b: add $0xfffffffc,%ebx 0x0676720e: mov 0xc(%ebx),%ecx 0x06767211: cmp %ecx,%eax 0x06767213: jne 0x6767246 0x06767219: mov $0x67675d0,%eax 0x0676721e: mov (%eax),%eax 0x06767220: mov %eax,-0xc(%ebx) 0x06767223: mov $0x67675d4,%eax 0x06767228: mov (%eax),%eax 0x0676722a: mov %eax,-0x8(%ebx) 0x0676722d: mov 0x24(%ebx),%eax 0x06767230: mov %eax,-0x4(%ebx) 0x06767233: mov $0x34540,%esi 0x06767238: add $0xfffffff4,%ebx 0x0676723b: mov %ebx,0x44c(%edi) 0x06767241: jmp 0x376c90 0x06767246: mov (%ebx),%eax [get list from stack... (redundant!)] 0x06767248: test $0x1,%al 0x0676724b: jne 0x67673ca 0x06767251: movswl (%eax),%ecx 0x06767254: cmp $0x34,%ecx 0x0676725a: jne 0x67673ca 0x06767260: mov 0x4(%eax),%eax 0x06767263: mov %eax,-0x4(%ebx) 0x06767266: add $0xfffffffc,%ebx 0x06767269: test $0x1,%al 0x0676726c: jne 0x6767389 0x06767272: movswl (%eax),%ecx 0x06767275: cmp $0x34,%ecx 0x0676727b: jne 0x6767389 0x06767281: mov 0x4(%eax),%eax 0x06767284: mov %eax,%ecx 0x06767286: mov 0x24(%ebx),%eax 0x06767289: cmp %ecx,%eax 0x0676728b: jne 0x6767298 0x06767291: mov (%ebx),%eax 0x06767293: jmp 0x6767384 0x06767298: mov 0x10(%ebx),%eax 0x0676729b: mov 0x8(%eax),%eax 0x0676729e: mov %eax,-0x4(%ebx) 0x067672a1: mov 0x4(%ebx),%eax 0x067672a4: mov 0x8(%eax),%eax 0x067672a7: mov %eax,-0x8(%ebx) 0x067672aa: add $0xfffffff8,%ebx 0x067672ad: mov 0x4(%ebx),%ecx 0x067672b0: cmp %ecx,%eax 0x067672b2: jne 0x67672e5 0x067672b8: mov $0x67675d8,%eax 0x067672bd: mov (%eax),%eax 0x067672bf: mov %eax,-0xc(%ebx) 0x067672c2: mov $0x67675dc,%eax 0x067672c7: mov (%eax),%eax 0x067672c9: mov %eax,-0x8(%ebx) 0x067672cc: mov 0x30(%ebx),%eax 0x067672cf: mov %eax,-0x4(%ebx) 0x067672d2: mov $0x34540,%esi 0x067672d7: add $0xfffffff4,%ebx 0x067672da: mov %ebx,0x44c(%edi) 0x067672e0: jmp 0x376c90 0x067672e5: mov (%ebx),%eax [get list from stack... (redundant!)] 0x067672e7: mov %eax,-0x8(%ebx) [... save it(!) to prep recursive call] 0x067672ea: mov 0x4(%ebx),%eax [get turtle from stack] 0x067672ed: add $0xfffffff8,%ebx [adjust stack pointer] 0x067672f0: lea 0x64(%edi),%edx [check whether other threads needs to run...] 0x067672f3: mov (%edx),%edx 0x067672f5: cmp $0x0,%edx 0x067672fb: jle 0x6767312 [to slow: for thread swap] 0x067672fd: mov -0x1c(%ebp),%edx [argv = end of orig argv (tail call)] 0x06767300: add $0xffffffec,%edx [... minus orig argc for args in-place] 0x06767303: mov %eax,0x10(%edx) [set turtle argument for recur] 0x06767306: mov (%ebx),%eax [load list back] 0x06767308: mov %eax,0xc(%edx) [... and set list argument for recur] 0x0676730b: mov %edx,%ebx [set Racket stack to argv] 0x0676730d: jmp 0x676707a [recur] ---------------------------------------- For comparsion, the gcc-compiled `assq' function's loop: 0x008c6480 <assq+112>: mov -0x20(%ebp),%edx [get l off stack] 0x008c6483 <assq+115>: test $0x1,%dl 0x008c6486 <assq+118>: jne 0x8c6523 <assq+275> [to fail: l is fixnum] 0x008c648c <assq+124>: cmpw $0x34,(%edx) 0x008c6490 <assq+128>: jne 0x8c6523 <assq+275> [to fail: not a pair] 0x008c6496 <assq+134>: mov 0x4(%edx),%ecx [item = car of l] 0x008c6499 <assq+137>: mov %ecx,-0x1c(%ebp) [save item on stack] 0x008c649c <assq+140>: test $0x1,%cl 0x008c649f <assq+143>: jne 0x8c6588 <assq+376> [to fail: item is fixnum] 0x008c64a5 <assq+149>: cmpw $0x34,(%ecx) 0x008c64a9 <assq+153>: jne 0x8c6588 <assq+376> [to fail: item is not pair] 0x008c64af <assq+159>: mov 0xc(%ebp),%esi [get x arg off stack] 0x008c64b2 <assq+162>: mov (%esi),%eax [found x?] 0x008c64b4 <assq+164>: cmp 0x4(%ecx),%eax 0x008c64b7 <assq+167>: je 0x8c6648 <scheme_get_thread_local_variables> 0x008c64bd <assq+173>: mov 0x8(%edx),%edx [no, so get cdr of l] 0x008c64c0 <assq+176>: mov %edx,-0x20(%ebp) 0x008c64c3 <assq+179>: test $0x1,%dl 0x008c64c6 <assq+182>: jne 0x8c6480 <assq+112> [to fail: cdr is fixnum] 0x008c64c8 <assq+184>: cmpw $0x34,(%edx) 0x008c64cc <assq+188>: jne 0x8c6480 <assq+112> [to fail: cdr is not a pair] 0x008c64ce <assq+190>: mov 0x4(%edx),%ecx [item = car l] 0x008c64d1 <assq+193>: mov %ecx,-0x1c(%ebp) 0x008c64d4 <assq+196>: test $0x1,%cl 0x008c64d7 <assq+199>: jne 0x8c6480 <assq+112> [fail: item is a fixnum] 0x008c64d9 <assq+201>: cmpw $0x34,(%ecx) 0x008c64dd <assq+205>: jne 0x8c6480 <assq+112> [fail: item is not a pair] 0x008c64df <assq+207>: mov (%esi),%eax [get x off stack] 0x008c64e1 <assq+209>: cmp 0x4(%ecx),%eax [car of item == x ?] 0x008c64e4 <assq+212>: je 0x8c6648 <scheme_get_thread_local_variables> 0x008c64ea <assq+218>: mov 0x8(%edx),%eax 0x008c64ed <assq+221>: mov %eax,-0x20(%ebp) 0x008c64f0 <assq+224>: mov -0x24(%ebp),%edx [get turtle off stack] 0x008c64f3 <assq+227>: cmp %edx,%eax [is l == turtle?] 0x008c64f5 <assq+229>: je 0x8c6523 <assq+275> [to fail: cycle] 0x008c64f7 <assq+231>: mov 0x8(%edx),%eax [cdr of turtle] 0x008c64fa <assq+234>: mov %eax,-0x24(%ebp) [save new turtle to stack] 0x008c64fd <scheme_get_thread_local_variables+0>: mov (%edi),%eax 0x008c64ff <scheme_get_thread_local_variables+2>: mov %gs:0x48(,%eax,4),%eax 0x008c6507 <assq+247>: mov 0x68(%eax),%eax 0x008c650a <assq+250>: test %eax,%eax [other threads need to run?] 0x008c650c <assq+252>: jg 0x8c6480 <assq+112> [to slow: other threads run] 0x008c6512 <assq+258>: movl $0x4,-0x50(%ebp) 0x008c6519 <assq+265>: call 0x9e44a0 <scheme_out_of_fuel> 0x008c651e <assq+270>: jmp 0x8c6480 <assq+112> [recur] ---------------------------------------- The `assq' wrapper that jumps to the above loop. The argument juggling here --- turning the arg sequence `x l' to `l l l x' --- may be part of the slowdown. 0x0037e41c: cmp $0x2,%ecx [arity check] 0x0037e422: jne 0x372e80 0x0037e428: mov %eax,-0x4(%ebx) [save closure (but no GC possible!)] 0x0037e42b: add $0xfffffffc,%ebx 0x0037e42e: mov 0x4(%ebx),%eax [get x arg] 0x0037e431: mov %eax,-0x10(%ebx) [prep x arg to `loop'] 0x0037e434: mov 0x8(%ebx),%eax [get l arg] 0x0037e437: mov %eax,-0xc(%ebx) [prep l arg to `loop'] 0x0037e43a: mov %eax,-0x8(%ebx) [prep orig-l arg to `loop'] 0x0037e43d: mov %eax,-0x4(%ebx) [prep turtle arg to `loop'] 0x0037e440: mov $0x37e47c,%esi 0x0037e445: mov (%esi),%esi [esi now has `loop' closure pointer] 0x0037e447: add $0xfffffff0,%ebx [update Racket stack register] 0x0037e44a: mov -0x1c(%ebp),%edx [get Racket stack base for tail calls] 0x0037e44d: add $0xfffffff0,%edx [4 argument => subtract 16] 0x0037e450: mov 0xc(%ebx),%ecx [move arg1 into place...] 0x0037e453: mov %ecx,0xc(%edx) 0x0037e456: mov 0x8(%ebx),%ecx 0x0037e459: mov %ecx,0x8(%edx) 0x0037e45c: mov 0x4(%ebx),%ecx 0x0037e45f: mov %ecx,0x4(%edx) 0x0037e462: mov (%ebx),%ecx 0x0037e464: mov %ecx,(%edx) 0x0037e466: mov %edx,%ebx [runstack = argv] 0x0037e468: mov %esi,%eax 0x0037e46a: mov $0x4,%ecx [argc = 4] 0x0037e46f: mov %ebx,%edx [argv = runstack (redundant!)] 0x0037e471: jmp 0x0676707a ---------------------------------------- Then again, here's a comparable section of the gcc-compiled `assq' function's preamble that cooperates with the GC and sets up the loop. 0x008c641f <scheme_get_thread_local_variables+0>: mov 0x176c06(%ebx),%edi 0x008c6425 <scheme_get_thread_local_variables+6>: mov (%edi),%eax 0x008c6427 <scheme_get_thread_local_variables+8>: mov %gs:0x48(,%eax,4),%edx 0x008c642f <assq+31>: mov 0x4(%edx),%edx 0x008c6432 <assq+34>: mov %edx,-0x54(%ebp) 0x008c6435 <scheme_get_thread_local_variables+0>: mov %gs:0x48(,%eax,4),%eax 0x008c643d <assq+45>: lea -0x54(%ebp),%edx 0x008c6440 <assq+48>: mov %edx,0x4(%eax) 0x008c6443 <assq+51>: lea 0xc(%ebp),%eax 0x008c6446 <assq+54>: mov %eax,-0x4c(%ebp) 0x008c6449 <assq+57>: lea -0x1c(%ebp),%eax 0x008c644c <assq+60>: mov %eax,-0x48(%ebp) 0x008c644f <assq+63>: lea -0x24(%ebp),%eax 0x008c6452 <assq+66>: mov %eax,-0x44(%ebp) 0x008c6455 <assq+69>: lea -0x20(%ebp),%eax 0x008c6458 <assq+72>: mov %eax,-0x40(%ebp) 0x008c645b <assq+75>: movl $0x0,-0x1c(%ebp) 0x008c6462 <assq+82>: movl $0x0,-0x20(%ebp) 0x008c6469 <assq+89>: movl $0x0,-0x24(%ebp) 0x008c6470 <assq+96>: mov 0xc(%ebp),%eax 0x008c6473 <assq+99>: mov 0x4(%eax),%eax 0x008c6476 <assq+102>: mov %eax,-0x24(%ebp) 0x008c6479 <assq+105>: mov %eax,-0x20(%ebp) 0x008c647c <assq+108>: nopl 0x0(%eax) Note the NOP, which aligns the gcc-generated loop start. The JIT doesn't try to align loop starts, because past experiments suggested that it doesn't matter. That's worth another look. _________________________________________________ For list-related administrative tasks: http://lists.racket-lang.org/listinfo/dev