Hi, I'm interested in execution performance.
Modern hardware (implementing Intel 64 or ARMv8) may well be able to
predict a long chain of jumps [1]. But prediction accuracy for indirect
jumps is lower, especially for indirect jumps whose target address is
computed dynamically.

By the way, Ryan's example code could be made faster with the following
optimization. (If c3aX is the hottest path, it is reached without any
taken branch.)

    // Skip to the chase if it's already evaluated:
    start:
      // if (R2 & 7 != 0) goto fastpath; else goto slowpath;
      if (R2 & 7 == 0) goto slowpath;   // *** (1) remove branch for fastpath

    fastpath: // Formerly c3aO          // *** (1) move fastpath here
      // if (R1 & 7 >= 2) goto c3aW; else goto c3aX;
      if (R1 & 7 >= 2) goto c3aW;       // *** (2) remove branch for prior path (c3aX)

    c3aX:                               // *** (2) move else path here (without branch)
      R1 = PicBaseReg + lvl_r39S_closure;
      call (I64[R1])(R1) args: 8, res: 0, upd: 8;   // *** indirect jump, but fixed address (100% hit)

    c3aW:
      R1 = P64[R1 + 6] & (-8);
      call (I64[R1])(R1) args: 8, res: 0, upd: 8;   // *** indirect jump, dynamic address (hit or miss)

    //c3aX:
    //  R1 = PicBaseReg + lvl_r39S_closure;
    //  call (I64[R1])(R1) args: 8, res: 0, upd: 8;

    slowpath: // Formerly c3aY
      if ((Sp + -8) < SpLim) goto c3aZ; else goto c3b0;
    c3aZ:
      // nop
      R1 = PicBaseReg + foo_closure;
      call (I64[BaseReg - 8])(R2, R1) args: 8, res: 0, upd: 8;
    c3b0:
      I64[Sp - 8] = PicBaseReg + block_c3aO_info;
      R1 = R2;
      Sp = Sp - 8;

      call (I64[R1])(R1) returns to fastpath, args: 8, res: 8, upd: 8;
      // Sp bump moved to here so it's separate from "fastpath"
      Sp = Sp + 8;
      goto fastpath;                    // ***

    //fastpath: // Formerly c3aO
    //  if (R1 & 7 >= 2) goto c3aW; else goto c3aX;
    //c3aW:
    //  R1 = P64[R1 + 6] & (-8);
    //  call (I64[R1])(R1) args: 8, res: 0, upd: 8;
    //c3aX:
    //  R1 = PicBaseReg + lvl_r39S_closure;
    //  call (I64[R1])(R1) args: 8, res: 0, upd: 8;

[1]: Intel 64 and IA-32 Architectures Optimization Reference Manual
     http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
     3.4 OPTIMIZING THE FRONT END
     2.3.2.3 Branch Prediction

I'm just studying lazy evaluation and drawing illustrations of it.
This thread is helpful to me :)

Regards,
Takenobu

2015-10-25 5:53 GMT+09:00 Carter Schonwald <carter.schonw...@gmail.com>:

> Doesn't modern hardware have pretty good branch prediction? In which case
> the order of the branches may not matter unless it's a long chain of calls?
> Vs say an inner loop that hasn't been inlined?
>
> Either way, I'd love be stay in the loop on this topic, for work I'm
> building a strongly normalizing language that supports both strict and call
> by need evaluation strategies.
>
>
> On Friday, October 23, 2015, Ryan Newton <rrnew...@gmail.com> wrote:
>
>>
>>> 1. Small tweaks: The CMM code above seems to be *betting* than the
>>> thunk is unevaluated, because it does the stack check and stack write
>>> *before* the predicate test that checks if the thunk is evaluated (if
>>> (R1 & 7 != 0) goto c3aO; else goto c3aP;). With a bang-pattern
>>> function, couldn't it make the opposite bet? That is, branch on whether
>>> the thunk is evaluated first, and then the wasted computation is only a
>>> single correctly predicted branch (and a read of a tag that we need to
>>> read anyway).
>>>
>>> Oh, a small further addition would be needed for this tweak. In the
>> generated code above "Sp = Sp + 8;" happens *late*, but I think it could
>> happen right after the call to the thunk. In general, does it seem
>> feasible to separate the slowpath from fastpath as in the following tweak
>> of the example CMM?
>>
>>
>>   // Skip to the chase if it's already evaluated:
>>   start:
>>     if (R2 & 7 != 0) goto fastpath; else goto slowpath;
>>
>>   slowpath: // Formerly c3aY
>>     if ((Sp + -8) < SpLim) goto c3aZ; else goto c3b0;
>>   c3aZ:
>>     // nop
>>     R1 = PicBaseReg + foo_closure;
>>     call (I64[BaseReg - 8])(R2, R1) args: 8, res: 0, upd: 8;
>>   c3b0:
>>     I64[Sp - 8] = PicBaseReg + block_c3aO_info;
>>     R1 = R2;
>>     Sp = Sp - 8;
>>
>>     call (I64[R1])(R1) returns to fastpath, args: 8, res: 8, upd: 8;
>>     // Sp bump moved to here so it's separate from "fastpath"
>>     Sp = Sp + 8;
>>
>>   fastpath: // Formerly c3aO
>>     if (R1 & 7 >= 2) goto c3aW; else goto c3aX;
>>   c3aW:
>>     R1 = P64[R1 + 6] & (-8);
>>     call (I64[R1])(R1) args: 8, res: 0, upd: 8;
>>   c3aX:
>>     R1 = PicBaseReg + lvl_r39S_closure;
>>     call (I64[R1])(R1) args: 8, res: 0, upd: 8;
>>
>
> _______________________________________________
> ghc-devs mailing list
> ghc-devs@haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
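As a rough illustration of the layout idea in this thread (test the tag
first, keep the expected path as fall-through code), here is a small C
sketch. It is not GHC's code generator and not real Cmm. The names
enter, force_thunk, TARGET_C3AX and TARGET_C3AW are hypothetical,
chosen only to mirror the labels above. __builtin_expect is a GCC/Clang
builtin and is only a hint: the compiler may still lay out the blocks
differently. Note also that C needs the extra parentheses around the
mask test, because in C "x & 7 == 0" parses as "x & (7 == 0)".

```c
#include <stdint.h>

/* Branch hint: GCC/Clang builtin, plain expression elsewhere. */
#if defined(__GNUC__) || defined(__clang__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

/* Low three bits of a closure word hold the pointer tag (assumption
   matching the "& 7" tests in the Cmm above). */
enum { TAG_MASK = 7 };

/* Hypothetical result codes naming the Cmm block that would be reached. */
enum target { TARGET_C3AX = 1, TARGET_C3AW = 2 };

/* Hypothetical stand-in for entering an unevaluated closure: it simply
   returns the same word with the given constructor tag set. */
static uintptr_t force_thunk(uintptr_t closure, unsigned tag_after_eval)
{
    return (closure & ~(uintptr_t)TAG_MASK) | tag_after_eval;
}

/* Mirrors the reordered control flow: the tag test comes first, and the
   paths assumed to be hot are written as fall-through code. */
static enum target enter(uintptr_t r2, unsigned tag_after_eval)
{
    uintptr_t r1 = r2;

    if (UNLIKELY((r2 & TAG_MASK) == 0)) {
        /* slowpath: only untagged (unevaluated) closures pay for this */
        r1 = force_thunk(r2, tag_after_eval);
    }

    /* fastpath: tag 1 falls through to c3aX, tags >= 2 branch to c3aW */
    if (UNLIKELY((r1 & TAG_MASK) >= 2)) {
        return TARGET_C3AW;
    }
    return TARGET_C3AX;
}
```

For example, enter(0x1001, 0) takes no slow path and falls through to
the c3aX case, while enter(0x1000, 2) goes through force_thunk first and
then reaches the c3aW case. Whether the hints help in practice would
have to be measured on real hardware.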