Ok, the new bottleneck in my program is the function update-log-pr-xi:
(defun-inline log-sum-exp (v index-to)
  (let ((vMax (df-vmap-max v :index-to-ignore index-to)))
    (df+ vMax
         (log (df-vmap-sum #'(lambda (vK) (exp (- vK vMax)))
                           v
                           :index-to-ignore index-to)))))

(defun update-log-pr-xi (likelihood)
  (let ((nc (num-components likelihood)))
    (vmap (#'(lambda (dfa)
               (log-sum-exp dfa nc))
           :result-element-type df
           :default-argument-element-type df-vec)
          (Lij likelihood)
          :result (log-prob-of-xi likelihood))))
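
For reference, here is a rough standalone sketch of what log-sum-exp computes, written in plain Common Lisp without my macros. The name log-sum-exp-plain is made up for this example, and I am assuming that df means double-float, that the vector is a simple array of double-floats, and that n is at least 1. It is not the code my macros actually generate, only the same arithmetic spelled out.

```lisp
;; Hypothetical standalone version (not the macro-generated code):
;; computes log(sum_k exp(v[k])) over the first N elements of V.
;; Assumes N >= 1; with N = 0 the sum is zero and LOG is undefined.
(defun log-sum-exp-plain (v n)
  (declare (type (simple-array double-float (*)) v)
           (type fixnum n))
  (let ((vmax most-negative-double-float)
        (sum 0d0))
    (declare (type double-float vmax sum))
    ;; Pass 1: find the largest of the first N elements.
    (dotimes (i n)
      (setf vmax (max vmax (aref v i))))
    ;; Pass 2: accumulate exp(x - vmax). Every exponent is <= 0,
    ;; so EXP cannot overflow, and the largest term is exactly 1.
    (dotimes (i n)
      (incf sum (exp (- (aref v i) vmax))))
    ;; Add the maximum back in after taking the log.
    (+ vmax (log sum))))
```

The point of subtracting vMax first is that exp of a large log-likelihood would otherwise overflow. Elements at index n and beyond are never read.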
vmap, df-vmap-max, and df-vmap-sum are all macros that basically save
me from writing type declarations by hand. For instance,
* (macroexpand '(df-vmap-max v :index-to-ignore index-to))
(LET* ((#:INPUT2446 V)
       (#:LENGTH2444 INDEX-TO)
       (#:RESULT2443 MOST-NEGATIVE-DOUBLE-FLOAT))
  (DECLARE (TYPE DF #:RESULT2443) (TYPE (SIMPLE-ARRAY DF (*)) #:INPUT2446))
  (FIXTIMES (#:COUNTER2445 #:LENGTH2444)
    (SETF #:RESULT2443
          (FUNCALL #'MAX
                   #:RESULT2443
                   (FUNCALL #'MAX
                            (AREF #:INPUT2446 #:COUNTER2445)))))
  #:RESULT2443)
T
*
I have many other functions, more or less of this type, all of which
give no compiler notes and seem to compile into functions that do
almost no consing. For this function, I'm getting:
; In: DEFUN UPDATE-LOG-PR-XI
; (LOG-SUM-EXP DFA NC)
; --> BLOCK LET DF-VMAP-MAX DF-VMAP-COLLECT VMAP-COLLECT LET* FIXTIMES LET
; --> DOTIMES DO BLOCK LET TAGBODY SETF SETQ FUNCALL C::%FUNCALL MAX LET LET IF
; ==>
; #:OO-38
; Note: Doing float to pointer coercion (cost 13).
and the function conses a fair bit, and ends up taking up a large
fraction of the total time, much larger than its fraction of the
computation should warrant. Looking at the disassembly, I do believe
that log-sum-exp is being inlined (I'm not seeing any comment about a
function call). Comparing it to another function that "behaves well",
they both use floating point registers ST(0) through ST(4).
My only current guess is that my function wants to also use ST(5) and
the compiler won't let it because it's reserving 5 through 7 for its
own purposes. This doesn't seem that convincing, because this
function isn't really more complicated than others that don't have
this problem, but I guess it's possible. Does it seem reasonable that
the compiler would want to save 3 FP registers for itself? Any other
ideas?
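
One experiment I can think of: the note points inside the expansion of MAX (the LET LET IF chain and the #:OO-38 temporary), so I could try a max loop that uses an explicit comparison on declared double-floats instead of calling MAX. The sketch below is only that experiment, not a known fix. The name max-of-prefix is made up, and I am again assuming df is double-float. As a standalone function it still has to box its return value, so it would only tell me something once inlined in place of df-vmap-max.

```lisp
;; Hypothetical experiment: same reduction as the df-vmap-max expansion,
;; but with an explicit comparison instead of (FUNCALL #'MAX ...).
;; Returns MOST-NEGATIVE-DOUBLE-FLOAT when N is 0.
(defun max-of-prefix (v n)
  (declare (type (simple-array double-float (*)) v)
           (type fixnum n))
  (let ((result most-negative-double-float))
    (declare (type double-float result))
    (dotimes (i n result)
      ;; X is a typed local, so the comparison is between two
      ;; declared double-floats.
      (let ((x (aref v i)))
        (declare (type double-float x))
        (when (> x result)
          (setf result x))))))
```

If the note disappears with this version, that would suggest the problem is in how MAX expands rather than in register pressure.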
Cheers,
rif