There was a discussion on fast tristate logic here recently.
Thanks for the tip. It uses one of the implementations I tested too: a LUT packed into a shift register. However, in my tests I think the regular lookup table came out faster on x86 (the lookup takes one instruction):
e.g. AND:
static ubyte[16] lut = [0,0,0,0, 0,1,2,2, 0,2,2,2, 0,2,2,2];
value = lut[x*4+y]; // turn off bounds checks
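
For comparison, here is a rough sketch of both variants side by side, assuming D (which the ubyte[16] syntax suggests) and assuming the encoding implied by the table: 0 = false, 1 = true, 2 = unknown, with 3 unused and treated as unknown. The names andLut, triAnd, pack, andPacked and triAndShift are made up for illustration, and this is not meant as a benchmark.

```d
// Assumed encoding, inferred from the table: 0 = false, 1 = true, 2 = unknown.
// The value 3 is unused and maps to unknown.
static immutable ubyte[16] andLut = [0,0,0,0, 0,1,2,2, 0,2,2,2, 0,2,2,2];

// Regular lookup table: a single indexed byte load once the index is formed.
// The caller must guarantee x and y are in 0..3.
ubyte triAnd(uint x, uint y) pure nothrow @nogc
{
    return andLut[x * 4 + y];
}

// Hypothetical helper: packs the 16 two-bit entries into one 32-bit word,
// entry i in bits 2*i and 2*i+1. Evaluated at compile time below.
uint pack(const ubyte[16] table) pure nothrow @nogc
{
    uint word = 0;
    foreach (i; 0 .. 16)
        word |= cast(uint) table[i] << (2 * i);
    return word;
}

enum uint andPacked = pack(andLut);

// "LUT in a shift register": shift the packed word and mask out two bits.
ubyte triAndShift(uint x, uint y) pure nothrow @nogc
{
    return cast(ubyte)((andPacked >> (2 * (x * 4 + y))) & 3);
}
```

Both functions return the same result for every x and y in 0..3, so which one is faster comes down to a load versus a shift and mask on the target CPU. With DMD, bounds checks on the array version can be disabled with -boundscheck=off.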
