Type conversion from uint8 to uint32 (or uint64) is effectively free because general-purpose registers are 32 or 64 bits wide anyway.
If anything, working directly in uint32 or uint64 is slightly cheaper: at a low level, loading an 8-bit value requires movzb (movzx in Intel syntax: move to register and zero-extend the byte), which can have slightly higher latency than the plain mov used for uint32 and uint64. The issue in your convolutions is a memory bottleneck, not compute.
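
To make the point concrete, here is a minimal C sketch (not taken from your code; `sum_u8` is a hypothetical helper name) that accumulates uint8 values into a uint32. On x86 a compiler will typically fold the cast into the load itself as a zero-extending move, so there is no separate conversion step to pay for.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sums n 8-bit values into a 32-bit accumulator. */
uint32_t sum_u8(const uint8_t *src, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        /* The widening cast is usually part of the load (a zero-extending
           move on x86), not an extra instruction. */
        acc += (uint32_t)src[i];
    }
    return acc;
}
```

If you want to check what your own compiler does, compiling something like this with `-O2 -S` and reading the assembly is a quick way to see it. Either way, the loop's cost is dominated by reading `src` from memory, not by the conversion.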
