* Based on TVM 0.18, the generated code contains **vfmadd213ps** (**fused multiply-add**) instructions in the disassembly of `vector_mul_add_compute_`, so TVM can perform this kind of fusion for a custom operator defined with TE.

```
00000000000016e0 <vector_mul_add_compute_>:
...
    174e: 62 72 5d 48 a8 21       vfmadd213ps (%rcx),%zmm4,%zmm12
    1754: 62 72 55 48 a8 69 01    vfmadd213ps 0x40(%rcx),%zmm5,%zmm13
...
```
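To check for FMA without running `objdump` by hand, the assembly can also be dumped from the LLVM-backed module returned by `tvm.build`. A minimal sketch, assuming the `func` object from the test script below:

```python
# Dump the assembly generated for the built module and grep for FMA.
# Assumes `func` is the tvm.runtime.Module returned by tvm.build below.
asm = func.get_source("asm")  # available for LLVM-backed host modules
fma_lines = [ln for ln in asm.splitlines() if "vfmadd" in ln]
print("\n".join(fma_lines) if fma_lines else "no FMA instructions found")
```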
* Test script for TVM (`vector_matmul_add5.py`):

```python
import tvm
from tvm import te
import numpy as np


def vector_mul_add(n, dtype="float32"):
    """
    Element-wise vector fused multiply-add (FMA):
    result[i] = a[i] * b[i] + c[i] for i in 0..n-1
    """
    a = te.placeholder((n,), dtype=dtype, name="a")  # Input vector a
    b = te.placeholder((n,), dtype=dtype, name="b")  # Input vector b
    c = te.placeholder((n,), dtype=dtype, name="c")  # Input vector c

    # Element-wise multiplication: mul[i] = a[i] * b[i]
    mul = te.compute((n,), lambda i: a[i] * b[i], name="mul")

    # Element-wise addition: result[i] = mul[i] + c[i]
    result = te.compute((n,), lambda i: mul[i] + c[i], name="result")

    return [a, b, c, mul, result]


# Main program
n = 128  # Vector length (a multiple of 16 fully utilizes AVX512 lanes)
a, b, c, mul, result = vector_mul_add(n)  # Unpack tensors from the function

s = te.create_schedule(result.op)  # Create a schedule for the computation graph

# Get the operation object for scheduling
mul_op = mul.op

# Apply vectorization so the backend can emit AVX512 FMA instructions:
# maps to 16-wide float32 operations (vfmadd213ps in the disassembly above)
s[mul_op].vectorize(mul_op.axis[0])     # Vectorize the multiplication
s[result].vectorize(result.op.axis[0])  # Vectorize the addition

# Compile with AVX512-FMA support for the Skylake-AVX512 architecture.
# -mcpu=skylake-avx512 enables:
#   * 512-bit vector registers (ZMM0-ZMM31)
#   * FMA3 instructions for fused multiply-add
#   * 16 single-precision floats processed per instruction
target = "llvm -mcpu=skylake-avx512"
with tvm.transform.PassContext(opt_level=3):
    func = tvm.build(s, [a, b, c, result], target, name="vector_mul_add")

# Export the optimized binary as a shared library
lib_path = "vector_mul_add.so"
func.export_library(lib_path)
print(f"Optimized binary with AVX512-FMA support exported to: {lib_path}")
```
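To confirm the exported library also computes the right values, here is a minimal sketch that loads `vector_mul_add.so` produced by the script above and checks the result against NumPy (argument order follows the `tvm.build` call, with the output buffer passed last):

```python
import numpy as np
import tvm

dev = tvm.cpu(0)
mod = tvm.runtime.load_module("vector_mul_add.so")
fma = mod["vector_mul_add"]  # entry function named in tvm.build

n = 128
a_np = np.random.rand(n).astype("float32")
b_np = np.random.rand(n).astype("float32")
c_np = np.random.rand(n).astype("float32")

a_nd = tvm.nd.array(a_np, dev)
b_nd = tvm.nd.array(b_np, dev)
c_nd = tvm.nd.array(c_np, dev)
out_nd = tvm.nd.empty((n,), "float32", dev)

fma(a_nd, b_nd, c_nd, out_nd)
np.testing.assert_allclose(out_nd.numpy(), a_np * b_np + c_np, rtol=1e-5)
print("result matches a[i] * b[i] + c[i]")
```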