* Based on TVM 0.18, the **vfmadd213ps** instruction (**fused 
multiply-add**) shows up in the disassembly of the function vector_mul_add_compute_, so TVM can 
perform this kind of optimization for a custom operator defined with te.
```
382 00000000000016e0 <vector_mul_add_compute_>:
...
399     174e:       62 72 5d 48 a8 21       vfmadd213ps (%rcx),%zmm4,%zmm12
400     1754:       62 72 55 48 a8 69 01    vfmadd213ps 0x40(%rcx),%zmm5,%zmm13
...
```

* Test script for TVM
```
(tvm0.18_py310_zyd) root@j00595921debug2-cc95c9977-q752v:/home/zhongyunde# cat vector_matmul_add5.py
import tvm
from tvm import te
import numpy as np

def vector_mul_add(n, dtype="float32"):
    """
    Element-wise vector fused multiply-add (FMA):
    result[i] = a[i] * b[i] + c[i] for i in 0..n-1
    """
    a = te.placeholder((n,), dtype=dtype, name="a")  # Input vector a
    b = te.placeholder((n,), dtype=dtype, name="b")  # Input vector b
    c = te.placeholder((n,), dtype=dtype, name="c")  # Input vector c
    
    # Element-wise multiplication: mul[i] = a[i] * b[i]
    mul = te.compute((n,), lambda i: a[i] * b[i], name="mul")
    
    # Element-wise addition: result[i] = mul[i] + c[i]
    result = te.compute((n,), lambda i: mul[i] + c[i], name="result")
    
    return [a, b, c, mul, result]

# Main program
n = 128  # Vector length (should be multiple of 16 for AVX512 full utilization)
a, b, c, mul, result = vector_mul_add(n)  # Unpack tensors from function
s = te.create_schedule(result.op)  # Create schedule for the computation graph

# Get operation objects for scheduling
mul_op = mul.op

# Apply vectorization for AVX512-FMA instructions:
# - Maps to 16-wide float32 operations using AVX512 FMA (vfmadd213ps in the disassembly above)
s[mul_op].vectorize(mul_op.axis[0])  # Vectorize multiplication
s[result].vectorize(result.op.axis[0])  # Vectorize addition

# Compile with AVX512-FMA support for Skylake-AVX512 architecture
# -mcpu=skylake-avx512 enables:
#   * 512-bit vector registers (ZMM0-ZMM31)
#   * FMA3 instructions for fused multiply-add
#   * 16 single-precision floats processed per instruction
target = "llvm -mcpu=skylake-avx512"
with tvm.transform.PassContext(opt_level=3):
    func = tvm.build(s, [a, b, c, result], target, name="vector_mul_add")

# Export optimized binary as a shared library
lib_path = "vector_mul_add.so"
func.export_library(lib_path)
print(f"Optimized binary with AVX512-FMA support exported to: {lib_path}")
```
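
For completeness, here is a minimal sketch of how the compiled kernel can be checked numerically (the disassembly above can be produced with `objdump -d vector_mul_add.so`). It assumes it runs right after the script above in the same Python session, so `func` and `n` are still in scope; the `get_source("asm")` call assumes the LLVM backend exposes the generated assembly under that format name.
```
import numpy as np
import tvm

# Minimal check, assuming `func` and `n` from the script above are in scope.
dev = tvm.cpu(0)
a_np = np.random.rand(n).astype("float32")
b_np = np.random.rand(n).astype("float32")
c_np = np.random.rand(n).astype("float32")
out = tvm.nd.array(np.zeros(n, dtype="float32"), dev)

# Run the compiled kernel: out[i] = a[i] * b[i] + c[i]
func(tvm.nd.array(a_np, dev), tvm.nd.array(b_np, dev),
     tvm.nd.array(c_np, dev), out)
np.testing.assert_allclose(out.numpy(), a_np * b_np + c_np, rtol=1e-5)
print("numerical check passed")

# Assumption: the LLVM module accepts "asm" as a source format, so the FMA
# can also be spotted without objdump.
asm = func.get_source("asm")
print([line for line in asm.splitlines() if "vfmadd" in line][:4])
```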




