wrongtest opened a new pull request #9121:
URL: https://github.com/apache/tvm/pull/9121


   Currently the `LoopPartition` pass tries to partition loops associated with 
a condition wrapped in a likely tag. It would be great if developers could take 
control of which loops to partition, no matter whether the condition to 
eliminate is "likely" tagged.
   
   This PR adds a pragma attr key `loop_partition_hint`, which can be tagged 
explicitly in the schedule phase. The loop partition pass will consider all 
arithmetic comparison conditions for a hinted loop var.
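
To illustrate what partitioning a hinted loop amounts to, here is a pure-Python sketch (not TVM code and not the pass implementation). It assumes a stride-1, 1D max pool with inline padding; the function names `max_pool_1d_guarded` and `max_pool_1d_partitioned` are hypothetical. The guarded version evaluates the boundary comparison on every access, while the partitioned version splits the loop range so that the middle segment needs no comparison at all.

```python
import math


def _guarded_window_max(data, i, pad):
    """Max over the window centred at i, treating out-of-range taps as -inf."""
    n = len(data)
    best = -math.inf
    for k in range(2 * pad + 1):
        idx = i + k - pad
        # Boundary comparison evaluated for every tap.
        val = data[idx] if 0 <= idx < n else -math.inf
        best = max(best, val)
    return best


def max_pool_1d_guarded(data, pad):
    """Single loop: every iteration pays for the boundary comparisons."""
    return [_guarded_window_max(data, i, pad) for i in range(len(data))]


def max_pool_1d_partitioned(data, pad):
    """Same result, with the loop over i split into head / middle / tail."""
    n = len(data)
    lo = min(pad, n)          # first index whose window cannot underflow
    hi = max(n - pad, lo)     # first index whose window would overflow
    out = [-math.inf] * n
    for i in range(0, lo):    # head: comparisons still needed
        out[i] = _guarded_window_max(data, i, pad)
    for i in range(lo, hi):   # middle: window is fully in range, no comparison
        out[i] = max(data[i - pad : i + pad + 1])
    for i in range(hi, n):    # tail: comparisons still needed
        out[i] = _guarded_window_max(data, i, pad)
    return out
```

Both functions return the same values; only the middle loop, which usually covers most iterations, is free of conditions and is therefore easier for a backend to vectorize.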
   
   Below are two examples of how explicitly controlled loop partitioning helps. 
The target is an i7-7700 on Ubuntu20.08, with llvm version 11.0:
   
   - For max pooling with padding inlined, which creates conditional buffer 
accesses:
       ```python
       import numpy as np
       import tvm
       from tvm import te, testing, topi

       data = te.placeholder([1, 128, 56, 56], name="x")
       out = topi.nn.pool2d(data, kernel=[5, 5], stride=[1, 1], padding=[2, 2, 2, 2],
                            pool_type="max", dilation=[1, 1], layout="NCHW")
       # The padding stage is the first input of the pooling op.
       pad = out.op.input_tensors[0]
       x = tvm.nd.array(np.random.randint(0, 64, [1, 128, 56, 56]).astype("float32"))
   
       def test(do_partition):
           s = te.create_schedule([out.op])
           s[pad].compute_inline()
           n, c, h, w = s[out].op.axis
           if do_partition:
               s[out].pragma(h, "loop_partition_hint")
               s[out].pragma(w, "loop_partition_hint")
   
           with tvm.ir.transform.PassContext(
               config={"tir.LoopPartition": {"partition_const_loop": True}}
           ):
               f = tvm.build(s, [data, out], "llvm")
           y = tvm.nd.array(np.zeros([1, 128, 56, 56]).astype("float32"))
           f(x, y)
           result = y.asnumpy()
           print(f.get_source("asm"))
           evaluator = f.time_evaluator(f.entry_name, tvm.cpu(), number=1000)
           print("partition=%s: %.3f millisecs" % (do_partition, evaluator(x, y).mean * 1000))
           return result
   
       r1 = test(do_partition=False)
       r2 = test(do_partition=True)
       testing.assert_allclose(r1, r2, rtol=1e-5)
       ```
       The performance I get:
       - no loop partition: 3.708 millisecs
       - with loop partition: 0.975 millisecs
       
   
   - For tiled matmul following the TVM tensor expression tutorial, but with a 
shape not divisible by the tiling factor. The TIR split does not currently 
create a likely condition for it.
       ```python
      import numpy as np
      import tvm
      from tvm import te

      M, N, K = 1025, 1025, 1025
      dtype = "float32"
      dev = tvm.cpu()
      a = tvm.nd.array(np.random.rand(M, K).astype(dtype), dev)
      b = tvm.nd.array(np.random.rand(K, N).astype(dtype), dev)
      k = te.reduce_axis((0, K), "k")
      A = te.placeholder((M, K), name="A")
      B = te.placeholder((K, N), name="B")
      C = te.compute((M, N), lambda x, y: te.sum(A[x, k] * B[k, y], axis=k), name="C")
      f = te.create_prim_func([A, B, C])
      s = tvm.tir.Schedule(f)
   
      def evaluate_operation(s, target, optimization):
          with tvm.ir.transform.PassContext(
              config={"tir.LoopPartition": {"partition_const_loop": True}}
          ):
              print(tvm.lower(s.mod["main"], [], simple_mode=True))
              func = tvm.build(s.mod["main"], [], target=target, name="mmult")
              assert func
   
          c = tvm.nd.array(np.zeros((M, N), dtype=dtype), dev)
          func(a, b, c)
          evaluator = func.time_evaluator(func.entry_name, dev, number=10)
          mean_time = evaluator(a, b, c).mean
          print("%s: %f" % (optimization, mean_time))
   
      # no opt
      evaluate_operation(s, target="llvm", optimization="none")
   
      # tiling and vectorize
      x, y, k = s.get_loops(s.get_block("C"))
      xo, xi = s.split(x, factors=[None, 32])
      yo, yi = s.split(y, factors=[None, 32])
      ko, ki = s.split(k, factors=[None, 4])
      s.reorder(xo, yo, ko, ki, xi, yi)
      s.vectorize(yi)
      evaluate_operation(s, target="llvm", optimization="blocking")
   
      # loop partition
      def pragma(s, rv, key):
          sref = s.get_sref(rv)
          loop = sref.stmt
          new_loop = tvm.tir.For(loop.loop_var, loop.min, loop.extent, loop.kind,
                                 loop.body, annotations={key: 1})
          s.state.replace(sref, new_loop)
      pragma(s, xo, "pragma_loop_partition_hint")
      pragma(s, yo, "pragma_loop_partition_hint")
      evaluate_operation(s, target="llvm",  optimization="loop_partition")
      ```
      The performance I get (mean time in seconds, as printed by the script):
       - no opt: 1.374402
       - with tiling + vectorize: 0.843930
       - with tiling + vectorize + loop partition: 0.272183
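
The effect of hinting the outer loops of a non-divisible split can be sketched in plain Python (again not TVM code; `scale_guarded` and `scale_partitioned` are hypothetical names, and the doubling of each element is just a stand-in for the real loop body). Splitting an extent that is not a multiple of the factor leaves a guard `x < m` in the innermost loop; partitioning the outer loop separates the full tiles, which need no guard, from the single remainder tile.

```python
def scale_guarded(src, factor):
    """Split loop with a tail guard that is checked on every iteration."""
    m = len(src)
    dst = [None] * m
    n_tiles = (m + factor - 1) // factor  # ceil(m / factor)
    for xo in range(n_tiles):
        for xi in range(factor):
            x = xo * factor + xi
            if x < m:  # only ever false in the last tile
                dst[x] = src[x] * 2
    return dst


def scale_partitioned(src, factor):
    """Outer loop partitioned into full tiles plus one remainder tile."""
    m = len(src)
    dst = [None] * m
    full_tiles = m // factor
    # Full tiles: the guard is provably true, so it is dropped.
    for xo in range(full_tiles):
        for xi in range(factor):
            x = xo * factor + xi
            dst[x] = src[x] * 2
    # Remainder tile: the guard is folded into a shorter inner extent.
    base = full_tiles * factor
    for xi in range(m - base):
        dst[base + xi] = src[base + xi] * 2
    return dst
```

With an extent of 1025 and a factor of 32, the first loop nest covers 32 full tiles without any condition and the remainder loop handles the single leftover element.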
   
   
   

