luchangli03 opened a new issue #6992:
URL: https://github.com/apache/tvm/issues/6992


   I tried to tune the following example, which returns multiple output nodes. 
This situation is common when fusing a sub-graph, but I found that Ansor cannot 
handle it: Ansor generates code for only one output node. In the generated CUDA 
code, I can clearly see that there is only one output data pointer. TVM itself 
handles this case correctly, but it generates two CUDA kernels, which results 
in poor performance.
   
   from tvm import auto_scheduler, te, topi

   @auto_scheduler.register_workload
   def add_test(data_shape0, data_shape1, data_shape2, data_shape3, dtype="float32"):
       data0 = te.placeholder(data_shape0, name="data0", dtype=dtype)
       data1 = te.placeholder(data_shape1, name="data1", dtype=dtype)
       data2 = te.placeholder(data_shape2, name="data2", dtype=dtype)
       data3 = te.placeholder(data_shape3, name="data3", dtype=dtype)
   
       add1 = topi.add(data0, data1)
       add2 = topi.add(data2, data3)
       add3 = topi.add(add1, add2)
   
       return [add2, add3, data0, data1, data2, data3]
   
   data_shape0 = (16, 1)
   data_shape1 = (1, 16)
   data_shape2 = (1, 16)
   data_shape3 = (16, 1)
   
   dtype = "float32"
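
For reference, here is a minimal NumPy sketch of what the two outputs of this workload should contain, which may help when checking whether a generated kernel writes both `add2` and `add3`. It is not part of the original report. The helper name `add_test_reference` and the random inputs are assumptions made for illustration, and it relies only on standard NumPy broadcasting, so both outputs have shape (16, 16).

```python
import numpy as np


def add_test_reference(data0, data1, data2, data3):
    """NumPy reference for the add_test workload (hypothetical helper)."""
    add1 = data0 + data1   # (16, 1) + (1, 16) broadcasts to (16, 16)
    add2 = data2 + data3   # (1, 16) + (16, 1) broadcasts to (16, 16)
    add3 = add1 + add2     # second output, depends on the first
    return add2, add3


# Same shapes and dtype as in the issue; the values are arbitrary.
rng = np.random.default_rng(0)
shapes = [(16, 1), (1, 16), (1, 16), (16, 1)]
inputs = [rng.standard_normal(s).astype("float32") for s in shapes]

add2_ref, add3_ref = add_test_reference(*inputs)
print(add2_ref.shape, add3_ref.shape)
```

A tuned kernel that handles both outputs would be expected to match `add2_ref` and `add3_ref` within floating-point tolerance, for example with `np.testing.assert_allclose`.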
   

