luchangli03 opened a new issue #6992:
URL: https://github.com/apache/tvm/issues/6992
I tried to tune the following example, which exports multiple output nodes.
This situation is very common when fusing a sub-graph, but I found that Ansor
cannot handle it: it generates code for only one output node. In the generated
CUDA code, I can clearly see that there is only one output data pointer. TVM
itself handles this case correctly, but it generates two CUDA kernels, which
results in bad performance.
from tvm import auto_scheduler, te, topi

@auto_scheduler.register_workload
def add_test(data_shape0, data_shape1, data_shape2, data_shape3, dtype="float32"):
    data0 = te.placeholder(data_shape0, name="data0", dtype=dtype)
    data1 = te.placeholder(data_shape1, name="data1", dtype=dtype)
    data2 = te.placeholder(data_shape2, name="data2", dtype=dtype)
    data3 = te.placeholder(data_shape3, name="data3", dtype=dtype)
    add1 = topi.add(data0, data1)
    add2 = topi.add(data2, data3)
    add3 = topi.add(add1, add2)
    # Two output nodes (add2, add3) are returned together with the inputs.
    return [add2, add3, data0, data1, data2, data3]

# The input shapes broadcast to a (16, 16) result.
data_shape0 = (16, 1)
data_shape1 = (1, 16)
data_shape2 = (1, 16)
data_shape3 = (16, 1)
dtype = "float32"
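To make the expected behavior concrete, here is a minimal pure-Python sketch of what a single fused kernel with two output pointers would compute for this workload. It is not TVM or Ansor code and says nothing about how either generates kernels. The function name fused_add_reference and the helper at are hypothetical, introduced only for illustration, and the inputs are assumed to be nested lists of shape (16, 1) or (1, 16).

```python
def fused_add_reference(data0, data1, data2, data3):
    """Compute add2 and add3 in one loop nest, broadcasting size-1 axes."""
    inputs = (data0, data1, data2, data3)
    rows = max(len(t) for t in inputs)
    cols = max(len(t[0]) for t in inputs)

    def at(t, i, j):
        # Broadcasting: an axis of length 1 is always read at index 0.
        return t[i if len(t) > 1 else 0][j if len(t[0]) > 1 else 0]

    add2 = [[0.0] * cols for _ in range(rows)]
    add3 = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            s1 = at(data0, i, j) + at(data1, i, j)  # add1, never stored
            s2 = at(data2, i, j) + at(data3, i, j)  # add2
            add2[i][j] = s2       # first output
            add3[i][j] = s1 + s2  # second output, written in the same pass
    return add2, add3
```

Both outputs are written in the same pass over the (16, 16) iteration space, which is what a single kernel with two output buffers would do. A reference like this could also be used to check the numerical results of whatever kernels are eventually generated.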