Hi @tkonolige
Sorry for the delayed response.
I modified the target to "llvm -mcpu=cascadelake" as suggested and re-ran the
tuning. Inference time is now much better, < 100 ms in both benchmark and
VirtualMachineProfiler, but a ~4x discrepancy still remains between the
outputs of the two profilers.
The outputs are attached below ([1]).
I tried ResNet-18 as well and observed the same discrepancy there.
When running without graph tuning, there is almost no discrepancy.
Interestingly, enabling graph tuning worsens the debug_executor's total
inference time while improving that of the other two. The outputs are attached
below ([2]).
I haven't yet been able to get hold of another system to install and run these
experiments on; I will update this thread as soon as I do.
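For concreteness, here is how the ~4x figure falls out of the profiler totals
(a quick plain-Python check; the numbers are copied from the outputs in [1]):

```python
# Totals reported by the two profilers, in microseconds (from outputs [1]).
vm_profiler_total_us = 99_441.43      # profiler_vm "Total"
debug_executor_total_us = 383_036.30  # debug_executor "Total"
benchmark_mean_us = 95_115.7          # benchmark mean (95.1157 ms)

# debug_executor vs. VirtualMachineProfiler: the reported discrepancy.
ratio = debug_executor_total_us / vm_profiler_total_us
print(f"debug_executor / profiler_vm = {ratio:.2f}x")  # → 3.85x

# profiler_vm agrees with benchmark to within a few percent.
agreement = vm_profiler_total_us / benchmark_mean_us
print(f"profiler_vm / benchmark = {agreement:.2f}x")  # → 1.05x
```

So the VirtualMachineProfiler total matches the benchmark mean closely, while
the debug_executor total is roughly 3.85x larger.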
__________________________________________________________________________________
Outputs:
[1] With Graph Tuning
(a) profiler_vm
```
Config for target=llvm -keys=cpu -link-params=0 -mcpu=cascadelake,
workload=('dense_nopack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR',
(1000, 2048), 'float32'), None, 'float32') is missing in ApplyGraphBest
context. A fallback configuration is used, which may bring great performance
regression.
Config for target=llvm -keys=cpu -link-params=0 -mcpu=cascadelake,
workload=('dense_pack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000,
2048), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A
fallback configuration is used, which may bring great performance regression.
One or more operators have not been tuned. Please tune your model for better
performance. Use DEBUG logging level to see more details.
Name  Duration (us)  Percent  Count  out_layout  Device  data_layout  kernel_layout  Hash  layout  Argument Shapes  dst_layout  weight_layout  src_layout
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_8  15,909.52  16.00  6  NCHW16c  cpu0  NCHW16c  OIHW16i16o  efb9044cdd43e0b8  float32[1, 16, 14, 14, 16], float32[16, 16, 3, 3, 16, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 14, 14, 16]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_5  10,522.82  10.58  4  NCHW32c  cpu0  NCHW8c  OIHW8i32o  0d551fd3800939e1  float32[1, 16, 28, 28, 8], float32[4, 16, 3, 3, 8, 32], float32[1, 4, 1, 1, 32], float32[1, 4, 28, 28, 32]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_11  9,095.54  9.15  3  NCHW16c  cpu0  NCHW16c  OIHW16i16o  68695c5cd347ce57  float32[1, 32, 7, 7, 16], float32[32, 32, 3, 3, 16, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 7, 7, 16]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_2  8,034.25  8.08  3  NCHW32c  cpu0  NCHW64c  OIHW64i32o  83e0f5d1673ff2ae  float32[1, 1, 56, 56, 64], float32[2, 1, 3, 3, 64, 32], float32[1, 2, 1, 1, 32], float32[1, 2, 56, 56, 32]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_9  6,451.60  6.49  5  NCHW16c  cpu0  NCHW16c  OIHW16i16o  c8d2fb74508242fa  float32[1, 64, 14, 14, 16], float32[16, 64, 1, 1, 16, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 14, 14, 16]
fused_nn_contrib_conv2d_NCHWc_add_2  6,219.45  6.25  5  NCHW16c  cpu0  NCHW4c  OIHW4i16o  991e77362efe315d  float32[1, 64, 14, 14, 4], float32[64, 64, 1, 1, 4, 16], float32[1, 64, 14, 14, 16], float32[1, 64, 14, 14, 16]
fused_nn_contrib_conv2d_NCHWc_add_1  4,069.38  4.09  3  NCHW64c  cpu0  NCHW32c  OIHW32i64o  b8f45dade76ef8ee  float32[1, 4, 28, 28, 32], float32[8, 4, 1, 1, 32, 64], float32[1, 8, 28, 28, 64], float32[1, 8, 28, 28, 64]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_6  3,627.03  3.65  3  NCHW32c  cpu0  NCHW64c  OIHW64i32o  435cfe42fcb8d0b0  float32[1, 8, 28, 28, 64], float32[4, 8, 1, 1, 64, 32], float32[1, 4, 1, 1, 32], float32[1, 4, 28, 28, 32]
fused_nn_contrib_conv2d_NCHWc_add  3,069.46  3.09  2  NCHW16c  cpu0  NCHW32c  OIHW32i16o  6fb734c77ed64bde  float32[1, 2, 56, 56, 32], float32[16, 2, 1, 1, 32, 16], float32[1, 16, 56, 56, 16], float32[1, 16, 56, 56, 16]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu  2,898.89  2.92  1  NCHW16c  cpu0  NCHW3c  OIHW3i16o  10a40e9231ff15a6  float32[1, 1, 224, 224, 3], float32[4, 1, 7, 7, 3, 16], float32[1, 4, 1, 1, 16], float32[1, 4, 112, 112, 16]
fused_nn_contrib_conv2d_NCHWc_3  2,659.84  2.67  1  NCHW16c  cpu0  NCHW16c  OIHW16i16o  9c3ea371f8ec4054  float32[1, 64, 14, 14, 16], float32[128, 64, 1, 1, 16, 16], float32[1, 128, 7, 7, 16]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_12  2,592.41  2.61  2  NCHW16c  cpu0  NCHW16c  OIHW16i16o  1cc8a4dccc794a64  float32[1, 128, 7, 7, 16], float32[32, 128, 1, 1, 16, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 7, 7, 16]
fused_nn_contrib_conv2d_NCHWc_add_3  2,587.97  2.60  2  NCHW16c  cpu0  NCHW32c  OIHW32i16o  528b9cb523882d7e  float32[1, 16, 7, 7, 32], float32[128, 16, 1, 1, 32, 16], float32[1, 128, 7, 7, 16], float32[1, 128, 7, 7, 16]
fused_nn_contrib_conv2d_NCHWc_1  2,568.47  2.58  1  NCHW64c  cpu0  NCHW16c  OIHW16i64o  9b9c1d5fc56b0353  float32[1, 16, 56, 56, 16], float32[8, 16, 1, 1, 16, 64], float32[1, 8, 28, 28, 64]
fused_nn_contrib_conv2d_NCHWc_2  2,560.30  2.57  1  NCHW16c  cpu0  NCHW64c  OIHW64i16o  371a9e61ecaeecce  float32[1, 8, 28, 28, 64], float32[64, 8, 1, 1, 64, 16], float32[1, 64, 14, 14, 16]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_3  2,393.13  2.41  2  NCHW64c  cpu0  NCHW16c  OIHW16i64o  850ecaa157c95aac  float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16, 64], float32[1, 1, 1, 1, 64], float32[1, 1, 56, 56, 64]
fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu  1,519.12  1.53  1  NCHW16c  cpu0  NCHW32c  OIHW32i16o  abe40a1f08b34bad  float32[1, 2, 56, 56, 32], float32[16, 2, 1, 1, 32, 16], float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 56, 56, 16]
fused_nn_contrib_conv2d_NCHWc  1,382.10  1.39  1  NCHW16c  cpu0  NCHW16c  OIHW16i16o  7661eb48c0b8a7e6  float32[1, 4, 56, 56, 16], float32[16, 4, 1, 1, 16, 16], float32[1, 16, 56, 56, 16]
fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_1  1,319.25  1.33  1  NCHW64c  cpu0  NCHW32c  OIHW32i64o  88bbb32f8f542f98  float32[1, 4, 28, 28, 32], float32[8, 4, 1, 1, 32, 64], float32[1, 8, 28, 28, 64], float32[1, 8, 1, 1, 64], float32[1, 8, 28, 28, 64]
fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_2  1,299.49  1.31  1  NCHW16c  cpu0  NCHW4c  OIHW4i16o  c7b912640028a9e2  float32[1, 64, 14, 14, 4], float32[64, 64, 1, 1, 4, 16], float32[1, 64, 14, 14, 16], float32[1, 64, 1, 1, 16], float32[1, 64, 14, 14, 16]
fused_nn_contrib_conv2d_NCHWc_add_multiply_add_nn_relu  1,252.95  1.26  1  NCHW16c  cpu0  NCHW32c  OIHW32i16o  21cb6d538731ba92  float32[1, 16, 7, 7, 32], float32[128, 16, 1, 1, 32, 16], float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16], float32[1, 128, 1, 1, 16], float32[1, 128, 7, 7, 16]
fused_add_nn_relu  823.04  0.83  2  cpu0  e907ce81104cda7a  float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 56, 56, 16]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_10  759.67  0.76  1  NCHW16c  cpu0  NCHW16c  OIHW16i16o  8d07031ff51d0737  float32[1, 64, 14, 14, 16], float32[32, 64, 1, 1, 16, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 7, 7, 16]
fused_nn_contrib_dense_pack_add  710.21  0.71  1  cpu0  7641a0cce9852143  float32[1, 2048], float32[40, 2048, 25], float32[1, 1000], float32[1, 1000]  NC25n
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_7  693.32  0.70  1  NCHW16c  cpu0  NCHW64c  OIHW64i16o  dc31662fedbb8185  float32[1, 8, 28, 28, 64], float32[16, 8, 1, 1, 64, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 14, 14, 16]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_4  656.53  0.66  1  NCHW128c  cpu0  NCHW16c  OIHW16i128o  9b01f6479b89fd68  float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16, 128], float32[1, 1, 1, 1, 128], float32[1, 1, 28, 28, 128]
fused_add_nn_relu_1  631.05  0.63  3  cpu0  0e82013d73aa68c1  float32[1, 8, 28, 28, 64], float32[1, 8, 1, 1, 64], float32[1, 8, 28, 28, 64]
fused_add_nn_relu_2  542.71  0.55  5  cpu0  f12067172f61c850  float32[1, 64, 14, 14, 16], float32[1, 64, 1, 1, 16], float32[1, 64, 14, 14, 16]
fused_nn_max_pool2d_add_nn_relu  364.59  0.37  1  cpu0  6f701a4fa071030f  NCHW16c  float32[1, 4, 112, 112, 16], float32[1, 4, 1, 1, 16], float32[1, 4, 56, 56, 16]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_1  330.20  0.33  1  NCHW64c  cpu0  NCHW16c  OIHW16i64o  0f7bbb0e363c360c  float32[1, 4, 56, 56, 16], float32[1, 4, 1, 1, 16, 64], float32[1, 1, 1, 1, 64], float32[1, 1, 56, 56, 64]
fused_layout_transform_1  188.19  0.19  3  cpu0  b8cbb72b4035894d  float32[1, 4, 28, 28, 32], float32[1, 16, 28, 28, 8]  NCHW8c  NCHW32c
fused_layout_transform_2  172.13  0.17  6  cpu0  f5e631fb93d23d4d  float32[1, 16, 14, 14, 16], float32[1, 64, 14, 14, 4]  NCHW4c  NCHW16c
fused_add_nn_relu_3  106.41  0.11  2  cpu0  5d16c15878cc73d4  float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16], float32[1, 128, 7, 7, 16]
fused_add_layout_transform  96.21  0.10  1  cpu0  69355d3cc810f874  float32[1, 3, 224, 224], float32[3, 1, 1], float32[1, 1, 224, 224, 3]  NCHW3c  NCHW
fused_nn_global_avg_pool2d  56.33  0.06  1  cpu0  f18307e2786f4cb3  NCHW16c  float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16]
fused_layout_transform  52.16  0.05  1  cpu0  2c5d64d5f9faa001  float32[1, 1, 28, 28, 128], float32[1, 16, 28, 28, 8]  NCHW8c  NCHW128c
fused_layout_transform_3  48.26  0.05  3  cpu0  add43c0d2d8a8a3c  float32[1, 32, 7, 7, 16], float32[1, 16, 7, 7, 32]  NCHW32c  NCHW16c
fused_nn_softmax  9.76  0.01  1  cpu0  ca61e79ea24e53f0  float32[1, 1000], float32[1, 1000]
fused_layout_transform_nn_batch_flatten  1.73  0.00  1  cpu0  2db99463d18696a4  float32[1, 128, 1, 1, 16], float32[1, 2048]  NCHW  NCHW16c
----------
Sum  98,275.48  98.83  84
Total  99,441.43  1  cpu0
```
(b) debug_executor
```
Config for target=llvm -keys=cpu -link-params=0 -mcpu=cascadelake,
workload=('dense_nopack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR',
(1000, 2048), 'float32'), None, 'float32') is missing in ApplyGraphBest
context. A fallback configuration is used, which may bring great performance
regression.
Config for target=llvm -keys=cpu -link-params=0 -mcpu=cascadelake,
workload=('dense_pack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000,
2048), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A
fallback configuration is used, which may bring great performance regression.
One or more operators have not been tuned. Please tune your model for better
performance. Use DEBUG logging level to see more details.
Name  Duration (us)  Percent  Count  out_layout  Device  data_layout  kernel_layout  Hash  layout  Argument Shapes  dst_layout  weight_layout  src_layout
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_3  139,559.24  36.43  2  NCHW16c  cpu0  NCHW32c  OIHW32i16o  6fb734c77ed64bde  float32[1, 2, 56, 56, 32], float32[16, 2, 1, 1, 32, 16], float32[1, 16, 56, 56, 16], float32[1, 16, 56, 56, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_12  118,024.98  30.81  1  NCHW16c  cpu0  NCHW3c  OIHW3i16o  10a40e9231ff15a6  float32[1, 1, 224, 224, 3], float32[4, 1, 7, 7, 3, 16], float32[1, 4, 1, 1, 16], float32[1, 4, 112, 112, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc  23,051.66  6.02  1  NCHW16c  cpu0  NCHW16c  OIHW16i16o  7661eb48c0b8a7e6  float32[1, 4, 56, 56, 16], float32[16, 4, 1, 1, 16, 16], float32[1, 16, 56, 56, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_3  15,185.61  3.96  6  NCHW16c  cpu0  NCHW16c  OIHW16i16o  efb9044cdd43e0b8  float32[1, 16, 14, 14, 16], float32[16, 16, 3, 3, 16, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 14, 14, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_9  13,328.36  3.48  3  NCHW32c  cpu0  NCHW64c  OIHW64i32o  83e0f5d1673ff2ae  float32[1, 1, 56, 56, 64], float32[2, 1, 3, 3, 64, 32], float32[1, 2, 1, 1, 32], float32[1, 2, 56, 56, 32]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_11  13,159.49  3.44  1  NCHW64c  cpu0  NCHW16c  OIHW16i64o  0f7bbb0e363c360c  float32[1, 4, 56, 56, 16], float32[1, 4, 1, 1, 16, 64], float32[1, 1, 1, 1, 64], float32[1, 1, 56, 56, 64]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_6  10,205.32  2.66  4  NCHW32c  cpu0  NCHW8c  OIHW8i32o  0d551fd3800939e1  float32[1, 16, 28, 28, 8], float32[4, 16, 3, 3, 8, 32], float32[1, 4, 1, 1, 32], float32[1, 4, 28, 28, 32]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu  7,727.92  2.02  3  NCHW16c  cpu0  NCHW16c  OIHW16i16o  68695c5cd347ce57  float32[1, 32, 7, 7, 16], float32[32, 32, 3, 3, 16, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 7, 7, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_1  5,840.79  1.52  5  NCHW16c  cpu0  NCHW4c  OIHW4i16o  991e77362efe315d  float32[1, 64, 14, 14, 4], float32[64, 64, 1, 1, 4, 16], float32[1, 64, 14, 14, 16], float32[1, 64, 14, 14, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_4  5,746.35  1.50  5  NCHW16c  cpu0  NCHW16c  OIHW16i16o  c8d2fb74508242fa  float32[1, 64, 14, 14, 16], float32[16, 64, 1, 1, 16, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 14, 14, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_2  3,745.35  0.98  3  NCHW64c  cpu0  NCHW32c  OIHW32i64o  b8f45dade76ef8ee  float32[1, 4, 28, 28, 32], float32[8, 4, 1, 1, 32, 64], float32[1, 8, 28, 28, 64], float32[1, 8, 28, 28, 64]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_7  3,425.00  0.89  3  NCHW32c  cpu0  NCHW64c  OIHW64i32o  435cfe42fcb8d0b0  float32[1, 8, 28, 28, 64], float32[4, 8, 1, 1, 64, 32], float32[1, 4, 1, 1, 32], float32[1, 4, 28, 28, 32]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_2  2,508.48  0.65  1  NCHW16c  cpu0  NCHW64c  OIHW64i16o  371a9e61ecaeecce  float32[1, 8, 28, 28, 64], float32[64, 8, 1, 1, 64, 16], float32[1, 64, 14, 14, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_10  2,400.83  0.63  2  NCHW64c  cpu0  NCHW16c  OIHW16i64o  850ecaa157c95aac  float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16, 64], float32[1, 1, 1, 1, 64], float32[1, 1, 56, 56, 64]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_1  2,396.47  0.63  1  NCHW64c  cpu0  NCHW16c  OIHW16i64o  9b9c1d5fc56b0353  float32[1, 16, 56, 56, 16], float32[8, 16, 1, 1, 16, 64], float32[1, 8, 28, 28, 64]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_1  2,271.00  0.59  2  NCHW16c  cpu0  NCHW16c  OIHW16i16o  1cc8a4dccc794a64  float32[1, 128, 7, 7, 16], float32[32, 128, 1, 1, 16, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 7, 7, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add  2,260.06  0.59  2  NCHW16c  cpu0  NCHW32c  OIHW32i16o  528b9cb523882d7e  float32[1, 16, 7, 7, 32], float32[128, 16, 1, 1, 32, 16], float32[1, 128, 7, 7, 16], float32[1, 128, 7, 7, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_3  2,240.88  0.59  1  NCHW16c  cpu0  NCHW16c  OIHW16i16o  9c3ea371f8ec4054  float32[1, 64, 14, 14, 16], float32[128, 64, 1, 1, 16, 16], float32[1, 128, 7, 7, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_2  1,401.25  0.37  1  NCHW16c  cpu0  NCHW32c  OIHW32i16o  abe40a1f08b34bad  float32[1, 2, 56, 56, 32], float32[16, 2, 1, 1, 32, 16], float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 56, 56, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_1  1,249.88  0.33  1  NCHW64c  cpu0  NCHW32c  OIHW32i64o  88bbb32f8f542f98  float32[1, 4, 28, 28, 32], float32[8, 4, 1, 1, 32, 64], float32[1, 8, 28, 28, 64], float32[1, 8, 1, 1, 64], float32[1, 8, 28, 28, 64]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_multiply_add_nn_relu  1,220.86  0.32  1  NCHW16c  cpu0  NCHW32c  OIHW32i16o  21cb6d538731ba92  float32[1, 16, 7, 7, 32], float32[128, 16, 1, 1, 32, 16], float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16], float32[1, 128, 1, 1, 16], float32[1, 128, 7, 7, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu  1,160.16  0.30  1  NCHW16c  cpu0  NCHW4c  OIHW4i16o  c7b912640028a9e2  float32[1, 64, 14, 14, 4], float32[64, 64, 1, 1, 4, 16], float32[1, 64, 14, 14, 16], float32[1, 64, 1, 1, 16], float32[1, 64, 14, 14, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_8  599.86  0.16  1  NCHW128c  cpu0  NCHW16c  OIHW16i128o  9b01f6479b89fd68  float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16, 128], float32[1, 1, 1, 1, 128], float32[1, 1, 28, 28, 128]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_5  579.10  0.15  1  NCHW16c  cpu0  NCHW64c  OIHW64i16o  dc31662fedbb8185  float32[1, 8, 28, 28, 64], float32[16, 8, 1, 1, 64, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 14, 14, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_2  571.03  0.15  1  NCHW16c  cpu0  NCHW16c  OIHW16i16o  8d07031ff51d0737  float32[1, 64, 14, 14, 16], float32[32, 64, 1, 1, 16, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 7, 7, 16]
tvmgen_default_fused_add_nn_relu_3  519.35  0.14  2  cpu0  e907ce81104cda7a  float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 56, 56, 16]
tvmgen_default_fused_nn_contrib_dense_pack_add  488.16  0.13  1  cpu0  7641a0cce9852143  float32[1, 2048], float32[40, 2048, 25], float32[1, 1000], float32[1, 1000]  NC25n
tvmgen_default_fused_add_nn_relu_2  360.30  0.09  3  cpu0  0e82013d73aa68c1  float32[1, 8, 28, 28, 64], float32[1, 8, 1, 1, 64], float32[1, 8, 28, 28, 64]
tvmgen_default_fused_nn_max_pool2d_add_nn_relu  342.65  0.09  1  cpu0  6f701a4fa071030f  NCHW16c  float32[1, 4, 112, 112, 16], float32[1, 4, 1, 1, 16], float32[1, 4, 56, 56, 16]
tvmgen_default_fused_add_nn_relu_1  291.22  0.08  5  cpu0  f12067172f61c850  float32[1, 64, 14, 14, 16], float32[1, 64, 1, 1, 16], float32[1, 64, 14, 14, 16]
tvmgen_default_fused_layout_transform_2  106.25  0.03  3  cpu0  b8cbb72b4035894d  float32[1, 4, 28, 28, 32], float32[1, 16, 28, 28, 8]  NCHW8c  NCHW32c
tvmgen_default_fused_add_layout_transform  76.45  0.02  1  cpu0  69355d3cc810f874  float32[1, 3, 224, 224], float32[3, 1, 1], float32[1, 1, 224, 224, 3]  NCHW3c  NCHW
tvmgen_default_fused_layout_transform_1  68.45  0.02  6  cpu0  f5e631fb93d23d4d  float32[1, 16, 14, 14, 16], float32[1, 64, 14, 14, 4]  NCHW4c  NCHW16c
tvmgen_default_fused_add_nn_relu  51.61  0.01  2  cpu0  5d16c15878cc73d4  float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16], float32[1, 128, 7, 7, 16]
tvmgen_default_fused_nn_global_avg_pool2d  46.06  0.01  1  cpu0  f18307e2786f4cb3  NCHW16c  float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16]
tvmgen_default_fused_layout_transform_3  36.66  0.01  1  cpu0  2c5d64d5f9faa001  float32[1, 1, 28, 28, 128], float32[1, 16, 28, 28, 8]  NCHW8c  NCHW128c
tvmgen_default_fused_layout_transform  11.41  0.00  3  cpu0  add43c0d2d8a8a3c  float32[1, 32, 7, 7, 16], float32[1, 16, 7, 7, 32]  NCHW32c  NCHW16c
tvmgen_default_fused_nn_softmax  9.50  0.00  1  cpu0  ca61e79ea24e53f0  float32[1, 1000], float32[1, 1000]
tvmgen_default_fused_layout_transform_nn_batch_flatten  1.05  0.00  1  cpu0  2db99463d18696a4  float32[1, 128, 1, 1, 16], float32[1, 2048]  NCHW  NCHW16c
----------
Sum  382,269.07  99.80  84
Total  383,036.30  1  cpu0
```
(c) benchmark
```
Config for target=llvm -keys=cpu -link-params=0 -mcpu=cascadelake,
workload=('dense_nopack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR',
(1000, 2048), 'float32'), None, 'float32') is missing in ApplyGraphBest
context. A fallback configuration is used, which may bring great performance
regression.
Config for target=llvm -keys=cpu -link-params=0 -mcpu=cascadelake,
workload=('dense_pack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000,
2048), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A
fallback configuration is used, which may bring great performance regression.
One or more operators have not been tuned. Please tune your model for better
performance. Use DEBUG logging level to see more details.
Evaluate inference time cost...
Execution time summary:
mean (ms) median (ms) max (ms) min (ms) std (ms)
95.1157 95.0706 95.2259 95.0505 0.0784
```
---
[Visit
Topic](https://discuss.tvm.apache.org/t/difference-in-profiler-outputs/11255/7)
to respond.