Re: [apache/incubator-tvm] [RFC] Improve quantized convolution performance for armv8 architectures (#5754)

Zhao Wu Thu, 11 Jun 2020 04:17:01 -0700

> 1. It will be hard to do this. The point is that the legalization is done in 
> Relay before picking the strategy (thus, it is unaware of the strategy 
> picked). To keep both legalizations I need somehow to pass information from 
> the strategy (e.g., the name of the algorithm, or something like that). Are 
> you aware of any other ways I can do it?


@giuseros I think add the algorithm name could be one way to handle it. For 
example, we could add it in the `attr` and query it in the legalization pass, 
then we could throw it safely.

> Note that I am targeting NHWC layout. I wasn't able to even compile with 
> conv2d_nhwc_spatial_pack for uint8 (it just hangs, at least when I tried it 
> without auto-tuning on Armv8-A). I gathered from various discussions that 
> NHWC support for arm targets is incomplete at the moment. So for now we might 
> simply agree to leave this as default for NHWC and conv2d_nchw_spatial_pack 
> as default for NCHW and mirror that in the legalization step which might look 
> like:
    ```if is_aarch64_arm() and attrs.data_layout == "NHWC":
        return helper_change_dtypes_to_be_same(attrs, inputs, types, 
relay.qnn.op.conv2d)
    return helper_no_fast_int8_hw_legalization(attrs, inputs, types, 
relay.qnn.op.conv2d)```

Yes, our NHWC schedule on arm cpu doesn't be complete. After our careful 
testing, NHWC is also perform better than NCHW on arm cpu using Ansor (aka auto 
scheduler) too. So this prompts us we could improve our AutoTVM NHWC schedule 
on arm cpu too.  As the result I show in the post,  we use auto schedule is to 
leverage NHWC layout and `smlal` instruction,  I prefer we could leverage 
`attr[algorithm_name]` mentioned previous to keep `smlal` instruction. After 
auto scheduler released (we are working hard to do it, we wish after 2 weeks we 
could bring it in), we could see how to improve it (like generating smlal and 
smlal2 or your tensorize instruction), they are orthogonal but they share the 
same legalize pass.

One background of auto scheduler: In auto scheduler, we only need tvm.compute, 
then we could generate schedule automatically, so we could try NHWC / NCHW 
easily. So there is no spatial pack schedule template concept in the auto 
scheduler world in fact.

> About the Raspberry+mobilenet v2, good to know you are working on Armv8-A 
> (sorry to have assumed otherwise). However, there is still the point that 
> mobilenet uses shallow convolutions, while I am addressing deeper and more 
> generic convolutions.

So we should keep both algorithm better, right?


> Are you saying that, as things stand now in TVM, the conv2d_nhwc_spatial_pack 
> schedule might be faster than the gemm approach on smaller CPUs? 
> Unfortunately, for now I don't think they can be added together because of 
> what I said above about the legalization step. Do you know any work-around to 
> that? Maybe I can legalize only for specific devices (e.g., only for 
> Cortex-A55)?

I think add algorithm name mentioned before maybe could help to solve it.

> Finally, as things stand now we might get this PR in, and later do a more 
> detailed comparison across different networks + CPUs

Ok. I buy it in. After legalization pass we discussed is solved, I am glad to 
do code review carefully and handle this pr.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-tvm/pull/5754#issuecomment-642577581

Re: [apache/incubator-tvm] [RFC] Improve quantized convolution performance for armv8 architectures (#5754)

Reply via email to