boh-inspur commented on a change in pull request #5428:
URL: https://github.com/apache/incubator-tvm/pull/5428#discussion_r416245487
##########
File path: src/target/source/codegen_cuda.cc
##########
@@ -274,9 +274,21 @@ void CodeGenCUDA::PrintVecElemLoad(
static const char access[] = {'x', 'y', 'z', 'w'};
CHECK(i >= 0 && i < (t.is_float16() ? 8 : 4));
if ((t.is_int()) && t.bits() == 8) {
- os << "((char)(" << vec << " >> " << i * 8 << "))";
+ if (t.lanes() == 1) {
+ os << vec;
+ } else if (t.lanes() == 2) {
+ os << vec << "." << access[i % 2];
+ } else {
+ os << "((char)(" << vec << " >> " << i * 8 << "))";
+ }
} else if ((t.is_uint()) && t.bits() == 8) {
- os << "((unsigned char)(" << vec << " >> " << i * 8 << "))";
+ if (t.lanes() == 1) {
Review comment:
Yes, we can do that. The generated code would look like the following:
`char _1;`
`char _2 = ((char)((_1) >> 0));`
That is correct, but the code looks a little strange, and might it take more
time at runtime? If that is OK, I think we can common up the logic for both
branches. What is your opinion?
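
To make the trade-off concrete, here is a minimal sketch of the two options for the signed 8-bit branch. `EmitInt8ElemLoad` and its `special_case_scalar` flag are hypothetical names used only for illustration; they are not part of `CodeGenCUDA`, and `lanes` stands in for `t.lanes()`.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for the int8 branch of PrintVecElemLoad.
// special_case_scalar == true  : keep the lanes == 1 branch from the diff.
// special_case_scalar == false : let lanes == 1 fall through to the shift path.
std::string EmitInt8ElemLoad(const std::string& vec, int lanes, int i,
                             bool special_case_scalar) {
  static const char access[] = {'x', 'y', 'z', 'w'};
  std::ostringstream os;
  if (lanes == 1 && special_case_scalar) {
    // Scalar char: the variable itself is the element.
    os << vec;
  } else if (lanes == 2) {
    // Two lanes: select .x or .y, as in the diff.
    os << vec << "." << access[i % 2];
  } else {
    // Packed lanes: shift the requested byte down and cast.
    // With lanes == 1 and i == 0 this emits a shift by 0.
    os << "((char)(" << vec << " >> " << i * 8 << "))";
  }
  return os.str();
}
```

With the flag off, a scalar load is emitted as `((char)(_1 >> 0))` instead of `_1`. A shift by a constant zero is something an optimizing compiler would normally be expected to fold away, but that has not been verified here against nvcc.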