[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap

2022-01-31 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #12 from Uroš Bizjak  ---
(In reply to Uroš Bizjak from comment #10)
> (In reply to Hongtao.liu from comment #4)
> > Also there's a separate issue: codegen for the below is not optimal
> > gimple:
> > _11 = VIEW_CONVERT_EXPR(a_3(D))
> > asm:
> > mov QWORD PTR [rsp-24], rdi
> > mov QWORD PTR [rsp-16], rsi
> > movdqa  xmm0, XMMWORD PTR [rsp-24]
> > 
> > 
> > I think this issue has been recorded in several existing PRs.
> 
> Maybe this can be solved with secondary_reload when GPR and XMM regs are
> involved.

PR104306 is an experiment.

[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap

2022-01-28 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #11 from Jakub Jelinek  ---
With -O3 it regressed with
r7-2009-g8d4fc2d3d0c8f87bb3e182be1a618a511f8f9465.
__uint128_t bswap(__uint128_t a) { return __builtin_bswap128 (a); }
emits the optimal code, but __builtin_bswap128 is only available in GCC 11.1 and
later.

One fix for this might be to handle
  _8 = BIT_FIELD_REF ;
  _1 = __builtin_bswap64 (_8);
  y[0] = _1;
  _10 = BIT_FIELD_REF ;
  _2 = __builtin_bswap64 (_10);
  y[1] = _2;
  _7 = MEM  [(char * {ref-all})];
in bswap or store merging.
Though I'm afraid the current bswap infrastructure limits it to 64-bit size,
because it tracks the bytes in uint64_t vars and uses 8 bits to determine which
byte it is (0 means a value of zero, 1-8 is the byte index, and 0xff means
unknown).
While that is 10 different values right now, if we handled uint128_t we'd need
18 different values times 16 bytes.
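
To illustrate the encoding described above, here is a standalone sketch of the
idea only (not GCC's actual code in gimple-ssa-store-merging.cc; the function
names are made up for illustration): one 8-bit marker per byte is packed into a
uint64_t, so an 8-byte value uses exactly 64 bits of markers, while a 16-byte
value would not fit and would need markers 1-16 plus 0 and 0xff, i.e. the 18
values times 16 positions mentioned above.

#include <stdint.h>
#include <stdio.h>

/* Sketch of the byte-tracking scheme: marker 0 = known zero,
   1..8 = 1-based index of the originating byte, 0xff = unknown.  */
#define MARKER_UNKNOWN 0xff

/* Marker word for an untouched 8-byte load.  */
static uint64_t
identity_markers (void)
{
  uint64_t n = 0;
  for (int i = 0; i < 8; i++)
    n |= (uint64_t) (i + 1) << (8 * i);
  return n;                               /* 0x0807060504030201 */
}

/* Marker word after a full byte swap of the 8-byte value.  */
static uint64_t
bswap_markers (void)
{
  uint64_t n = 0;
  for (int i = 0; i < 8; i++)
    n |= (uint64_t) (8 - i) << (8 * i);
  return n;                               /* 0x0102030405060708 */
}

int
main (void)
{
  printf ("identity: %016llx\n", (unsigned long long) identity_markers ());
  printf ("bswap:    %016llx\n", (unsigned long long) bswap_markers ());
  return 0;
}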

Note, even:
unsigned long long
bswap (unsigned long long a)
{
  unsigned int x[2];
  __builtin_memcpy (x, &a, 8);
  unsigned int y[2];
  y[0] = __builtin_bswap32 (x[1]);
  y[1] = __builtin_bswap32 (x[0]);
  __builtin_memcpy (&a, y, 8);
  return a;
}

unsigned long long
bswap2 (unsigned long long a)
{
  return __builtin_bswap64 (a);
}
emits better code in the latter function than in the former; store merging
isn't able to handle even that.
So if we want to handle it in store-merging, we should start with handling
  _8 = BIT_FIELD_REF ;
  _1 = __builtin_bswap32 (_8);
  _10 = (unsigned int) a_3(D);
  _2 = __builtin_bswap32 (_10);
  _11 = {_1, _2};
  _5 = VIEW_CONVERT_EXPR(_11);
and
  _8 = BIT_FIELD_REF ;
  _1 = __builtin_bswap32 (_8);
  y[0] = _1;
  _10 = (unsigned int) a_3(D);
  _2 = __builtin_bswap32 (_10);
  y[1] = _2;
  _7 = MEM  [(char * {ref-all})];
and only once that is handled try
  _8 = BIT_FIELD_REF ;
  _1 = __builtin_bswap64 (_8);
  _10 = (long long unsigned int) a_3(D);
  _2 = __builtin_bswap64 (_10);
  _11 = {_1, _2};
  _5 = VIEW_CONVERT_EXPR(_11);
Doesn't look like stage4 material though.

So in the meantime perhaps some other improvements can be made.
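
For reference, C source along the following lines (an assumption about the
shape of the testcase, reconstructed from the GIMPLE at the top of this
comment; the function name is made up) is what the bswap/store-merging handling
above would have to recognize and ideally turn into the single
__builtin_bswap128 call shown earlier:

/* Presumed shape of the regressed testcase: a 128-bit byteswap written
   as two 64-bit byteswaps plus memcpy.  */
__uint128_t
bswap128 (__uint128_t a)
{
  unsigned long long x[2], y[2];
  __builtin_memcpy (x, &a, 16);
  y[0] = __builtin_bswap64 (x[1]);
  y[1] = __builtin_bswap64 (x[0]);
  __builtin_memcpy (&a, y, 16);
  return a;
}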

[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap

2022-01-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #10 from Uroš Bizjak  ---
(In reply to Hongtao.liu from comment #4)
> Also there's a separate issue: codegen for the below is not optimal
> gimple:
> _11 = VIEW_CONVERT_EXPR(a_3(D))
> asm:
> mov QWORD PTR [rsp-24], rdi
> mov QWORD PTR [rsp-16], rsi
> movdqa  xmm0, XMMWORD PTR [rsp-24]
> 
> 
> I think this issue has been recorded in several existing PRs.

Maybe this can be solved with secondary_reload when GPR and XMM regs are
involved.

[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap

2022-01-21 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #9 from Richard Biener  ---
(In reply to Richard Biener from comment #8)
> (In reply to rsand...@gcc.gnu.org from comment #7)
> > (In reply to Richard Biener from comment #6)
> > > Richard - I'm sure we can construct a similar case for aarch64 where
> > > argument passing and vector mode use cause spilling?
> > > 
> > > On x86 the simplest testcase showing this is
> > > 
> > > typedef unsigned long long v2di __attribute__((vector_size(16)));
> > > v2di bswap(__uint128_t a)
> > > {
> > > return *(v2di *)&a;
> > > }
> > > 
> > > that produces
> > > 
> > > bswap:
> > > .LFB0:
> > > .cfi_startproc
> > > sub sp, sp, #16
> > > .cfi_def_cfa_offset 16
> > > stp x0, x1, [sp]
> > > ldr q0, [sp]
> > > add sp, sp, 16
> > > .cfi_def_cfa_offset 0
> > > ret
> > > 
> > > on arm for me.  Maybe the stp x0, x1 store can forward to the ldr load
> > > though and I'm not sure there's another way to move x0/x1 to q0.
> > It looks like this is a deliberate choice for aarch64.  The generic
> > costing has:
> > 
> >   /* Avoid the use of slow int<->fp moves for spilling by setting
> >  their cost higher than memmov_cost.  */
> >   5, /* GP2FP  */
> > 
> > So in cases like the above, we're telling IRA that spilling to
> > memory and reloading is cheaper than moving between registers.
> > For -mtune=thunderx we generate:
> > 
> > fmov    d0, x0
> > ins v0.d[1], x1
> > ret
> > 
> > instead.
> 
> Ah, interesting.  On x86 we disallow/pessimize GPR<->XMM moves with
> some tunings as well; still, a sequence like
> 
>    movq    %rdi, -24(%rsp)
>    movq    %rsi, -16(%rsp)
>    movq    -24(%rsp), %xmm0
>    movq    -16(%rsp), %xmm1
>    unpckhpd  %xmm0, %xmm1   (fixme - that's wrong, but you get the idea)
> 
> instead of
> 
> movq    %rdi, -24(%rsp)
> movq    %rsi, -16(%rsp)
> movdqa  -24(%rsp), %xmm0
> 
> would likely be faster.  Not sure if one can get LRA to produce this
> two-stage reload with just appropriate costing though.  As said, the
> key cost of the bad sequence is the failed store forwarding, so it's
> specific to spilling a two-GPR TImode and reloading it as a single
> FPR V2DImode.

And a specialty of aarch64 seems to be that it has the argument
passed in (reg:TI x0), which supposedly is a register pair.  On x86
there are no TImode register-pair registers I think; instead the
__int128 is passed as two 8-byte halves in regular GPRs.  So on aarch64
we have the simpler

(insn 13 3 10 2 (set (reg:TI 95)
(reg:TI 0 x0 [ a ])) "t.ii":3:2 58 {*movti_aarch64}
 (expr_list:REG_DEAD (reg:TI 0 x0 [ a ])
(nil)))
(insn 10 13 11 2 (set (reg/i:V2DI 32 v0)
(subreg:V2DI (reg:TI 95) 0)) "t.ii":5:2 1173 {*aarch64_simd_movv2di}
 (expr_list:REG_DEAD (reg:TI 95)
(nil)))

before RA.

[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap

2022-01-21 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #8 from Richard Biener  ---
(In reply to rsand...@gcc.gnu.org from comment #7)
> (In reply to Richard Biener from comment #6)
> > Richard - I'm sure we can construct a similar case for aarch64 where
> > argument passing and vector mode use cause spilling?
> > 
> > On x86 the simplest testcase showing this is
> > 
> > typedef unsigned long long v2di __attribute__((vector_size(16)));
> > v2di bswap(__uint128_t a)
> > {
> > return *(v2di *)&a;
> > }
> > 
> > that produces
> > 
> > bswap:
> > .LFB0:
> > .cfi_startproc
> > sub sp, sp, #16
> > .cfi_def_cfa_offset 16
> > stp x0, x1, [sp]
> > ldr q0, [sp]
> > add sp, sp, 16
> > .cfi_def_cfa_offset 0
> > ret
> > 
> > on arm for me.  Maybe the stp x0, x1 store can forward to the ldr load
> > though and I'm not sure there's another way to move x0/x1 to q0.
> It looks like this is a deliberate choice for aarch64.  The generic
> costing has:
> 
>   /* Avoid the use of slow int<->fp moves for spilling by setting
>  their cost higher than memmov_cost.  */
>   5, /* GP2FP  */
> 
> So in cases like the above, we're telling IRA that spilling to
> memory and reloading is cheaper than moving between registers.
> For -mtune=thunderx we generate:
> 
> fmov    d0, x0
> ins v0.d[1], x1
> ret
> 
> instead.

Ah, interesting.  On x86 we disallow/pessimize GPR<->XMM moves with
some tunings as well; still, a sequence like

   movq    %rdi, -24(%rsp)
   movq    %rsi, -16(%rsp)
   movq    -24(%rsp), %xmm0
   movq    -16(%rsp), %xmm1
   unpckhpd  %xmm0, %xmm1   (fixme - that's wrong, but you get the idea)

instead of

movq    %rdi, -24(%rsp)
movq    %rsi, -16(%rsp)
movdqa  -24(%rsp), %xmm0

would likely be faster.  Not sure if one can get LRA to produce this
two-stage reload with just appropriate costing though.  As said, the
key cost of the bad sequence is the failed store forwarding, so it's
specific to spilling a two-GPR TImode and reloading it as a single
FPR V2DImode.
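
For comparison, building the vector with explicit GPR->XMM moves can be
expressed at the source level with SSE2 intrinsics.  This is only a sketch of
the sequence being discussed, not a proposed fix; the function name to_vec is
made up, and whether the direct moves are a win depends on the tuning, as
noted above.

#include <emmintrin.h>

typedef unsigned long long v2di __attribute__ ((vector_size (16)));

/* Hand-written variant of the testcase: move both __int128 halves
   directly into an XMM register instead of spilling to the stack.
   With plain -O2 (SSE2) GCC normally emits two movq GPR->XMM moves
   plus a punpcklqdq here, avoiding the failed store forwarding.  */
v2di
to_vec (__uint128_t a)
{
  unsigned long long lo = (unsigned long long) a;
  unsigned long long hi = (unsigned long long) (a >> 64);
  return (v2di) _mm_set_epi64x ((long long) hi, (long long) lo);
}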

[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap

2022-01-21 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #7 from rsandifo at gcc dot gnu.org  
---
(In reply to Richard Biener from comment #6)
> Richard - I'm sure we can construct a similar case for aarch64 where
> argument passing and vector mode use cause spilling?
> 
> On x86 the simplest testcase showing this is
> 
> typedef unsigned long long v2di __attribute__((vector_size(16)));
> v2di bswap(__uint128_t a)
> {
> return *(v2di *)&a;
> }
> 
> that produces
> 
> bswap:
> .LFB0:
> .cfi_startproc
> sub sp, sp, #16
> .cfi_def_cfa_offset 16
> stp x0, x1, [sp]
> ldr q0, [sp]
> add sp, sp, 16
> .cfi_def_cfa_offset 0
> ret
> 
> on arm for me.  Maybe the stp x0, x1 store can forward to the ldr load
> though and I'm not sure there's another way to move x0/x1 to q0.
It looks like this is a deliberate choice for aarch64.  The generic
costing has:

  /* Avoid the use of slow int<->fp moves for spilling by setting
 their cost higher than memmov_cost.  */
  5, /* GP2FP  */

So in cases like the above, we're telling IRA that spilling to
memory and reloading is cheaper than moving between registers.
For -mtune=thunderx we generate:

fmov    d0, x0
ins v0.d[1], x1
ret

instead.
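
For context, the quoted line comes from a register-move cost table shaped
roughly like the following.  This is a sketch from memory of aarch64's
cpu_regmove_cost / generic_regmove_cost; only the GP2FP value of 5 and its
comment are taken from the quote above, so treat the other numbers as
illustrative.

/* Sketch of aarch64's register-move cost table.  */
struct cpu_regmove_cost
{
  const int GP2GP;   /* GPR to GPR */
  const int GP2FP;   /* GPR to FP/SIMD register */
  const int FP2GP;   /* FP/SIMD register to GPR */
  const int FP2FP;   /* FP/SIMD to FP/SIMD */
};

static const struct cpu_regmove_cost generic_regmove_cost =
{
  1, /* GP2GP  (illustrative) */
  /* Avoid the use of slow int<->fp moves for spilling by setting
     their cost higher than memmov_cost.  */
  5, /* GP2FP  */
  5, /* FP2GP  (illustrative) */
  2  /* FP2FP  (illustrative) */
};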

[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap

2022-01-21 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

Richard Biener  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org,
   ||vmakarov at gcc dot gnu.org
   Priority|P3  |P2
 Target||x86_64-*-*

--- Comment #6 from Richard Biener  ---
With just SSE2 we get only the store vectorized:

bswap:
.LFB0:
.cfi_startproc
bswap   %rsi
bswap   %rdi
movq    %rsi, %xmm0
movq    %rdi, %xmm1
punpcklqdq  %xmm1, %xmm0
movaps  %xmm0, -24(%rsp)
movq    -24(%rsp), %rax
movq    -16(%rsp), %rdx
ret

the

 1 times vec_perm costs 4 in body
BIT_FIELD_REF  1 times scalar_stmt costs 4 in body
BIT_FIELD_REF  1 times scalar_stmt costs 4 in body

costs are what we cost building the initial vector from __int128
compared to splitting that into a low/high part.

   [local count: 1073741824]:
  _8 = BIT_FIELD_REF ;
  _11 = VIEW_CONVERT_EXPR(a_3(D));
  _13 = VIEW_CONVERT_EXPR(a_3(D));
  _12 = VEC_PERM_EXPR <_11, _13, { 1, 0 }>;
  _14 = VIEW_CONVERT_EXPR(_12);
  _15 = VEC_PERM_EXPR <_14, _14, { 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11,
10, 9, 8 }>;
  _16 = VIEW_CONVERT_EXPR(_15);
  _1 = __builtin_bswap64 (_8);
  _10 = BIT_FIELD_REF ;
  _2 = __builtin_bswap64 (_10);
  MEM  [(long long unsigned int *)] = _16;
  _7 = MEM  [(char * {ref-all})];

doesn't realize that it can maybe move the hi/lo swap across the two
permutes to before the store, otherwise it looks as expected.

Yes, the vectorizer doesn't account for ABI details on the function boundary
but it's very hard to do that in a sensible way.

Practically the worst part of the generated code is

movq    %rdi, -24(%rsp)
movq    %rsi, -16(%rsp)
movdqa  -24(%rsp), %xmm0

because the store will fail to forward, causing a huge performance issue.
I wonder why we fail to merge those.  We face

(insn 26 25 27 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 0)
(reg:DI 97)) "t.c":2:1 80 {*movdi_internal}
 (expr_list:REG_DEAD (reg:DI 97)
(nil)))
(insn 27 26 7 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 8)
(reg:DI 98)) "t.c":2:1 80 {*movdi_internal}
 (expr_list:REG_DEAD (reg:DI 98)
(nil)))
(note 7 27 12 2 NOTE_INSN_FUNCTION_BEG)
(insn 12 7 14 2 (set (reg:V2DI 91)
(vec_select:V2DI (subreg:V2DI (reg/v:TI 87 [ a ]) 0)
(parallel [
(const_int 1 [0x1])
(const_int 0 [0])
]))) "t.c":6:12 7927 {*ssse3_palignrv2di_perm}
 (expr_list:REG_DEAD (reg/v:TI 87 [ a ])
(nil)))

where

Trying 26, 27 -> 12:
   26: r87:TI#0=r97:DI
  REG_DEAD r97:DI
   27: r87:TI#8=r98:DI
  REG_DEAD r98:DI
   12: r91:V2DI=vec_select(r87:TI#0,parallel)
  REG_DEAD r87:TI
Can't combine i2 into i3

possibly because 27 is a partial def of r87.  We expand to

(insn 4 3 5 2 (set (reg:TI 88)
(subreg:TI (reg:DI 89) 0)) "t.c":2:1 -1
 (nil))
(insn 5 4 6 2 (set (subreg:DI (reg:TI 88) 8)
(reg:DI 90)) "t.c":2:1 -1
 (nil))
(insn 6 5 7 2 (set (reg/v:TI 87 [ a ])
(reg:TI 88)) "t.c":2:1 -1
 (nil))
(note 7 6 10 2 NOTE_INSN_FUNCTION_BEG)
(insn 10 7 12 2 (set (reg:V2DI 82 [ _11 ])
(subreg:V2DI (reg/v:TI 87 [ a ]) 0)) -1
 (nil))
(insn 12 10 13 2 (set (reg:V2DI 91)
(vec_select:V2DI (reg:V2DI 82 [ _11 ])
(parallel [
(const_int 1 [0x1])
(const_int 0 [0])
]))) "t.c":6:12 -1
 (nil))

initially from

  _11 = VIEW_CONVERT_EXPR(a_3(D));

and fwprop1 still sees

(insn 26 25 27 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 0)
(reg:DI 89 [ a ])) "t.c":2:1 80 {*movdi_internal}
 (expr_list:REG_DEAD (reg:DI 95 [ a ])
(nil)))
(insn 27 26 7 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 8)
(reg:DI 90 [ a+8 ])) "t.c":2:1 80 {*movdi_internal}
 (expr_list:REG_DEAD (reg:DI 96 [+8 ])
(nil)))
(note 7 27 10 2 NOTE_INSN_FUNCTION_BEG) 
(insn 10 7 12 2 (set (reg:V2DI 82 [ _11 ])
(subreg:V2DI (reg/v:TI 87 [ a ]) 0)) 1700 {movv2di_internal}
 (expr_list:REG_DEAD (reg/v:TI 87 [ a ])
(nil)))

so that would be the best place to fix this up, realizing reg 87 dies
after insn 10.

Richard - I'm sure we can construct a similar case for aarch64 where
argument passing and vector mode use cause spilling?

On x86 the simplest testcase showing this is

typedef unsigned long long v2di __attribute__((vector_size(16)));
v2di bswap(__uint128_t a)
{
return *(v2di *)&a;
}

that produces

bswap:
.LFB0:
.cfi_startproc
sub sp, sp, #16
.cfi_def_cfa_offset 16
stp x0, x1, [sp]
ldr q0, [sp]
add sp, sp, 16
.cfi_def_cfa_offset 0
ret

on arm for me.  Maybe the stp x0, x1 store can forward to the ldr load
though and I'm not sure there's another way to move x0/x1 to q0.

[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap

2022-01-20 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #5 from Hongtao.liu  ---
> 1. According to the ABI, uint128 is passed in 2 GPRs, so there should be extra
> cost for _11 = VIEW_CONVERT_EXPR(a_3(D));

And no scalar_cost is needed for
BIT_FIELD_REF  1 times scalar_stmt costs 4 in body
BIT_FIELD_REF  1 times scalar_stmt costs 4 in body

[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap

2022-01-20 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #4 from Hongtao.liu  ---
Also there's a separate issue: codegen for the below is not optimal
gimple:
_11 = VIEW_CONVERT_EXPR(a_3(D))
asm:
mov QWORD PTR [rsp-24], rdi
mov QWORD PTR [rsp-16], rsi
movdqa  xmm0, XMMWORD PTR [rsp-24]


I think this issue has been recorded in several existing PRs.

[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap

2022-01-20 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #3 from Hongtao.liu  ---
172  _1 1 times scalar_store costs 12 in body
173  _2 1 times scalar_store costs 12 in body
174  __builtin_bswap64 (_8) 1 times scalar_stmt costs 4 in body
175  __builtin_bswap64 (_10) 1 times scalar_stmt costs 4 in body
176  BIT_FIELD_REF  1 times scalar_stmt costs 4 in body
177  BIT_FIELD_REF  1 times scalar_stmt costs 4 in body
178  1 1 times vec_perm costs 4 in body
179  __builtin_bswap64 (_8) 1 times vector_stmt costs 4 in prologue
180  __builtin_bswap64 (_8) 1 times vec_perm costs 4 in body
181  _1 1 times vector_store costs 16 in body
182  test.cc:9:10: note: Cost model analysis for part in loop 0:
183  Vector cost: 28
184  Scalar cost: 40
...
243   [local count: 1073741824]:
244  _8 = BIT_FIELD_REF ;
245  _11 = VIEW_CONVERT_EXPR(a_3(D));
246  _13 = VIEW_CONVERT_EXPR(a_3(D));
247  _12 = VEC_PERM_EXPR <_11, _13, { 1, 0 }>;
248  _14 = VIEW_CONVERT_EXPR(_12);
249  _15 = VEC_PERM_EXPR <_14, _14, { 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12,
11, 10, 9, 8 }>;
250  _16 = VIEW_CONVERT_EXPR(_15);
251  _1 = __builtin_bswap64 (_8);
252  _10 = BIT_FIELD_REF ;
253  _2 = __builtin_bswap64 (_10);
254  MEM  [(long unsigned int *)] = _16;
255  _7 = MEM  [(char * {ref-all})];


1. According to the ABI, uint128 is passed in 2 GPRs, so there should be extra
cost for _11 = VIEW_CONVERT_EXPR(a_3(D));
2. Why does line 178 say "1 times vec_perm costs 4 in body"; should it be 2
times vec_perm costs?

[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap

2022-01-20 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #2 from Hongtao.liu  ---
With -fno-tree-vectorize, GCC also produces optimal code:

mov rax, rsi
mov rdx, rdi
bswap   rax
bswap   rdx
ret

Guess it's related to the vectorizer cost model.

[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap

2022-01-20 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

Andrew Pinski  changed:

   What|Removed |Added

   Last reconfirmed||2022-01-20
 Status|UNCONFIRMED |NEW
Summary|x86: excessive code |[9/10/11/12 Regression]
   |generated for 128-bit   |x86: excessive code
   |byteswap|generated for 128-bit
   ||byteswap
 Ever confirmed|0   |1
 Blocks||101926
   Target Milestone|--- |12.0
  Known to work||6.1.0
  Component|target  |middle-end
   Keywords||missed-optimization

--- Comment #1 from Andrew Pinski  ---
GCC 11 and before does this at -O2 (GCC 6 and before could do it for -O3 too):

mov rax, rsi
mov rdx, rdi
bswap   rax
bswap   rdx

The reason is that the SLP vectorizer is turned on at -O2 in GCC 12.
We get:

  _11 = {_1, _2};
  _5 = VIEW_CONVERT_EXPR(_11);


The expansion of this could be done using move instructions.
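
A source-level sketch of what such a move-based expansion corresponds to
(assumed equivalent only, not the actual expander change; the function name is
made up): keep both bswapped halves in GPRs and rebuild the __int128 with
shifts, which GCC compiles to essentially the two-bswap sequence shown above.

/* {_1, _2} viewed as __int128, spelled with integer ops only.  */
__uint128_t
bswap128_moves (__uint128_t a)
{
  unsigned long long lo = (unsigned long long) a;
  unsigned long long hi = (unsigned long long) (a >> 64);
  return ((__uint128_t) __builtin_bswap64 (lo) << 64)
         | __builtin_bswap64 (hi);
}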

I notice for aarch64, SLP kicks in even more and does the following:

fmov    d0, x0
fmov    v0.d[1], x1
ext     v0.16b, v0.16b, v0.16b, #8
rev64   v0.16b, v0.16b
umov    x0, v0.d[0]
umov    x1, v0.d[1]

This is even true for -O2 -mavx too:

mov QWORD PTR [rsp-24], rdi
mov QWORD PTR [rsp-16], rsi
vmovdqa xmm1, XMMWORD PTR [rsp-24]
vpalignr xmm0, xmm1, xmm1, 8
vpshufb xmm2, xmm0, XMMWORD PTR .LC0[rip]
vmovdqa XMMWORD PTR [rsp-24], xmm2
mov rax, QWORD PTR [rsp-24]
mov rdx, QWORD PTR [rsp-16]

There are so many different little regressions when handling this code, it
seems.
But I think it all comes down to the modeling of arguments and return values at
the gimple level, which is what breaks this.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101926
[Bug 101926] [meta-bug] struct/complex argument passing and return should be
improved