On 05/09/18 14:55, Denis Khalikov wrote:
> Hi Wilco,
> thanks for the answer.
> 
>> Adding support for a frame chain would require an ABI change. It would
> have to
>> work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of
>> effort.
> 
> Clang already works that way.


> Please look at this commit:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/ARM/ARMCallingConv.td?r1=269459&r2=269458&pathrev=269459
> 
> 
> This is an example of code which clang generates
> with -mthumb -fno-omit-frame-pointer -O2.
> 
>  @ %bb.0:
>          push    {r4, r5, r6, r7, lr}
>          add     r7, sp, #12
>          push.w  {r8, r9, r10}
>          sub     sp, #56
>          mov     r8, r0
>          movw    r0, :lower16:_ZTVN6sensor14sensor_handlerE
>          movt    r0, :upper16:_ZTVN6sensor14sensor_handlerE
>          mov     r9, r8
>          adds    r0, #8
>          str     r0, [r9], #4
> 
> The only difference is clang sets frame pointer to the r7 on the stack
> instead gcc sets to lr,
> but it already handles by the Sanitizers.

Then Clang is broken.  You can't have different frame pointers in Arm
and Thumb code.

I object to another hack going in for another ill-specified frame
pointer variant until such time as the ABI is updated to sort this out
properly.  The frame handling code is just too critical to overall
performance and the prologue and epilogue code itself is also quite
fragile in this respect.  Adding more mess on top of that is going to
make sorting this out even more tricky: and once a patch like this goes
on it's not easy to remove it again.

So until the ABI sanctions a proper inter-function frame chain record,
GCC will only support local use of the frame pointer and no chaining.

R.

> 
>>
>> So this sounds like the first thing to do is reducing the size of the
> stack traces.
>> The default is 30 which is far larger than useful. Using 1 for example
> should
>> always be fast (since you can use __builtin_return_address(0)) and
> still get
>> the function that called malloc. Also if the unwinder happens to be
> too slow,
>> it should be optimized and caching added etc.
>>
> 
> If we change the size of the traces to 2, it could be something like this:
> 
> 0xb42a50a0 is located 0 bytes inside of 88-byte region
> [0xb42a50a0,0xb42a50f8)
> freed by thread T0 here:
>     #0 0xb6a35cc7 in __interceptor_free (/usr/lib/libasan.so+0x127cc7)
>     #1 0xb5fa64e3 in ipc::message::unref()
> (/lib/libsensord-shared.so+0xe4e3)
> 
> previously allocated by thread T0 here:
>     #0 0xb6a36157 in malloc (/usr/lib/libasan.so+0x128157)
>     #1 0xb2f8852d in accel_device::get_data(unsigned int,
> sensor_data_t**, int*) (/usr/lib/sensor/libsensor-hal-tm1.so+0x1052d)
> 
> Instead this:
> 
> 0xb42250a0 is located 0 bytes inside of 88-byte region
> [0xb42250a0,0xb42250f8)
> freed by thread T0 here:
>     #0 0xb6989cff in __interceptor_free (/usr/lib/libasan.so+0x127cff)
>     #1 0xb5fa64e3 in ipc::message::unref()
> (/lib/libsensord-shared.so+0xe4e3)
>     #2 0xb6efbf47 in sensor::sensor_handler::notify(char const*,
> sensor_data_t*, int) (/usr/bin/sensord+0x1ef47)
>     #3 0xb6efad43 in sensor::sensor_event_handler::handle(int, unsigned
> int) (/usr/bin/sensord+0x1dd43)
>     #4 0xb5fa3bbb  (/lib/libsensord-shared.so+0xbbbb)
>     #5 0xb62d9a15 in g_main_context_dispatch
> (/lib/libglib-2.0.so.0+0x91a15)
>     #6 0xb62da2d9  (/lib/libglib-2.0.so.0+0x922d9)
>     #7 0xb62da9a9 in g_main_loop_run (/lib/libglib-2.0.so.0+0x929a9)
>     #8 0xb5fa4e1b in ipc::event_loop::run(int)
> (/lib/libsensord-shared.so+0xce1b)
>     #9 0xb6eec9a5 in main (/usr/bin/sensord+0xf9a5)
>     #10 0xb5cb663b in __libc_start_main (/lib/libc.so.6+0x1663b)
> 
> previously allocated by thread T0 here:
>     #0 0xb698a18f in malloc (/usr/lib/libasan.so+0x12818f)
>     #1 0xb2f8852d in accel_device::get_data(unsigned int,
> sensor_data_t**, int*) (/usr/lib/sensor/libsensor-hal-tm1.so+0x1052d)
>     #2 0xb6ef848f in
> sensor::physical_sensor_handler::get_data(sensor_data_t**, int*)
> (/usr/bin/sensord+0x1b48f)
>     #3 0xb6efaa51 in sensor::sensor_event_handler::handle(int, unsigned
> int) (/usr/bin/sensord+0x1da51)
>     #4 0xb5fa3bbb  (/lib/libsensord-shared.so+0xbbbb)
>     #5 0xb62d9a15 in g_main_context_dispatch
> (/lib/libglib-2.0.so.0+0x91a15)
>     #6 0xb62da2d9  (/lib/libglib-2.0.so.0+0x922d9)
>     #7 0xb62da9a9 in g_main_loop_run (/lib/libglib-2.0.so.0+0x929a9)
>     #8 0xb5fa4e1b in ipc::event_loop::run(int)
> (/lib/libsensord-shared.so+0xce1b)
>     #9 0xb6eec9a5 in main (/usr/bin/sensord+0xf9a5)
>     #10 0xb5cb663b in __libc_start_main (/lib/libc.so.6+0x1663b)
> 
> 
> At the first example we lost the full context, from where the
> control/data flow comes from.
> 
>>
>> The issue is that the frame pointer and frame chain always add a large
>> overhead even when you do not use any sanitizers. This is especially bad
>> for the proposed patch - you lose much of the benefit of using Thumb-2...
>>
> 
> The stack layout like this enables only with compile time flag
> (-mthumb-fp and works only together with -mthumb and
> -fno-omit-frame-pointer). It does not affect other codegen.
> 
> Thanks.
> 
> On 09/05/2018 03:11 PM, Wilco Dijkstra wrote:
>> Hi Denis,
>>
>>> We are working on applying Address/LeakSanitizer for the full Tizen OS
>>> distribution. It's about ~1000 packages, ASan/LSan runtime is installed
>>> to ld.so.preload. As we know ASan/LSan has interceptors for
>>> allocators/deallocators such as (malloc/realloc/calloc/free) and so on.
>>> On every allocation from user space program, ASan calls
>>> GET_STACK_TRACE_MALLOC;
>>> which unwinds the stack frame, and by default uses frame based stack
>>> unwinder. So, it requires to build with "-fno-omit-frame-pointer",
>>> switching it to default unwinder really hits the performance in our
>>> case.
>>
>> So this sounds like the first thing to do is reducing the size of the
>> stack traces.
>> The default is 30 which is far larger than useful. Using 1 for example
>> should
>> always be fast (since you can use __builtin_return_address(0)) and
>> still get
>> the function that called malloc. Also if the unwinder happens to be
>> too slow,
>> it should be optimized and caching added etc.
>>
>>>> Doing real unwinding is also far more accurate than frame pointer based
>>>> unwinding (the latter doesn't handle leaf functions correctly,
>>> entry/exit in
>>>> non-leaf functions and shrinkwrapped functions - and this breaks
>>> callgraph
>>>> profiling).
>>>
>>> I agree, but in our case, all interceptors for allocators are
>>> leaf functions, so the frame based stack unwinder works well for us.
>>
>> Yes a frame chain would work for this case. But it's not currently
>> supported.
>>
>>> By default we build packages with ("-marm" "-fno-omit-frame-pointer"),
>>> because need frame based stack unwinder for every allocation, as I said
>>> before. As we know GCC sets fp to lr on the stack with
>>> ("-fno-omit-frame-pointer" and "-marm") and I don't really know why.
>>> But the binary size is bigger than for thumb, so, we cannot use default
>>> thumb frame pointer and want to reduce binary size for the full
>>> sanitized image.
>>
>> The issue is that the frame pointer and frame chain always add a large
>> overhead even when you do not use any sanitizers. This is especially bad
>> for the proposed patch - you lose much of the benefit of using Thumb-2...
>>
>> Using normal unwinding means your code runs at full speed and still
>> can be
>> used by the sanitizer.
>>
>>> In other case clang works the same way, as I offered at the patch.
>>> It has the same issue, but it was fixed at the end of 2017
>>> https://bugs.llvm.org/show_bug.cgi?id=18505 (The topics starts from
>>> discussion about APCS, but it is not the main point.)
>>>
>>> Also, unresolved issue related to this
>>> https://github.com/google/sanitizers/issues/640
>>
>> Adding support for a frame chain would require an ABI change. It would
>> have to
>> work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of
>> effort.
>>
>> Wilco
>>
>>

Reply via email to