Re: [gem5-users] one cpu keeps executing "@flush_tlb_others+133"

Zehan Cui via gem5-users Mon, 26 Jan 2015 17:34:58 -0800

Thanks. So I'd better switch to another ISA.

Zehan


On Mon, Jan 26, 2015 at 4:30 PM, Andreas Hansson <andreas.hans...@arm.com>
wrote:

>  Hi Zehan,
>
>  There are not too many people using X86 full-system (based on what I’ve
> seen at least), and as such it is not very well tested. I think it’s fair
> to say that ARM is the most well-tested ISA, especially in full-system. ARM
> full-system also supports recent Linux kernels (
> http://www.gem5.org/Running_gem5#Experimenting_with_DVFS).
>
>  Andreas
>
>   From: Zehan Cui <zehan....@gmail.com>
> Date: Monday, 26 January 2015 02:36
> To: Andreas Hansson <andreas.hans...@arm.com>
> Cc: gem5 users mailing list <gem5-users@gem5.org>
> Subject: Re: [gem5-users] one cpu keeps executing "@flush_tlb_others+133"
>
>  By the way, I'm using the X86 ISA.
>
> On Mon, Jan 26, 2015 at 10:34 AM, Zehan Cui <zehan....@gmail.com> wrote:
>
>> Hi Andreas,
>>
>>  This problem still bothers me, and I have found more related problems.
>>
>>  The above problem is that the O3 CPU is stuck in such a loop (for at
>> least 1 billion instructions) while doing page_fault():
>>
>>  8007580397602: system.switch_cpus_17 T0 : @flush_tlb_others+133    : NOP                      : IntAlu :
>> 8007580397602: system.switch_cpus_17 T0 : @flush_tlb_others+135    : cmp    DS:[rbp], 0
>> 8007580397602: system.switch_cpus_17 T0 : @flush_tlb_others+139    : jnz    0xfffffffffffffff8
>>
>>  I've already taken checkpoints before the region of interest, and tried
>> to initialize all objects before that. But there are still page faults
>> during execution.
>>
>>  For another application without page faults, the O3 CPU is stuck in the
>> following loop while doing omp_unset_lock():
>>
>>  8418495964433: system.switch_cpus_14 T0 : @_spin_lock+5    :   NOP                  : IntAlu :
>> 8418495964433: system.switch_cpus_14 T0 : @_spin_lock+7    : cmp  DS:[rdi], 0
>> 8418495964433: system.switch_cpus_14 T0 : @_spin_lock+10    : jle 0xfffffffffffffff9
>>
>>  Then I tried to boot Linux on a 4-core system using timing CPUs.
>> However, CPU0 is also stuck in a loop for at least 195,498,501,784,500
>> ticks:
>>
>>  197494667971500: system.cpu0 T0 : @__smp_call_function+160    :   NOP                    : IntAlu :
>> 197494668076500: system.cpu0 T0 : @__smp_call_function+162    : cmp rbx, DS:[rsp + 0x14]
>> 197494668115500: system.cpu0 T0 : @__smp_call_function+166    : jnz 0xfffffffffffffff8
>>
>>
>>  The other CPUs remain idle except for processing apic_timer_interrupt()
>> every 4 ms. The terminal stops at:
>>
>>  Booting processor 1/4 APIC 0x1
>> Initializing CPU#1
>> Calibrating delay loop (skipped)... 3999.96 BogoMIPS preset
>> CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
>> CPU: L2 Cache: 1024K (64 bytes/line)
>> Fake M5 x86_64 CPU stepping 01
>> Booting processor 2/4 APIC 0x2
>> Initializing CPU#2
>> Calibrating delay loop (skipped)... 3999.96 BogoMIPS preset
>> CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
>> CPU: L2 Cache: 1024K (64 bytes/line)
>> Fake M5 x86_64 CPU stepping 01
>> Booting processor 3/4 APIC 0x3
>> Initializing CPU#3
>> Calibrating delay loop (skipped)... 3999.96 BogoMIPS preset
>> CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
>> CPU: L2 Cache: 1024K (64 bytes/line)
>> Fake M5 x86_64 CPU stepping 01
>> Brought up 4 CPUs
>> migration_cost=185
>>
>>
>>  I have tried versions e5936c2d53a0 and a0cb57e1c072, and both show
>> these problems.
>>
>>  Do you have any idea?
>>
>>  thanks,
>> Zehan
>>
>> On Tue, Jan 20, 2015 at 5:01 PM, Zehan Cui <zehan....@gmail.com> wrote:
>>
>>> Hi Andreas,
>>>
>>>  The atomic CPU and o3 CPU do execute the same instructions, except
>>> that the atomic CPU exits the above loop soon, while the o3 CPU is stuck in
>>> the loop for at least one billion instructions. Last mail only showed the
>>> stuck loop instructions. The whole instruction sequence contains:
>>> @page_fault
>>> @error_entry
>>> @do_page_fault
>>> @find_vma
>>> @__handle_mm_fault
>>> @filemap_nopage
>>> @find_get_page
>>> ...
>>> This instruction sequence looks like page-fault handling.
>>>
>>>  There is a static array in the source code without initialization,
>>> which may cause the page fault. I'll initialize the array before the
>>> checkpoint and see what happens. But it's still strange that the o3 CPU
>>> cannot exit the loop for such a long time.
>>>
>>>  Thanks,
>>> Zehan
>>>
>>>
>>> On Tue, Jan 20, 2015 at 4:40 PM, Andreas Hansson <
>>> andreas.hans...@arm.com> wrote:
>>>
>>>>  Hi Zehan,
>>>>
>>>>  The o3 CPU will invariably take roughly 5-10x as long due to the
>>>> level of detail. Are you suggesting the atomic CPU and the o3 CPU are not
>>>> executing the same instructions?
>>>>
>>>>  Typically in these cases you want to drop a checkpoint before the
>>>> region of interest.
>>>>
>>>>  Andreas
>>>>
>>>>   From: Zehan Cui via gem5-users <gem5-users@gem5.org>
>>>> Reply-To: Zehan Cui <zehan....@gmail.com>, gem5 users mailing list <
>>>> gem5-users@gem5.org>
>>>> Date: Tuesday, 20 January 2015 02:14
>>>> To: gem5-users <gem5-users@gem5.org>
>>>> Subject: [gem5-users] one cpu keeps executing "@flush_tlb_others+133"
>>>>
>>>>  Hi all,
>>>>
>>>>  I ran a multi-threaded application in full-system mode with the detailed
>>>> CPU model. I extracted the instruction traces of each CPU and found that the
>>>> last CPU keeps executing instructions like the following for at least 1
>>>> billion instructions (max_instructions is set to 1 billion).
>>>>
>>>>   8007580397289: system.switch_cpus_17 T0 : @flush_tlb_others+133    : NOP                      : IntAlu :
>>>> 8007580397289: system.switch_cpus_17 T0 : @flush_tlb_others+135    : cmp    DS:[rbp], 0
>>>> 8007580397289: system.switch_cpus_17 T0 : @flush_tlb_others+139    : jnz    0xfffffffffffffff8
>>>> 8007580397602: system.switch_cpus_17 T0 : @flush_tlb_others+133    : NOP                      : IntAlu :
>>>> 8007580397602: system.switch_cpus_17 T0 : @flush_tlb_others+135    : cmp    DS:[rbp], 0
>>>> 8007580397602: system.switch_cpus_17 T0 : @flush_tlb_others+139    : jnz    0xfffffffffffffff8
>>>> 8007580397915: system.switch_cpus_17 T0 : @flush_tlb_others+133    : NOP                      : IntAlu :
>>>> 8007580397915: system.switch_cpus_17 T0 : @flush_tlb_others+135    : cmp    DS:[rbp], 0
>>>> 8007580397915: system.switch_cpus_17 T0 : @flush_tlb_others+139    : jnz    0xfffffffffffffff8
>>>> 8007580398228: system.switch_cpus_17 T0 : @flush_tlb_others+133    : NOP                      : IntAlu :
>>>>
>>>>
>>>>  I also ran the application with the atomic CPU model. The same instruction
>>>> sequence appears for a while, but execution soon switches to the
>>>> instructions of the application.
>>>>
>>>>  This problem has bothered me for a while. Does anyone understand it?
>>>>
>>>>  thanks,
>>>> zehan
>>>>
>>>> -- IMPORTANT NOTICE: The contents of this email and any attachments are
>>>> confidential and may also be privileged. If you are not the intended
>>>> recipient, please notify the sender immediately and do not disclose the
>>>> contents to any other person, use it for any purpose, or store or copy the
>>>> information in any medium. Thank you.
>>>>
>>>> ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ,
>>>> Registered in England & Wales, Company No: 2557590
>>>> ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1
>>>> 9NJ, Registered in England & Wales, Company No: 2548782
>>>>
>>>
>>>
>>
>
>
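
The traces quoted in this thread all show the same symptom: one CPU cycling through a handful of kernel locations while waiting on a memory word. A minimal sketch of how such a loop could be spotted automatically is given below. It assumes unwrapped trace lines in the format quoted above (tick, CPU name, thread, @symbol+offset). The names find_spin_loops and tail_cycle, the regular expression, and the thresholds min_repeats and max_period are hypothetical choices made for illustration and are not part of gem5.

```python
import re
from collections import defaultdict

# Start of one trace record: "<tick>: <cpu> T<thread> : @<symbol>+<offset>".
# This pattern is an assumption based on the lines quoted in the thread.
RECORD = re.compile(r"^\s*(\d+):\s+(\S+)\s+T\d+\s+:\s+@([\w.]+)(?:\+(\d+))?")


def tail_cycle(locs, min_repeats, max_period):
    """Find the shortest cycle that the END of `locs` repeats back to back.

    Returns (body, repeats, start_index), or (None, 0, len(locs)) if no
    cycle of at most `max_period` entries repeats `min_repeats` times.
    """
    for period in range(1, min(max_period, len(locs)) + 1):
        body = locs[-period:]
        repeats = 1
        end = len(locs) - period
        # Walk backwards one period at a time while the pattern still matches.
        while end >= period and locs[end - period:end] == body:
            repeats += 1
            end -= period
        if repeats >= min_repeats:
            return body, repeats, end
    return None, 0, len(locs)


def find_spin_loops(lines, min_repeats=3, max_period=8):
    """Report CPUs whose most recent trace records repeat a short cycle."""
    per_cpu = defaultdict(list)
    for line in lines:
        match = RECORD.match(line)
        if match:
            tick, cpu, symbol, offset = match.groups()
            per_cpu[cpu].append((int(tick), f"{symbol}+{offset or 0}"))

    report = {}
    for cpu, records in per_cpu.items():
        locs = [loc for _, loc in records]
        body, repeats, start = tail_cycle(locs, min_repeats, max_period)
        if body is not None:
            report[cpu] = {
                "body": body,                    # locations forming the loop
                "repeats": repeats,              # back-to-back repetitions seen
                "first_tick": records[start][0], # tick where the cycle begins
                "last_tick": records[-1][0],     # tick of the last record
            }
    return report
```

Passing the lines of a trace file to find_spin_loops (for example an open file object) returns one entry per CPU that ends in such a cycle. A CPU that appears here with a very large repeat count while the other CPUs are idle matches the behaviour described in the thread. The sketch only detects the loop; it does not explain why the awaited memory word never changes.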

_______________________________________________
gem5-users mailing list
gem5-users@gem5.org
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
