That said, there is no reason profiling should help in this particular test case. In general, STL/template code performance is more a matter of proper inlining heuristics that enable the later phases to do the right thing. For example, if you have short trip counts, profiling should help so that LNO will not harm performance. IOW, if the trip count is high enough, LNO's transformations should be able to "vectorize" by analysis alone.
Sun
On Thu, Jun 28, 2012 at 10:48 AM, Sun Chan <sun.c...@gmail.com> wrote:
> I will answer the last question here.
> We want to allow flexibility in terms of profiling. Each phase
> benefits differently from profile data. The infrastructure allows that,
> but later on, the phases seemed to get turned on by default and
> indiscriminately. E.g. LNO only needs to profile loop trip counts, cache
> misses, etc., so profiling at LNO time should cover selected "regions"
> only. Someone should carefully scale down and tune what to profile at
> which phase instead of simply turning on profiling everywhere and for all
> phases. Value profiling shouldn't be on by default (even for Ofast,
> IMHO).
> Sun
>
>> I simply analyzed the overhead. It is caused by a number of
>> profile initialization functions (__profile_init and __profile_pu_init) in
>> the hot loop body.
>>
>> Analysis:
>> The instrumentation is inserted before VHO by default. Each PU has its
>> own __profile_init() and __profile_pu_init(). For FTensor, there are many
>> _init() functions in the hot loop body, and they introduce high overhead
>> after intensive inlining:
>>
>> for i = 1...100000
>>   foo(); bar(); zoo();
>> ->
>> for i = 1...100000
>>   __profile_init(); __profile_pu_init(); ...;
>>   __profile_init(); __profile_pu_init(); ...;
>>   __profile_init(); __profile_pu_init(); ...;
>>
>> You can use -fb_phase=1 to instrument before LNO, which is after inlining.
>> The instrumented run is then about 10 times slower. But the performance
>> gain is only 2% on an old Opteron machine.
>>
>> The SSE instructions generated in the loop body are key to the performance
>> and need further investigation for tuning.
>>
>> Thought:
>> For C++ programs, many function bodies are very small. Instrumentation
>> overhead is high if these small functions are called in the hot region. I
>> suggest performing "simple" inlining before instrumentation to reduce the
>> overhead, based on my experience with another compiler.
>>
>> I also disabled value profiling for the evaluation; it is lightweight for
>> this case.
>>
>> BTW: Has Open64 removed the option -fb_type=N? I want to disable value
>> profiling with this option, but opencc complains that libcginstr.so cannot
>> be found. I skimmed through the code and found that value profiling is
>> always enabled for WN_Instrument. libcginstr is not handled by configure
>> and is not in osprey/targdir_lib2. That is, CG_* profiling cannot work due
>> to the lack of libcginstr.so. Is the library deprecated? Maybe I'm out...
>>
>> Please correct me if I'm wrong.
>>
>> ==
>>
>> I'd like to take this opportunity to ask a question.
>> Open64 supports instrumentation in four phases (VHO, LNO, WOPT, CG). What
>> was the motivation and driving force? Could you share knowledge or
>> experience on their pros and cons in practice? And why is BEFORE_VHO the
>> default fb_phase?
>>
>> Thanks a ton!
>>
>> On Tue, Jun 26, 2012 at 6:20 PM, Sun Chan <sun.c...@gmail.com> wrote:
>>>
>>> someone must be doing value profiling (memory op profiling) to get
>>> this kind of slowdown. Of course, it could be something really
>>> stupid. My recollection is, it should be no more than 5 times slower
>>> back then.
>>> Sun
>>>
>>> On Tue, Jun 26, 2012 at 5:33 PM, Jian-Xin Lai <laij...@gmail.com> wrote:
>>> > Yes, you are right. I measured PGO with both "-O3 -OPT:Ofast" and
>>> > "-Ofast" and found the PGO run for "-Ofast" is much slower than 20x.
>>> >
>>> > For Tensor 3:
>>> > fb_create run:
>>> > real    52m39.138s
>>> > user    52m36.841s
>>> > sys     0m0.232s
>>> > fb_opt run:
>>> > real    0m7.622s
>>> > user    0m7.572s
>>> > sys     0m0.000s
>>> >
>>> > I haven't checked why the overhead is so high.
>>> >
>>> > 2012/6/21 Walter Landry <wlan...@caltech.edu>:
>>> >> Jian-Xin Lai <laij...@gmail.com> wrote:
>>> >>> I tried the Open64 PGO on these benchmarks. Basically, the training
>>> >>> executable runs about 20 times slower. I guess the overhead of Open64
>>> >>> PGO is comparable to ICC's.
>>> >>
>>> >> What are the exact options you used when trying PGO? I found that the
>>> >> C-tran code was about 20 times slower, but the expression template
>>> >> code was much worse than that.
>>> >>
>>> >>> But there is not much performance gain from Open64 PGO. Since all
>>> >>> test cases are single files, "-O3 -OPT:Ofast" may work better.
>>> >>
>>> >> That's what I would have thought, but the FTensor results for the Intel
>>> >> compiler were much, much improved with PGO.
>>> >>
>>> >> Thanks,
>>> >> Walter Landry
>>> >>
>>> >> ------------------------------------------------------------------------------
>>> >> Live Security Virtual Conference
>>> >> Exclusive live event will cover all the ways today's security and
>>> >> threat landscape has changed and how IT managers can respond. Discussions
>>> >> will include endpoint security, mobile security and the latest in malware
>>> >> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
>>> >> _______________________________________________
>>> >> Open64-devel mailing list
>>> >> Open64-devel@lists.sourceforge.net
>>> >> https://lists.sourceforge.net/lists/listinfo/open64-devel
>>> >
>>> > --
>>> > Regards,
>>> > Lai Jian-Xin
>>
>> --
>> Regards,
>> Peng Yuan (袁鹏)