That said, there is no reason profiling should help in this particular test case. In general, STL/template code performance is more a matter of proper inlining heuristics that enable the later phases to do the right thing. For example, if you have short trip counts, profiling should help so that LNO will not harm performance. IOW, if the trip count is high enough, LNO's transformations should be able to "vectorize" by analysis alone.
Sun
On Thu, Jun 28, 2012 at 10:48 AM, Sun Chan <sun.c...@gmail.com> wrote:
> I will answer the last question here.
> We want to allow flexibility in terms of profiling. Each phase
> benefits differently from profile data. The infrastructure allows that,
> but later on, the phases seemed to get turned on by default and
> indiscriminately. E.g. LNO only needs to profile loop trip counts, cache
> misses, etc., so profiling at LNO time should cover selected "regions"
> only. Someone should carefully scale down and tune what to profile at
> which phase instead of simply turning on profiling everywhere and for all
> phases. Value profiling shouldn't be on by default (even for Ofast,
> IMHO).
> Sun
>
>> I simply analyzed the overhead. It is caused by a number of
>> profile initialization functions (__profile_init and __profile_pu_init) in
>> the hot loop body.
>>
>> Analysis:
>> The instrumentation is inserted before VHO by default. Each PU has its
>> own __profile_init() and __profile_pu_init(). For FTensor, there are many
>> _init() functions in the hot loop body, and they introduce high overhead
>> after intensive inlining:
>>
>> for i = 1...100000
>>   foo(); bar(); zoo();
>> ->
>> for i = 1...100000
>>   __profile_init(); __profile_pu_init(); ...;
>>   __profile_init(); __profile_pu_init(); ...;
>>   __profile_init(); __profile_pu_init(); ...;
>>
>> You can use -fb_phase=1 to instrument before LNO, which is after inlining.
>> The instrumented run is then about 10 times slower. But the performance
>> gain is only 2% on an old Opteron machine.
>>
>> The SSE instructions generated in the loop body are key to the performance
>> and need further investigation for tuning.
>>
>> Thought:
>> For C++ programs, many function bodies are very small. Instrumentation
>> overhead is high if these small functions are called in the hot region. I
>> suggest performing "simple" inlining before instrumentation to reduce the
>> overhead, based on my experience with another compiler.
>>
>> I also disabled value profiling for the evaluation; it is lightweight for
>> this case.
>>
>> BTW: Has Open64 removed the option -fb_type=N? I want to disable value
>> profiling with this option, but opencc complains that libcginstr.so cannot
>> be found. I skimmed through the code and found that value profiling is
>> always enabled for WN_Instrument. libcginstr is not handled by configure
>> and is not in osprey/targdir_lib2. That is, CG_* profiling cannot work due
>> to the lack of libcginstr.so. Is the library deprecated? Maybe I'm out...
>>
>> Please correct me if I'm wrong.
>>
>> ==
>>
>> I'd like to take this opportunity to ask a question.
>> Open64 supports instrumentation in four phases (VHO, LNO, WOPT, CG). What
>> was the motivation and driving force? Could you share knowledge or
>> experience on their pros and cons in practice? And why is BEFORE_VHO the
>> default fb_phase?
>>
>> Thanks a ton!
>>
>> On Tue, Jun 26, 2012 at 6:20 PM, Sun Chan <sun.c...@gmail.com> wrote:
>>>
>>> someone must be doing value profiling (memory op profiling) to get
>>> this kind of slowdown. Of course, it could be something really
>>> stupid. My recollection is, it should be no more than 5 times slower
>>> back then.
>>> Sun
>>>
>>> On Tue, Jun 26, 2012 at 5:33 PM, Jian-Xin Lai <laij...@gmail.com> wrote:
>>> > Yes, you are right. I measured PGO with both "-O3 -OPT:Ofast" and
>>> > "-Ofast" and found the PGO run for "-Ofast" is much slower than 20x.
>>> >
>>> > For Tensor 3:
>>> > fb_create run:
>>> > real    52m39.138s
>>> > user    52m36.841s
>>> > sys     0m0.232s
>>> > fb_opt run:
>>> > real    0m7.622s
>>> > user    0m7.572s
>>> > sys     0m0.000s
>>> >
>>> > I haven't checked why the overhead is so high.
>>> >
>>> > 2012/6/21 Walter Landry <wlan...@caltech.edu>:
>>> >> Jian-Xin Lai <laij...@gmail.com> wrote:
>>> >>> I tried the Open64 PGO on these benchmarks. Basically, the training
>>> >>> executable runs about 20 times slower. I guess the overhead of Open64
>>> >>> PGO is comparable to ICC's.
>>> >>
>>> >> What are the exact options you used when trying PGO? I found that the
>>> >> C-tran code was about 20 times slower, but the expression template
>>> >> code was much worse than that.
>>> >>
>>> >>> But there is not much performance gain from Open64 PGO. Since all
>>> >>> test cases are single files, "-O3 -OPT:Ofast" may work better.
>>> >>
>>> >> That's what I would have thought, but the FTensor results for the Intel
>>> >> compiler were much, much improved with PGO.
>>> >>
>>> >> Thanks,
>>> >> Walter Landry
>>> >>
>>> >> ------------------------------------------------------------------------------
>>> >> Live Security Virtual Conference
>>> >> Exclusive live event will cover all the ways today's security and
>>> >> threat landscape has changed and how IT managers can respond. Discussions
>>> >> will include endpoint security, mobile security and the latest in malware
>>> >> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
>>> >> _______________________________________________
>>> >> Open64-devel mailing list
>>> >> Open64-devel@lists.sourceforge.net
>>> >> https://lists.sourceforge.net/lists/listinfo/open64-devel
>>> >
>>> > --
>>> > Regards,
>>> > Lai Jian-Xin
>>
>> --
>> Regards,
>> Peng Yuan (袁鹏)