Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)
On Fri, May 6, 2011 at 7:57 PM, Xinliang David Li davi...@google.com wrote: I want propose a more general solution. 1) Generic Annotation Support for gcc IR -- it is used attach to application/optimization specific annotation to gimple statements and annotations can be passed around across passes. In gcc, I only see HISTOGRAM annotation for value profiling, which is not general enough 2) Support of CallInfo for each callsite. This is an annotation, but more standardized. The callinfo can be used to record information such as call attributes, call side effects, mod-ref information etc --- current gimple_call_flags can be folded into this Info structure. I don't like generic annotation facilities. What should passes to with annotated stmts that are a) transformed, b) removed? See RTL notes and all the interesting issues they cause. Then how do you store information that needs to be passed across optimization passes -- you can not possibly dump all of them into the core IR. In fact, anything that is derived from (via analysis) but not part of the core IR need to worry about update and maintenance. In current GIMPLE, we can find many such instances -- DU chains, Memory SSA, control flow information, as well as flags like visited, no_warning, PLF (?), etc. Have a unified way of representing them is a good thing so that 1) make the IR lean and mean; 2) avoid too many different side data structures. The important thing is to have a good verifier to catch insanity and inconsistency of the annotation after each pass. Well, as soon as you have a verifier it's no longer generic but _is_ part of the core IL. Similar to how EH info is part of the core IL, or alias info, or loop information. Richard.
Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)
On Sat, May 7, 2011 at 5:46 AM, Richard Guenther richard.guent...@gmail.com wrote: On Fri, May 6, 2011 at 7:57 PM, Xinliang David Li davi...@google.com wrote: I want propose a more general solution. 1) Generic Annotation Support for gcc IR -- it is used attach to application/optimization specific annotation to gimple statements and annotations can be passed around across passes. In gcc, I only see HISTOGRAM annotation for value profiling, which is not general enough 2) Support of CallInfo for each callsite. This is an annotation, but more standardized. The callinfo can be used to record information such as call attributes, call side effects, mod-ref information etc --- current gimple_call_flags can be folded into this Info structure. I don't like generic annotation facilities. What should passes to with annotated stmts that are a) transformed, b) removed? See RTL notes and all the interesting issues they cause. Then how do you store information that needs to be passed across optimization passes -- you can not possibly dump all of them into the core IR. In fact, anything that is derived from (via analysis) but not part of the core IR need to worry about update and maintenance. In current GIMPLE, we can find many such instances -- DU chains, Memory SSA, control flow information, as well as flags like visited, no_warning, PLF (?), etc. Have a unified way of representing them is a good thing so that 1) make the IR lean and mean; 2) avoid too many different side data structures. The important thing is to have a good verifier to catch insanity and inconsistency of the annotation after each pass. Well, as soon as you have a verifier it's no longer generic but _is_ part of the core IL. Similar to how EH info is part of the core IL, or alias info, or loop information. The difference is that no-core IL can be thrown away anytime and be recomputed if needed. David Richard.
Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)
On Thu, May 5, 2011 at 7:02 PM, Xinliang David Li davi...@google.com wrote: On Thu, May 5, 2011 at 2:16 AM, Richard Guenther richard.guent...@gmail.com wrote: On Thu, May 5, 2011 at 12:19 AM, Xinliang David Li davi...@google.com wrote: I can think of some more-or-less obvious high-level forms, one would for example simply stick a new DISPATCH tree into gimple_call_fn (similar to how we can have OBJ_TYPE_REF there), the DISPATCH tree would be of variable length, first operand the selector function and further operands function addresses. That would keep the actual call visible (instead of a fake __builtin_dispatch call), something I'd really like to see. This sounds like a good long term solution. Thinking about it again maybe, similar to OBJ_TYPE_REF, have the selection itself lowered and only keep the set of functions as additional info. Thus instead of having the selector function as first operand have a pointer to the selected function there (that also avoids too much knowledge about the return value of the selector). Thus, sel = selector (); switch (sel) { case A: fn = bar; case B: fn = foo; } val = (*DISPATCH (fn, bar, foo)) (...); that way regular optimizations can apply to the selection, eventually discard the dispatch if fn becomes a known direct function (similar to devirtualization). At expansion time the call address is simply taken from the first operand and an indirect call is assembled. Does the above still provide enough knowledge for the IPA path isolation? I like your original proposal (extending call) better because related information are tied together and is easier to hoist and clean up. I want propose a more general solution. 1) Generic Annotation Support for gcc IR -- it is used attach to application/optimization specific annotation to gimple statements and annotations can be passed around across passes. In gcc, I only see HISTOGRAM annotation for value profiling, which is not general enough 2) Support of CallInfo for each callsite. This is an annotation, but more standardized. The callinfo can be used to record information such as call attributes, call side effects, mod-ref information etc --- current gimple_call_flags can be folded into this Info structure. I don't like generic annotation facilities. What should passes to with annotated stmts that are a) transformed, b) removed? See RTL notes and all the interesting issues they cause. Similarly (not related to this discussion), LoopInfo structure can be introduced to annotate loop back edge jumps to allow FE to pass useful information at loop level. For floating pointer operations, things like the precision constraint, sensitivity to floating environment etc can be recorded in FPInfo. Yes, the idea is to keep the loop structures live throughout the whole compilation. Just somebody needs to do the last 1% of work. Richard. T Restricting ourselves to use the existing target attribute at the beginning (with a single, compiler-generated selector function) is probably good enough to get a prototype up and running. Extending it to arbitrary selector-function, value pairs using a new attribute is then probably easy (I don't see the exact use-case for that yet, but I suppose it exists if you say so). For the use cases, CPU model will be looked at instead of just the core architecture -- this will give use more information about the numbrer of cores, size of caches etc. Intel's runtime library does this checkiing at start up time so that the multi-versioned code can look at those and make the appropriate decisions. It will be even more complicated for arm processors -- which can have the same processor cores but configured differently w.r.t VFP, NEON etc. Ah, indeed. I hadn't thought about the tuning for different variants as opposed to enabling HW features. So the interface for overloading would be sth like enum X { Foo = 0, Bar = 5 }; enum X select () { return Bar; } void foo (void) __attribute__((dispatch(select, Bar))); Yes, for overloading -- something like this looks good. Thanks, David
Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)
On Fri, May 6, 2011 at 04:55, Richard Guenther richard.guent...@gmail.com wrote: On Thu, May 5, 2011 at 7:02 PM, Xinliang David Li davi...@google.com wrote: 2) Support of CallInfo for each callsite. This is an annotation, but more standardized. The callinfo can be used to record information such as call attributes, call side effects, mod-ref information etc --- current gimple_call_flags can be folded into this Info structure. I don't like generic annotation facilities. What should passes to with annotated stmts that are a) transformed, b) removed? See RTL notes and all the interesting issues they cause. Likewise. We kind of tried having them in the early days of gimple and tree-ssa, but quickly removed them. Anything that is not a first-class IL member, makes life difficult. We have some examples in PHI nodes and EH regions. They're a bit to the side, and require extra code to manage. Diego.
Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)
I want propose a more general solution. 1) Generic Annotation Support for gcc IR -- it is used attach to application/optimization specific annotation to gimple statements and annotations can be passed around across passes. In gcc, I only see HISTOGRAM annotation for value profiling, which is not general enough 2) Support of CallInfo for each callsite. This is an annotation, but more standardized. The callinfo can be used to record information such as call attributes, call side effects, mod-ref information etc --- current gimple_call_flags can be folded into this Info structure. I don't like generic annotation facilities. What should passes to with annotated stmts that are a) transformed, b) removed? See RTL notes and all the interesting issues they cause. Then how do you store information that needs to be passed across optimization passes -- you can not possibly dump all of them into the core IR. In fact, anything that is derived from (via analysis) but not part of the core IR need to worry about update and maintenance. In current GIMPLE, we can find many such instances -- DU chains, Memory SSA, control flow information, as well as flags like visited, no_warning, PLF (?), etc. Have a unified way of representing them is a good thing so that 1) make the IR lean and mean; 2) avoid too many different side data structures. The important thing is to have a good verifier to catch insanity and inconsistency of the annotation after each pass. Thanks, David Similarly (not related to this discussion), LoopInfo structure can be introduced to annotate loop back edge jumps to allow FE to pass useful information at loop level. For floating pointer operations, things like the precision constraint, sensitivity to floating environment etc can be recorded in FPInfo. Yes, the idea is to keep the loop structures live throughout the whole compilation. Just somebody needs to do the last 1% of work. Richard. T Restricting ourselves to use the existing target attribute at the beginning (with a single, compiler-generated selector function) is probably good enough to get a prototype up and running. Extending it to arbitrary selector-function, value pairs using a new attribute is then probably easy (I don't see the exact use-case for that yet, but I suppose it exists if you say so). For the use cases, CPU model will be looked at instead of just the core architecture -- this will give use more information about the numbrer of cores, size of caches etc. Intel's runtime library does this checkiing at start up time so that the multi-versioned code can look at those and make the appropriate decisions. It will be even more complicated for arm processors -- which can have the same processor cores but configured differently w.r.t VFP, NEON etc. Ah, indeed. I hadn't thought about the tuning for different variants as opposed to enabling HW features. So the interface for overloading would be sth like enum X { Foo = 0, Bar = 5 }; enum X select () { return Bar; } void foo (void) __attribute__((dispatch(select, Bar))); Yes, for overloading -- something like this looks good. Thanks, David
Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)
On Thu, May 5, 2011 at 12:19 AM, Xinliang David Li davi...@google.com wrote: I can think of some more-or-less obvious high-level forms, one would for example simply stick a new DISPATCH tree into gimple_call_fn (similar to how we can have OBJ_TYPE_REF there), the DISPATCH tree would be of variable length, first operand the selector function and further operands function addresses. That would keep the actual call visible (instead of a fake __builtin_dispatch call), something I'd really like to see. This sounds like a good long term solution. Thinking about it again maybe, similar to OBJ_TYPE_REF, have the selection itself lowered and only keep the set of functions as additional info. Thus instead of having the selector function as first operand have a pointer to the selected function there (that also avoids too much knowledge about the return value of the selector). Thus, sel = selector (); switch (sel) { case A: fn = bar; case B: fn = foo; } val = (*DISPATCH (fn, bar, foo)) (...); that way regular optimizations can apply to the selection, eventually discard the dispatch if fn becomes a known direct function (similar to devirtualization). At expansion time the call address is simply taken from the first operand and an indirect call is assembled. Does the above still provide enough knowledge for the IPA path isolation? Restricting ourselves to use the existing target attribute at the beginning (with a single, compiler-generated selector function) is probably good enough to get a prototype up and running. Extending it to arbitrary selector-function, value pairs using a new attribute is then probably easy (I don't see the exact use-case for that yet, but I suppose it exists if you say so). For the use cases, CPU model will be looked at instead of just the core architecture -- this will give use more information about the numbrer of cores, size of caches etc. Intel's runtime library does this checkiing at start up time so that the multi-versioned code can look at those and make the appropriate decisions. It will be even more complicated for arm processors -- which can have the same processor cores but configured differently w.r.t VFP, NEON etc. Ah, indeed. I hadn't thought about the tuning for different variants as opposed to enabling HW features. So the interface for overloading would be sth like enum X { Foo = 0, Bar = 5 }; enum X select () { return Bar; } void foo (void) __attribute__((dispatch(select, Bar))); which either means having pairs of function / select return value in the DISPATCH operands or having it partly lowered as I outlined above. For the overloading to work we probably have to force that the functions are local (so we can mangle them arbitrarily) and that if the function should be visible externally people add an externally visible dispatcher (foo in the above example would be one). For most of the cases, probably only the primary/default version needs to be publicly visible .. Yeah. And that one we eventually can auto-transform to use IFUNC relocations. Richard.
Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)
On Tue, May 3, 2011 at 11:57 PM, Xinliang David Li davi...@google.com wrote: On Tue, May 3, 2011 at 3:00 AM, Richard Guenther richard.guent...@gmail.com wrote: On Tue, May 3, 2011 at 1:07 AM, Xinliang David Li davi...@google.com wrote: On Mon, May 2, 2011 at 2:33 PM, Richard Guenther richard.guent...@gmail.com wrote: On Mon, May 2, 2011 at 6:41 PM, Xinliang David Li davi...@google.com wrote: On Mon, May 2, 2011 at 2:11 AM, Richard Guenther richard.guent...@gmail.com wrote: On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li davi...@google.com wrote: Here is the background for this feature: 1) People relies on function multi-version to explore hw features and squeeze performance, but there is no standard ways of doing so, either a) using indirect function calls with function pointers set at program initialization; b) using manual dispatch at each callsite; b) using features like IFUNC. The dispatch mechanism needs to be promoted to the language level and becomes the first class citizen; You are not doing that, you are inventing a new (crude) GCC extension. To capture the high level semantics and prevent user from lowering the dispatch calls into forms compiler can not recognize, language extension is the way to go. I don't think so. With your patch only two passes understand the new high-level form, the rest of the gimple passes are just confused. There is no need for other passes to understand it -- just treat it as opaque calls. This is goodness otherwise other passes need to be modified. This is true (only some passes understand it) for things like __builtin_expect. Certainly __builtin_dispatch has to be understood by alias analysis and all other passes that care about calls (like all IPA passes). You can of course treat it conservatively (may call any function, even those which have their address not taken, clobber and read all memory, even that which doesn't escape the TU). Why obfuscate things when it is not necessary? MVed functions are usually non-trivial, so I doubt anything will be lost due to the obfuscation. It won't be too difficult to teach aliaser to 'merge' the attributes from target functions either. No that is not my argument. What I tried to say is it will be harder to achieve without high level semantics -- it requires more handshaking between compiler passes. Sure - that's life. We are looking at improving the life .. Which nobody will see benefit from unless they rewrite their code? The target users for the builtin include compiler itself -- it can synthesize dispatch calls. Hum. I'm not at all sure the dispatch calls are the best representation for the IL. The intension is to provide an interface at both C level (for programmers) and IL level. It does not have to be a builtin (both internally and externally) -- but it needs to map to some language construct. Well, I say if we can improve _some_ of the existing usages that's better than never doing wrong on a new language extension. This is independent. It is not. One that I'm not convinced is the way to go (you didn't address at all the inability to use float arguments and the ABI issues with using variadic arguments - after all you did a poor-mans language extension by using GCC builtins instead of inventing a true one). This is an independent issue that either needs to be addressed or marked as limitation. The key of the debate is whether source/IR annotation using construct with high level semantics helps optimizer. In fact this is common. Would it make any difference (in terms of acceptance) if the builtin is only used internally by the compiler and not exposed to the user? No. I don't see at all why having everything in a single stmt is so much more convenient. And I don't see why existing IL features cannot be used to make things a little more convenient. Why not? The high level construct is simpler to deal with. It is all about doing the right optimization at the right level of abstraction. Set aside the question whether using builtin for MV dispatch is the right high level construct, looking at gcc, we can find that gcc's IR is pretty low level resulting in missing optimizations. For instance, there is no high level doloop representation -- Fortran do-loop needs to be lowered and raised back again -- the consequence is that you may not raise the loop nest into the way it was originally written -- perfect nested loop become non-perfect loop nest -- blocking certain loop transformations. Not only that, I am not sure it is even possible to record any loop level information anywhere -- is it possible to have per loop attribute such as unroll factor? Assuming gcc can do full math function inlining (for common math routines) -- in this case, do we still want to do sin/cos optimization or rely on the scalar optimizer to optimize the inlined copies of sin and cos? Not sure about gcc, I remember that dead
Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)
On Wed, May 4, 2011 at 15:35, Sriraman Tallam tmsri...@google.com wrote: * tree-pass.h (pass_tree_convert_builtin_dispatch): New pass. (pass_ipa_multiversion_dispatch): New pass. * builtin-types.def (BT_PTR_FN_INT): New pointer type. (BT_FN_INT_PTR_FN_INT_PTR_PTR_VAR): New function type for __builtin_dispatch. * builtins.def (BUILT_IN_DISPATCH): New builtin to support multi-version calls. * mversn-dispatch.c: New file. * timevar.de (TV_MVERSN_DISPATCH): New time var. * common.opt (fclone-hot-version-paths): New flag. * Makefile.in (mversn-dispatch.o): New rule. * passes.c (init_optimization_passes): Add the new multi-version and dispatch passes to the pass list. * params.def (PARAM_NUMBER_OF_MVERSN_CLONES): Define. (PARAM_MVERSN_CLONE_CGRAPH_DEPTH): Define. * doc/invoke.texi (mversn-clone-depth): Document. (num-mversn-clones): Document. (fclone-hot-version-paths): Document. * testsuite/gcc.dg/mversn7.c: New test. * testsuite/gcc.dg/mversn4.c: New test. * testsuite/gcc.dg/mversn4.h: New test. * testsuite/gcc.dg/mversn4a.c: New test. * testsuite/gcc.dg/torture/mversn1.c: New test. * testsuite/gcc.dg/mversn2.c: New test. * testsuite/gcc.dg/mversn6.c: New test. * testsuite/gcc.dg/mversn3.c: New test. * testsuite/g++.dg/mversn8.C: New test. * testsuite/g++.dg/mversn10a.C: New test. * testsuite/g++.dg/mversn14a.C: New test. * testsuite/g++.dg/tree-prof/mversn13.C: New test. * testsuite/g++.dg/tree-prof/mversn15.C: New test. * testsuite/g++.dg/tree-prof/mversn15a.C: New test. * testsuite/g++.dg/mversn9.C: New test. * testsuite/g++.dg/mversn10.C: New test. * testsuite/g++.dg/mversn12.C: New test. * testsuite/g++.dg/mversn14.C: New test. * testsuite/g++.dg/mversn16.C: New test. * testsuite/g++.dg/torture/mversn11.C: New test. * testsuite/g++.dg/torture/mversn5.C: New test. * testsuite/g++.dg/torture/mversn5.h: New test. * testsuite/g++.dg/torture/mversn5a.C: New test. * c-family/c-common.c (handle_version_selector_attribute): New function. (c_common_attribute_table): New attribute version_selector. OK. Thanks for the quick fix! Diego.
Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)
I submitted the patch. Thanks, -Sri. On Wed, May 4, 2011 at 3:13 PM, Diego Novillo dnovi...@google.com wrote: On Wed, May 4, 2011 at 15:35, Sriraman Tallam tmsri...@google.com wrote: * tree-pass.h (pass_tree_convert_builtin_dispatch): New pass. (pass_ipa_multiversion_dispatch): New pass. * builtin-types.def (BT_PTR_FN_INT): New pointer type. (BT_FN_INT_PTR_FN_INT_PTR_PTR_VAR): New function type for __builtin_dispatch. * builtins.def (BUILT_IN_DISPATCH): New builtin to support multi-version calls. * mversn-dispatch.c: New file. * timevar.de (TV_MVERSN_DISPATCH): New time var. * common.opt (fclone-hot-version-paths): New flag. * Makefile.in (mversn-dispatch.o): New rule. * passes.c (init_optimization_passes): Add the new multi-version and dispatch passes to the pass list. * params.def (PARAM_NUMBER_OF_MVERSN_CLONES): Define. (PARAM_MVERSN_CLONE_CGRAPH_DEPTH): Define. * doc/invoke.texi (mversn-clone-depth): Document. (num-mversn-clones): Document. (fclone-hot-version-paths): Document. * testsuite/gcc.dg/mversn7.c: New test. * testsuite/gcc.dg/mversn4.c: New test. * testsuite/gcc.dg/mversn4.h: New test. * testsuite/gcc.dg/mversn4a.c: New test. * testsuite/gcc.dg/torture/mversn1.c: New test. * testsuite/gcc.dg/mversn2.c: New test. * testsuite/gcc.dg/mversn6.c: New test. * testsuite/gcc.dg/mversn3.c: New test. * testsuite/g++.dg/mversn8.C: New test. * testsuite/g++.dg/mversn10a.C: New test. * testsuite/g++.dg/mversn14a.C: New test. * testsuite/g++.dg/tree-prof/mversn13.C: New test. * testsuite/g++.dg/tree-prof/mversn15.C: New test. * testsuite/g++.dg/tree-prof/mversn15a.C: New test. * testsuite/g++.dg/mversn9.C: New test. * testsuite/g++.dg/mversn10.C: New test. * testsuite/g++.dg/mversn12.C: New test. * testsuite/g++.dg/mversn14.C: New test. * testsuite/g++.dg/mversn16.C: New test. * testsuite/g++.dg/torture/mversn11.C: New test. * testsuite/g++.dg/torture/mversn5.C: New test. * testsuite/g++.dg/torture/mversn5.h: New test. * testsuite/g++.dg/torture/mversn5a.C: New test. * c-family/c-common.c (handle_version_selector_attribute): New function. (c_common_attribute_table): New attribute version_selector. OK. Thanks for the quick fix! Diego.
Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)
I can think of some more-or-less obvious high-level forms, one would for example simply stick a new DISPATCH tree into gimple_call_fn (similar to how we can have OBJ_TYPE_REF there), the DISPATCH tree would be of variable length, first operand the selector function and further operands function addresses. That would keep the actual call visible (instead of a fake __builtin_dispatch call), something I'd really like to see. This sounds like a good long term solution. Restricting ourselves to use the existing target attribute at the beginning (with a single, compiler-generated selector function) is probably good enough to get a prototype up and running. Extending it to arbitrary selector-function, value pairs using a new attribute is then probably easy (I don't see the exact use-case for that yet, but I suppose it exists if you say so). For the use cases, CPU model will be looked at instead of just the core architecture -- this will give use more information about the numbrer of cores, size of caches etc. Intel's runtime library does this checkiing at start up time so that the multi-versioned code can look at those and make the appropriate decisions. It will be even more complicated for arm processors -- which can have the same processor cores but configured differently w.r.t VFP, NEON etc. For the overloading to work we probably have to force that the functions are local (so we can mangle them arbitrarily) and that if the function should be visible externally people add an externally visible dispatcher (foo in the above example would be one). For most of the cases, probably only the primary/default version needs to be publicly visible .. Thanks, David Now, a language extension to support multi-versioning should be completely independent on any IL representation - with using a builtin you are tying them together with the only convenient mechanism we have - a mechanism that isn't optimal for either side IMNSHO. Yes -- they don't have to be tied -- they just happen to suite the needs of both ends -- but I see the value of the latest proposal (overloading) above. I did realize that using builtins was convenient (been there and done the same for some experiments ...). Richard.
Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)
On Tue, May 3, 2011 at 1:07 AM, Xinliang David Li davi...@google.com wrote: On Mon, May 2, 2011 at 2:33 PM, Richard Guenther richard.guent...@gmail.com wrote: On Mon, May 2, 2011 at 6:41 PM, Xinliang David Li davi...@google.com wrote: On Mon, May 2, 2011 at 2:11 AM, Richard Guenther richard.guent...@gmail.com wrote: On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li davi...@google.com wrote: Here is the background for this feature: 1) People relies on function multi-version to explore hw features and squeeze performance, but there is no standard ways of doing so, either a) using indirect function calls with function pointers set at program initialization; b) using manual dispatch at each callsite; b) using features like IFUNC. The dispatch mechanism needs to be promoted to the language level and becomes the first class citizen; You are not doing that, you are inventing a new (crude) GCC extension. To capture the high level semantics and prevent user from lowering the dispatch calls into forms compiler can not recognize, language extension is the way to go. I don't think so. With your patch only two passes understand the new high-level form, the rest of the gimple passes are just confused. There is no need for other passes to understand it -- just treat it as opaque calls. This is goodness otherwise other passes need to be modified. This is true (only some passes understand it) for things like __builtin_expect. Certainly __builtin_dispatch has to be understood by alias analysis and all other passes that care about calls (like all IPA passes). You can of course treat it conservatively (may call any function, even those which have their address not taken, clobber and read all memory, even that which doesn't escape the TU). Why obfuscate things when it is not necessary? 1) the desired optimization may not happen subject to many compiler heuristic changes; 2) it has other side effects such as wrong estimation of function size which may impact inlining May, may ... so you say all this can't happen under any circumstance with your special code and passes? No that is not my argument. What I tried to say is it will be harder to achieve without high level semantics -- it requires more handshaking between compiler passes. Sure - that's life. Which nobody will see benefit from unless they rewrite their code? The target users for the builtin include compiler itself -- it can synthesize dispatch calls. Hum. I'm not at all sure the dispatch calls are the best representation for the IL. Well, I say if we can improve _some_ of the existing usages that's better than never doing wrong on a new language extension. This is independent. It is not. One that I'm not convinced is the way to go (you didn't address at all the inability to use float arguments and the ABI issues with using variadic arguments - after all you did a poor-mans language extension by using GCC builtins instead of inventing a true one). This is an independent issue that either needs to be addressed or marked as limitation. The key of the debate is whether source/IR annotation using construct with high level semantics helps optimizer. In fact this is common. Would it make any difference (in terms of acceptance) if the builtin is only used internally by the compiler and not exposed to the user? No. I don't see at all why having everything in a single stmt is so much more convenient. And I don't see why existing IL features cannot be used to make things a little more convenient. 3) it limits the lowering into one form which may not be ideal -- with builtin_dispatch, after hoisting optimization, the lowering can use more efficient IFUNC scheme, for instance. I see no reason why we cannot transform a switch-indirect-call pattern into an IFUNC call. It is possible -- but it is like asking user to lower the dispatch and tell compiler to raise it again .. There is no possibility for a high-level dispatch at the source level. And if I'd have to design one I would use function overloading, like float compute_sth (float) __attribute__((version(sse4))) { ... sse4 code ... } float compute_sth (float) { ... fallback ... } float foo (float f) { return compute_sth (f); } and if you not only want to dispatch for target features you could specify a selector function and value in the attribute. You might notice that the above eventually matches the target attribute directly, just the frontends need to be taught to emit dispatch code whenever overload resolution results in ambiguities involving target attribute differences. Now, a language extension to support multi-versioning should be completely independent on any IL representation - with using a builtin you are tying them together with the only convenient mechanism we have - a mechanism that isn't optimal for either side IMNSHO. My point is that such optimization is completely independent of that dispatch thing. The
Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)
On Tue, May 3, 2011 at 12:00 PM, Richard Guenther richard.guent...@gmail.com wrote: 3) it limits the lowering into one form which may not be ideal -- with builtin_dispatch, after hoisting optimization, the lowering can use more efficient IFUNC scheme, for instance. I see no reason why we cannot transform a switch-indirect-call pattern into an IFUNC call. It is possible -- but it is like asking user to lower the dispatch and tell compiler to raise it again .. There is no possibility for a high-level dispatch at the source level. And if I'd have to design one I would use function overloading, like float compute_sth (float) __attribute__((version(sse4))) { ... sse4 code ... } float compute_sth (float) { ... fallback ... } float foo (float f) { return compute_sth (f); } and if you not only want to dispatch for target features you could specify a selector function and value in the attribute. You might notice that the above eventually matches the target attribute directly, just the frontends need to be taught to emit dispatch code whenever overload resolution results in ambiguities involving target attribute differences. Which also would allow the frontend to directly call the sse4 variant if you compile with -msse4, avoiding the need for any fancy processing. Richard.
Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)
On Tue, 3 May 2011, Mike Stump wrote: And to go one step further, if we had this, we could use this to define all data manipulation machine built-ins as generic functions, available to all compiles as normal c code, so portable code could use them everywhere, and on platforms that had instructions for any of them, they could define them as normal built-ins. I don't think the forms in which some of the machine-specific built-in functions exist are particularly good for being available everywhere - they are typically defined to correspond exactly to the defined semantics of a particular instruction, complete with e.g. the peculiarities of how overflow works for that instruction. But: * I think it does make sense to make a range of vector operations, saturating operations (see the discussion in PR 48580) etc. available as generic GNU C on all platforms, with generically defined semantics (consistent with how C generally handles things such as overflow), where instruction sets commonly have relevant instructions but it isn't readily possible to represent the operations with generic C and pattern-match them. (Just as we provide operations such as __builtin_popcount and __builtin_clz, for example - and note that __builtin_clz has undefined value at 0 rather than being defined to match a machine instruction.) * Once the operations have generic GNU C, GIMPLE and RTL descriptions, intrinsic headers should best use the GNU C representations if possible instead of calling built-in functions, and any built-in functions should expand to the generic GIMPLE and RTL instead of using UNSPECs. * It may also make sense for a target architecture to define its intrinsics in such a way that they are available on that architecture even when the relevant instructions aren't available in a particular subarchitecture. This may be defining the intrinsics in a header if those are the public interface, instead of the built-in functions. See a thread starting at http://gcc.gnu.org/ml/gcc/2010-11/msg00546.html for some discussion of doing this for SSE, for example. -- Joseph S. Myers jos...@codesourcery.com
Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)
On Tue, May 3, 2011 at 3:00 AM, Richard Guenther richard.guent...@gmail.com wrote: On Tue, May 3, 2011 at 1:07 AM, Xinliang David Li davi...@google.com wrote: On Mon, May 2, 2011 at 2:33 PM, Richard Guenther richard.guent...@gmail.com wrote: On Mon, May 2, 2011 at 6:41 PM, Xinliang David Li davi...@google.com wrote: On Mon, May 2, 2011 at 2:11 AM, Richard Guenther richard.guent...@gmail.com wrote: On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li davi...@google.com wrote: Here is the background for this feature: 1) People relies on function multi-version to explore hw features and squeeze performance, but there is no standard ways of doing so, either a) using indirect function calls with function pointers set at program initialization; b) using manual dispatch at each callsite; b) using features like IFUNC. The dispatch mechanism needs to be promoted to the language level and becomes the first class citizen; You are not doing that, you are inventing a new (crude) GCC extension. To capture the high level semantics and prevent user from lowering the dispatch calls into forms compiler can not recognize, language extension is the way to go. I don't think so. With your patch only two passes understand the new high-level form, the rest of the gimple passes are just confused. There is no need for other passes to understand it -- just treat it as opaque calls. This is goodness otherwise other passes need to be modified. This is true (only some passes understand it) for things like __builtin_expect. Certainly __builtin_dispatch has to be understood by alias analysis and all other passes that care about calls (like all IPA passes). You can of course treat it conservatively (may call any function, even those which have their address not taken, clobber and read all memory, even that which doesn't escape the TU). Why obfuscate things when it is not necessary? MVed functions are usually non-trivial, so I doubt anything will be lost due to the obfuscation. It won't be too difficult to teach aliaser to 'merge' the attributes from target functions either. No that is not my argument. What I tried to say is it will be harder to achieve without high level semantics -- it requires more handshaking between compiler passes. Sure - that's life. We are looking at improving the life .. Which nobody will see benefit from unless they rewrite their code? The target users for the builtin include compiler itself -- it can synthesize dispatch calls. Hum. I'm not at all sure the dispatch calls are the best representation for the IL. The intension is to provide an interface at both C level (for programmers) and IL level. It does not have to be a builtin (both internally and externally) -- but it needs to map to some language construct. Well, I say if we can improve _some_ of the existing usages that's better than never doing wrong on a new language extension. This is independent. It is not. One that I'm not convinced is the way to go (you didn't address at all the inability to use float arguments and the ABI issues with using variadic arguments - after all you did a poor-mans language extension by using GCC builtins instead of inventing a true one). This is an independent issue that either needs to be addressed or marked as limitation. The key of the debate is whether source/IR annotation using construct with high level semantics helps optimizer. In fact this is common. Would it make any difference (in terms of acceptance) if the builtin is only used internally by the compiler and not exposed to the user? No. I don't see at all why having everything in a single stmt is so much more convenient. And I don't see why existing IL features cannot be used to make things a little more convenient. Why not? The high level construct is simpler to deal with. It is all about doing the right optimization at the right level of abstraction. Set aside the question whether using builtin for MV dispatch is the right high level construct, looking at gcc, we can find that gcc's IR is pretty low level resulting in missing optimizations. For instance, there is no high level doloop representation -- Fortran do-loop needs to be lowered and raised back again -- the consequence is that you may not raise the loop nest into the way it was originally written -- perfect nested loop become non-perfect loop nest -- blocking certain loop transformations. Not only that, I am not sure it is even possible to record any loop level information anywhere -- is it possible to have per loop attribute such as unroll factor? Assuming gcc can do full math function inlining (for common math routines) -- in this case, do we still want to do sin/cos optimization or rely on the scalar optimizer to optimize the inlined copies of sin and cos? Not sure about gcc, I remember that dead temporary variable removal can be very hard to do if some intrinsic gets lowered too early introducing
Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)
On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li davi...@google.com wrote: Here is the background for this feature: 1) People relies on function multi-version to explore hw features and squeeze performance, but there is no standard ways of doing so, either a) using indirect function calls with function pointers set at program initialization; b) using manual dispatch at each callsite; b) using features like IFUNC. The dispatch mechanism needs to be promoted to the language level and becomes the first class citizen; You are not doing that, you are inventing a new (crude) GCC extension. 2) Most importantly, the above non standard approaches block interprocedural optimizations such as inlining. With the introduction of buitlin_dispatch and the clear semantics, compiler can do more aggressive optimization. I don't think so, but the previous mail lacked detail on why the new scheme would be better. Multi-way dispatch will be good, but most of the use cases we have seen is 2-way -- a fast path + a default path. Who are the targeted consumer of the feature? 1) For developers who like to MV function manually; 2) For user directed function cloning e.g, int foo (...) __attribute__((clone (sse)): 3) Automatic function cloning determined by compiler. I am not sure that combining the function choice and its invocation to a single builtin is good. First, you use variadic arguments for the actual function arguments which will cause vararg promotions to apply and thus it will be impossible to dispatch to functions taking single precision float values (and dependent on ABI details many more similar issues). Second, you restrict yourself to only two versions of a function - that looks quite limited and this restriction is not easy to lift with your proposed interface. See above. Multi-way dispatch can be added similarly. Not with the specified syntax. So you'd end up with _two_ language extensions. That's bad (and unacceptable, IMHO). That's a nice idea, but why not simply process all functions with a const attribute and no arguments this way? IMHO int testsse (void) __attribute__((const)); is as good and avoids the new attribute completely. The pass would also benefit other uses of this idiom (giving a way to have global dynamic initializers in C). The functions may not be strictly 'const' in a way the compiler autodetects this attribute but it presents exactly the promises to the compiler that allow this optimization. Thus, please split this piece out into a separate patch and use the const attribute. Sounds good. What happens with cloning, -fclone-hot-version-paths ? - Now, here you lost me somewhat, because I didn't look into the patch details and I am missing an example on how the lowered IL looks before that cloning. So for now I suppose this -fclone-hot-version-paths is to expose direct calls to the IL. If you would lower __builtin_dispatch to a style like int sel = selector (); switch (sel) { case 0: fn = popcnt; break; case 1: fn = popcnt_sse4; break; } res = (*fn) (25); then rather than a new pass specialized for __builtin_dispatch existing optimization passes that are able to do tracing like VRP and DOM via jump-threading or the tracer pass should be able to do this optimization for you. If they do not use FDO in a good way it is better to improve them for this case instead of writing a new pass. What you describe may not work 1) the function selection may happen in a different function; Well, sure. I propose to have the above as lowered form. If people deliberately obfuscate code it's their fault. Your scheme simply makes that obfuscation impossible (or less likely possible). So 1) is not a valid argument. 2) Compiler will need to hoist the selection into the program initializer to avoid overhead ? Isn't that the point of the const function call optimization which I said was a good thing anyway? So, after that it would be int sel = some_global_static; ... As an example of why dispatch hoisting and call chain cloning is needed: void foo(); void bar(); void doit_v1(); void doit_v2(); bool check_v () __attribute__((const)); int test(); void bar() { for (.) { foo(); } } void foo() { ... for (...) { __builtin_dispatch(check_v, doit_v1, doit_v2,...); ... } } int test () { .. bar (); } The feature testing and dispatch is embedded in a 2-deep loop nest crossing function boundaries. The call paths test -- bar -- foo needs to be cloned. This is done by hoisting dispatch up the call chain -- it ends up : void bar_v1() { for (..) { foo_v1 (); } .. } void bar_v2 () { ... for (..) { foo_v2(); } .. } void foo_v1 () { .. for ()
Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)
On Mon, May 2, 2011 at 2:11 AM, Richard Guenther richard.guent...@gmail.com wrote: On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li davi...@google.com wrote: Here is the background for this feature: 1) People relies on function multi-version to explore hw features and squeeze performance, but there is no standard ways of doing so, either a) using indirect function calls with function pointers set at program initialization; b) using manual dispatch at each callsite; b) using features like IFUNC. The dispatch mechanism needs to be promoted to the language level and becomes the first class citizen; You are not doing that, you are inventing a new (crude) GCC extension. To capture the high level semantics and prevent user from lowering the dispatch calls into forms compiler can not recognize, language extension is the way to go. See above. Multi-way dispatch can be added similarly. Not with the specified syntax. So you'd end up with _two_ language extensions. That's bad (and unacceptable, IMHO). This part can be improved. That's a nice idea, but why not simply process all functions with a const attribute and no arguments this way? IMHO int testsse (void) __attribute__((const)); is as good and avoids the new attribute completely. The pass would also benefit other uses of this idiom (giving a way to have global dynamic initializers in C). The functions may not be strictly 'const' in a way the compiler autodetects this attribute but it presents exactly the promises to the compiler that allow this optimization. Thus, please split this piece out into a separate patch and use the const attribute. Sounds good. 1) the function selection may happen in a different function; Well, sure. I propose to have the above as lowered form. If people deliberately obfuscate code it's their fault. Your scheme simply makes that obfuscation impossible (or less likely possible). So 1) is not a valid argument. Lowering too early may defeat the purpose because 1) the desired optimization may not happen subject to many compiler heuristic changes; 2) it has other side effects such as wrong estimation of function size which may impact inlining 3) it limits the lowering into one form which may not be ideal -- with builtin_dispatch, after hoisting optimization, the lowering can use more efficient IFUNC scheme, for instance. My point is that such optimization is completely independent of that dispatch thing. The above could as well be a selection to use different input arrays, one causing alias analysis issues and one not. Thus, a __builtin_dispatch centric optimization pass is the wrong way to go. I agree that many things can implemented in different ways, but a high level standard builtin_dispatch mechanism doing hoisting interprocedcurally is cleaner and simpler and targeted for function multi-versioning -- it does not depend on/rely on later phase's heuristic tunings to make the right things to happen. Function MV deserves this level of treatment as it will become more and more important for some users (e.g., Google). Now, with FDO I'd expect the foo is inlined into bar (because foo is deemed hot), That is a myth -- the truth is that there are other heuristics which can prevent this from happening. then you only need to deal with loop unswitching, which should be easy to drive from FDO. Same here -- the loop body may not be well formed/recognized. The loop nests may not be perfectly nested, or other unswitching heuristics may block it from happening. This is the common problem form many other things that get lowered too early. It is cleaner to make the high level transformation first in IPA, and let unswitching dealing with intra-procedural optimization. Thanks, David Richard. Thanks, David Note that we already have special FDO support for indirect to direct call promotion, so that might work already. Thus, with all this, __builtin_dispatch would be more like syntactic sugar to avoid writing above switch statement yourself. If you restrict that sugar to a constant number of candidates it can be replaced by a macro (or several ones, specialized for the number of candidates). Richard. For the call graph that looks like this before cloning : func_cold func_hot findOnes IsPopCntSupported popcnt | - no_popcnt where popcnt and no_popcnt are the multi-versioned functions, then after cloning : func_cold IsPopCntSupported func_hot_clone0 findOnes_clone0 popcnt | func_hot_clone1 findOnes_clone1 no_popcnt Flags : -fclone-hot-version-paths does function unswitching via cloning. --param=num-mversn-clones=num allows to specify the number of functions that should be cloned.
Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)
Hi, I want to submit this patch to google/main to make sure it is available for our internal use at Google in order to materialize some optimization opportunities. Let us continue this dicussion as I make changes and submit this for review for trunk. Thanks, -Sri. On Mon, May 2, 2011 at 9:41 AM, Xinliang David Li davi...@google.com wrote: On Mon, May 2, 2011 at 2:11 AM, Richard Guenther richard.guent...@gmail.com wrote: On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li davi...@google.com wrote: Here is the background for this feature: 1) People relies on function multi-version to explore hw features and squeeze performance, but there is no standard ways of doing so, either a) using indirect function calls with function pointers set at program initialization; b) using manual dispatch at each callsite; b) using features like IFUNC. The dispatch mechanism needs to be promoted to the language level and becomes the first class citizen; You are not doing that, you are inventing a new (crude) GCC extension. To capture the high level semantics and prevent user from lowering the dispatch calls into forms compiler can not recognize, language extension is the way to go. See above. Multi-way dispatch can be added similarly. Not with the specified syntax. So you'd end up with _two_ language extensions. That's bad (and unacceptable, IMHO). This part can be improved. That's a nice idea, but why not simply process all functions with a const attribute and no arguments this way? IMHO int testsse (void) __attribute__((const)); is as good and avoids the new attribute completely. The pass would also benefit other uses of this idiom (giving a way to have global dynamic initializers in C). The functions may not be strictly 'const' in a way the compiler autodetects this attribute but it presents exactly the promises to the compiler that allow this optimization. Thus, please split this piece out into a separate patch and use the const attribute. Sounds good. 1) the function selection may happen in a different function; Well, sure. I propose to have the above as lowered form. If people deliberately obfuscate code it's their fault. Your scheme simply makes that obfuscation impossible (or less likely possible). So 1) is not a valid argument. Lowering too early may defeat the purpose because 1) the desired optimization may not happen subject to many compiler heuristic changes; 2) it has other side effects such as wrong estimation of function size which may impact inlining 3) it limits the lowering into one form which may not be ideal -- with builtin_dispatch, after hoisting optimization, the lowering can use more efficient IFUNC scheme, for instance. My point is that such optimization is completely independent of that dispatch thing. The above could as well be a selection to use different input arrays, one causing alias analysis issues and one not. Thus, a __builtin_dispatch centric optimization pass is the wrong way to go. I agree that many things can implemented in different ways, but a high level standard builtin_dispatch mechanism doing hoisting interprocedcurally is cleaner and simpler and targeted for function multi-versioning -- it does not depend on/rely on later phase's heuristic tunings to make the right things to happen. Function MV deserves this level of treatment as it will become more and more important for some users (e.g., Google). Now, with FDO I'd expect the foo is inlined into bar (because foo is deemed hot), That is a myth -- the truth is that there are other heuristics which can prevent this from happening. then you only need to deal with loop unswitching, which should be easy to drive from FDO. Same here -- the loop body may not be well formed/recognized. The loop nests may not be perfectly nested, or other unswitching heuristics may block it from happening. This is the common problem form many other things that get lowered too early. It is cleaner to make the high level transformation first in IPA, and let unswitching dealing with intra-procedural optimization. Thanks, David Richard. Thanks, David Note that we already have special FDO support for indirect to direct call promotion, so that might work already. Thus, with all this, __builtin_dispatch would be more like syntactic sugar to avoid writing above switch statement yourself. If you restrict that sugar to a constant number of candidates it can be replaced by a macro (or several ones, specialized for the number of candidates). Richard. For the call graph that looks like this before cloning : func_cold func_hot findOnes IsPopCntSupported popcnt | - no_popcnt where popcnt and no_popcnt are the multi-versioned functions, then after cloning : func_cold IsPopCntSupported
Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)
Ok. There may be more correctness related comments -- but those can be addressed when available. For trunk, you need to address issues such as multi-way dispatch. Thanks, David On Mon, May 2, 2011 at 12:11 PM, Sriraman Tallam tmsri...@google.com wrote: Hi, I want to submit this patch to google/main to make sure it is available for our internal use at Google in order to materialize some optimization opportunities. Let us continue this dicussion as I make changes and submit this for review for trunk. Thanks, -Sri. On Mon, May 2, 2011 at 9:41 AM, Xinliang David Li davi...@google.com wrote: On Mon, May 2, 2011 at 2:11 AM, Richard Guenther richard.guent...@gmail.com wrote: On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li davi...@google.com wrote: Here is the background for this feature: 1) People relies on function multi-version to explore hw features and squeeze performance, but there is no standard ways of doing so, either a) using indirect function calls with function pointers set at program initialization; b) using manual dispatch at each callsite; b) using features like IFUNC. The dispatch mechanism needs to be promoted to the language level and becomes the first class citizen; You are not doing that, you are inventing a new (crude) GCC extension. To capture the high level semantics and prevent user from lowering the dispatch calls into forms compiler can not recognize, language extension is the way to go. See above. Multi-way dispatch can be added similarly. Not with the specified syntax. So you'd end up with _two_ language extensions. That's bad (and unacceptable, IMHO). This part can be improved. That's a nice idea, but why not simply process all functions with a const attribute and no arguments this way? IMHO int testsse (void) __attribute__((const)); is as good and avoids the new attribute completely. The pass would also benefit other uses of this idiom (giving a way to have global dynamic initializers in C). The functions may not be strictly 'const' in a way the compiler autodetects this attribute but it presents exactly the promises to the compiler that allow this optimization. Thus, please split this piece out into a separate patch and use the const attribute. Sounds good. 1) the function selection may happen in a different function; Well, sure. I propose to have the above as lowered form. If people deliberately obfuscate code it's their fault. Your scheme simply makes that obfuscation impossible (or less likely possible). So 1) is not a valid argument. Lowering too early may defeat the purpose because 1) the desired optimization may not happen subject to many compiler heuristic changes; 2) it has other side effects such as wrong estimation of function size which may impact inlining 3) it limits the lowering into one form which may not be ideal -- with builtin_dispatch, after hoisting optimization, the lowering can use more efficient IFUNC scheme, for instance. My point is that such optimization is completely independent of that dispatch thing. The above could as well be a selection to use different input arrays, one causing alias analysis issues and one not. Thus, a __builtin_dispatch centric optimization pass is the wrong way to go. I agree that many things can implemented in different ways, but a high level standard builtin_dispatch mechanism doing hoisting interprocedcurally is cleaner and simpler and targeted for function multi-versioning -- it does not depend on/rely on later phase's heuristic tunings to make the right things to happen. Function MV deserves this level of treatment as it will become more and more important for some users (e.g., Google). Now, with FDO I'd expect the foo is inlined into bar (because foo is deemed hot), That is a myth -- the truth is that there are other heuristics which can prevent this from happening. then you only need to deal with loop unswitching, which should be easy to drive from FDO. Same here -- the loop body may not be well formed/recognized. The loop nests may not be perfectly nested, or other unswitching heuristics may block it from happening. This is the common problem form many other things that get lowered too early. It is cleaner to make the high level transformation first in IPA, and let unswitching dealing with intra-procedural optimization. Thanks, David Richard. Thanks, David Note that we already have special FDO support for indirect to direct call promotion, so that might work already. Thus, with all this, __builtin_dispatch would be more like syntactic sugar to avoid writing above switch statement yourself. If you restrict that sugar to a constant number of candidates it can be replaced by a macro (or several ones, specialized for the number of candidates). Richard. For the call graph that looks like this before cloning : func_cold func_hot findOnes
Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)
On Mon, May 2, 2011 at 6:41 PM, Xinliang David Li davi...@google.com wrote: On Mon, May 2, 2011 at 2:11 AM, Richard Guenther richard.guent...@gmail.com wrote: On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li davi...@google.com wrote: Here is the background for this feature: 1) People relies on function multi-version to explore hw features and squeeze performance, but there is no standard ways of doing so, either a) using indirect function calls with function pointers set at program initialization; b) using manual dispatch at each callsite; b) using features like IFUNC. The dispatch mechanism needs to be promoted to the language level and becomes the first class citizen; You are not doing that, you are inventing a new (crude) GCC extension. To capture the high level semantics and prevent user from lowering the dispatch calls into forms compiler can not recognize, language extension is the way to go. I don't think so. With your patch only two passes understand the new high-level form, the rest of the gimple passes are just confused. See above. Multi-way dispatch can be added similarly. Not with the specified syntax. So you'd end up with _two_ language extensions. That's bad (and unacceptable, IMHO). This part can be improved. That's a nice idea, but why not simply process all functions with a const attribute and no arguments this way? IMHO int testsse (void) __attribute__((const)); is as good and avoids the new attribute completely. The pass would also benefit other uses of this idiom (giving a way to have global dynamic initializers in C). The functions may not be strictly 'const' in a way the compiler autodetects this attribute but it presents exactly the promises to the compiler that allow this optimization. Thus, please split this piece out into a separate patch and use the const attribute. Sounds good. 1) the function selection may happen in a different function; Well, sure. I propose to have the above as lowered form. If people deliberately obfuscate code it's their fault. Your scheme simply makes that obfuscation impossible (or less likely possible). So 1) is not a valid argument. Lowering too early may defeat the purpose because 1) the desired optimization may not happen subject to many compiler heuristic changes; 2) it has other side effects such as wrong estimation of function size which may impact inlining May, may ... so you say all this can't happen under any circumstance with your special code and passes? Which nobody will see benefit from unless they rewrite their code? Well, I say if we can improve _some_ of the existing usages that's better than never doing wrong on a new language extension. One that I'm not convinced is the way to go (you didn't address at all the inability to use float arguments and the ABI issues with using variadic arguments - after all you did a poor-mans language extension by using GCC builtins instead of inventing a true one). 3) it limits the lowering into one form which may not be ideal -- with builtin_dispatch, after hoisting optimization, the lowering can use more efficient IFUNC scheme, for instance. I see no reason why we cannot transform a switch-indirect-call pattern into an IFUNC call. My point is that such optimization is completely independent of that dispatch thing. The above could as well be a selection to use different input arrays, one causing alias analysis issues and one not. Thus, a __builtin_dispatch centric optimization pass is the wrong way to go. I agree that many things can implemented in different ways, but a high level standard builtin_dispatch mechanism doing hoisting interprocedcurally is cleaner and simpler and targeted for function multi-versioning -- it does not depend on/rely on later phase's heuristic tunings to make the right things to happen. Function MV deserves this level of treatment as it will become more and more important for some users (e.g., Google). But inventing a new language extension to benefit from whatever improvements we implement isn't the obviously best choice. Now, with FDO I'd expect the foo is inlined into bar (because foo is deemed hot), That is a myth -- the truth is that there are other heuristics which can prevent this from happening. Huh, sure. That doesn't make my expectation a myth. then you only need to deal with loop unswitching, which should be easy to drive from FDO. Same here -- the loop body may not be well formed/recognized. The loop nests may not be perfectly nested, or other unswitching heuristics may block it from happening. This is the common problem form many other things that get lowered too early. It is cleaner to make the high level transformation first in IPA, and let unswitching dealing with intra-procedural optimization. If it's not well-formed inlining the call does not make it well-formed and thus it won't be optimized well anyway. Btw, for the usual cases I
Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)
On Mon, May 2, 2011 at 2:33 PM, Richard Guenther richard.guent...@gmail.com wrote: On Mon, May 2, 2011 at 6:41 PM, Xinliang David Li davi...@google.com wrote: On Mon, May 2, 2011 at 2:11 AM, Richard Guenther richard.guent...@gmail.com wrote: On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li davi...@google.com wrote: Here is the background for this feature: 1) People relies on function multi-version to explore hw features and squeeze performance, but there is no standard ways of doing so, either a) using indirect function calls with function pointers set at program initialization; b) using manual dispatch at each callsite; b) using features like IFUNC. The dispatch mechanism needs to be promoted to the language level and becomes the first class citizen; You are not doing that, you are inventing a new (crude) GCC extension. To capture the high level semantics and prevent user from lowering the dispatch calls into forms compiler can not recognize, language extension is the way to go. I don't think so. With your patch only two passes understand the new high-level form, the rest of the gimple passes are just confused. There is no need for other passes to understand it -- just treat it as opaque calls. This is goodness otherwise other passes need to be modified. This is true (only some passes understand it) for things like __builtin_expect. 1) the desired optimization may not happen subject to many compiler heuristic changes; 2) it has other side effects such as wrong estimation of function size which may impact inlining May, may ... so you say all this can't happen under any circumstance with your special code and passes? No that is not my argument. What I tried to say is it will be harder to achieve without high level semantics -- it requires more handshaking between compiler passes. Which nobody will see benefit from unless they rewrite their code? The target users for the builtin include compiler itself -- it can synthesize dispatch calls. Well, I say if we can improve _some_ of the existing usages that's better than never doing wrong on a new language extension. This is independent. One that I'm not convinced is the way to go (you didn't address at all the inability to use float arguments and the ABI issues with using variadic arguments - after all you did a poor-mans language extension by using GCC builtins instead of inventing a true one). This is an independent issue that either needs to be addressed or marked as limitation. The key of the debate is whether source/IR annotation using construct with high level semantics helps optimizer. In fact this is common. Would it make any difference (in terms of acceptance) if the builtin is only used internally by the compiler and not exposed to the user? 3) it limits the lowering into one form which may not be ideal -- with builtin_dispatch, after hoisting optimization, the lowering can use more efficient IFUNC scheme, for instance. I see no reason why we cannot transform a switch-indirect-call pattern into an IFUNC call. It is possible -- but it is like asking user to lower the dispatch and tell compiler to raise it again .. My point is that such optimization is completely independent of that dispatch thing. The above could as well be a selection to use different input arrays, one causing alias analysis issues and one not. Thus, a __builtin_dispatch centric optimization pass is the wrong way to go. I agree that many things can implemented in different ways, but a high level standard builtin_dispatch mechanism doing hoisting interprocedcurally is cleaner and simpler and targeted for function multi-versioning -- it does not depend on/rely on later phase's heuristic tunings to make the right things to happen. Function MV deserves this level of treatment as it will become more and more important for some users (e.g., Google). But inventing a new language extension to benefit from whatever improvements we implement isn't the obviously best choice. It is not for any improvement. I mentioned the potential for function MV and want to have a compiler infrastructure to deal with it. Now, with FDO I'd expect the foo is inlined into bar (because foo is deemed hot), That is a myth -- the truth is that there are other heuristics which can prevent this from happening. Huh, sure. That doesn't make my expectation a myth. then you only need to deal with loop unswitching, which should be easy to drive from FDO. Same here -- the loop body may not be well formed/recognized. The loop nests may not be perfectly nested, or other unswitching heuristics may block it from happening. This is the common problem form many other things that get lowered too early. It is cleaner to make the high level transformation first in IPA, and let unswitching dealing with intra-procedural optimization. If it's not well-formed inlining the call does not make it well-formed and thus it won't be
Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)
On Fri, Apr 29, 2011 at 4:52 AM, Sriraman Tallam tmsri...@google.com wrote: I want this patch to be considered for google/main now. This is intended to be submitted to trunk for review soon. This patch has been tested with crosstool bootstrap using buildit and by running all tests. Patch Description : == Frequently executed functions in programs are some times compiled into different versions to take advantage of some specific advanced features present in some of the types of platforms on which the program is executed. For instance, SSE4 versus no-SSE4. Existing techniques to do multi-versioning, for example using indirect calls, block compiler optimizations, mainly inlining. The general idea of supporting versioning is good, thanks for working on it. Comments below. This patch adds a new GCC pass to support multi-versioned calls through a new builtin, __builtin_dispatch. The __builtin_dispatch call is converted into a simple if-then construct to call the appropriate version at run-time. There are no indirect calls so inlining is not affected. I am not sure that combining the function choice and its invocation to a single builtin is good. First, you use variadic arguments for the actual function arguments which will cause vararg promotions to apply and thus it will be impossible to dispatch to functions taking single precision float values (and dependent on ABI details many more similar issues). Second, you restrict yourself to only two versions of a function - that looks quite limited and this restriction is not easy to lift with your proposed interface. Thus, I would have kept regular (but indirect) calls in the source program and only exposed the dispatch via a builtin, like ... This patch also supports a function unswitching optimization via cloning of functions, only along hot paths, that call multi-versioned functions so that the hot multi-versioned paths can be independently optimized for performance. The cloning optimization is switched on via a flag, -fclone-hot-version-paths. Only two versions are supported in this patch. Example use : int popcnt_sse4(unsigned int x) __attribute__((__target__(popcnt))); int popcnt_sse4(unsigned int x) { int count = __builtin_popcount(x); return count; } int popcnt(unsigned int x) __attribute__((__target__(no-popcnt))); int popcnt(unsigned int x) { int count = __builtin_popcount(x); return count; } int testsse() __attribute__((version_selector)); int main () { ... /* The multi-versioned call. */ ret = __builtin_dispatch (testsse, (void*)popcnt_sse4, (void*)popcnt, 25); ... } int main() { int (*ppcnt)(unsigned int) = __builtin_dispatch (testsse, popcnt_sse4, popcnt); ret = (*ppcnt) (25); } which would allow the testsse function to return the argument number of the function to select. [snip] When to use the version_selector attribute ? --- Functions are marked with attribute version_selector only if they are run-time constants. Example of such functions would be those that test if a specific feature is available on a particular architecture. Such functions must return a positive integer. For two-way functions, those that test if a feature is present or not must return 1 or 0 respectively. This patch will make constructors that call these function once and assign the outcome to a global variable. From then on, only the value of the global will be used to decide which version to execute. That's a nice idea, but why not simply process all functions with a const attribute and no arguments this way? IMHO int testsse (void) __attribute__((const)); is as good and avoids the new attribute completely. The pass would also benefit other uses of this idiom (giving a way to have global dynamic initializers in C). The functions may not be strictly 'const' in a way the compiler autodetects this attribute but it presents exactly the promises to the compiler that allow this optimization. Thus, please split this piece out into a separate patch and use the const attribute. What happens with cloning, -fclone-hot-version-paths ? - Now, here you lost me somewhat, because I didn't look into the patch details and I am missing an example on how the lowered IL looks before that cloning. So for now I suppose this -fclone-hot-version-paths is to expose direct calls to the IL. If you would lower __builtin_dispatch to a style like int sel = selector (); switch (sel) { case 0: fn = popcnt; break; case 1: fn = popcnt_sse4; break; } res = (*fn) (25); then rather than a new pass specialized for __builtin_dispatch existing optimization passes that are able to do tracing like VRP and DOM via jump-threading or the tracer pass should be able to do this optimization for you. If they do not
Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)
Here is the background for this feature: 1) People relies on function multi-version to explore hw features and squeeze performance, but there is no standard ways of doing so, either a) using indirect function calls with function pointers set at program initialization; b) using manual dispatch at each callsite; b) using features like IFUNC. The dispatch mechanism needs to be promoted to the language level and becomes the first class citizen; 2) Most importantly, the above non standard approaches block interprocedural optimizations such as inlining. With the introduction of buitlin_dispatch and the clear semantics, compiler can do more aggressive optimization. Multi-way dispatch will be good, but most of the use cases we have seen is 2-way -- a fast path + a default path. Who are the targeted consumer of the feature? 1) For developers who like to MV function manually; 2) For user directed function cloning e.g, int foo (...) __attribute__((clone (sse)): 3) Automatic function cloning determined by compiler. I am not sure that combining the function choice and its invocation to a single builtin is good. First, you use variadic arguments for the actual function arguments which will cause vararg promotions to apply and thus it will be impossible to dispatch to functions taking single precision float values (and dependent on ABI details many more similar issues). Second, you restrict yourself to only two versions of a function - that looks quite limited and this restriction is not easy to lift with your proposed interface. See above. Multi-way dispatch can be added similarly. That's a nice idea, but why not simply process all functions with a const attribute and no arguments this way? IMHO int testsse (void) __attribute__((const)); is as good and avoids the new attribute completely. The pass would also benefit other uses of this idiom (giving a way to have global dynamic initializers in C). The functions may not be strictly 'const' in a way the compiler autodetects this attribute but it presents exactly the promises to the compiler that allow this optimization. Thus, please split this piece out into a separate patch and use the const attribute. Sounds good. What happens with cloning, -fclone-hot-version-paths ? - Now, here you lost me somewhat, because I didn't look into the patch details and I am missing an example on how the lowered IL looks before that cloning. So for now I suppose this -fclone-hot-version-paths is to expose direct calls to the IL. If you would lower __builtin_dispatch to a style like int sel = selector (); switch (sel) { case 0: fn = popcnt; break; case 1: fn = popcnt_sse4; break; } res = (*fn) (25); then rather than a new pass specialized for __builtin_dispatch existing optimization passes that are able to do tracing like VRP and DOM via jump-threading or the tracer pass should be able to do this optimization for you. If they do not use FDO in a good way it is better to improve them for this case instead of writing a new pass. What you describe may not work 1) the function selection may happen in a different function; 2) Compiler will need to hoist the selection into the program initializer to avoid overhead As an example of why dispatch hoisting and call chain cloning is needed: void foo(); void bar(); void doit_v1(); void doit_v2(); bool check_v () __attribute__((const)); int test(); void bar() { for (.) { foo(); } } void foo() { ... for (...) { __builtin_dispatch(check_v, doit_v1, doit_v2,...); ... } } int test () { .. bar (); } The feature testing and dispatch is embedded in a 2-deep loop nest crossing function boundaries. The call paths test -- bar -- foo needs to be cloned. This is done by hoisting dispatch up the call chain -- it ends up : void bar_v1() { for (..) { foo_v1 (); } .. } void bar_v2 () { ... for (..) { foo_v2(); } .. } void foo_v1 () { .. for () { doit_v1() } } void foo_v2 () { .. for () { doit_v2() } } int test () { __builtin_dispatch(check_v, bar_v1, bar_v2); .. } Thanks, David Note that we already have special FDO support for indirect to direct call promotion, so that might work already. Thus, with all this, __builtin_dispatch would be more like syntactic sugar to avoid writing above switch statement yourself. If you restrict that sugar to a constant number of candidates it can be replaced by a macro (or several ones, specialized for the number of candidates). Richard. For the call graph that looks like this before cloning : func_cold func_hot findOnes IsPopCntSupported popcnt |
Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)
Hi Richard, Thanks for the comments. Please find inline responses. On Fri, Apr 29, 2011 at 1:56 AM, Richard Guenther richard.guent...@gmail.com wrote: On Fri, Apr 29, 2011 at 4:52 AM, Sriraman Tallam tmsri...@google.com wrote: I want this patch to be considered for google/main now. This is intended to be submitted to trunk for review soon. This patch has been tested with crosstool bootstrap using buildit and by running all tests. Patch Description : == Frequently executed functions in programs are some times compiled into different versions to take advantage of some specific advanced features present in some of the types of platforms on which the program is executed. For instance, SSE4 versus no-SSE4. Existing techniques to do multi-versioning, for example using indirect calls, block compiler optimizations, mainly inlining. The general idea of supporting versioning is good, thanks for working on it. Comments below. This patch adds a new GCC pass to support multi-versioned calls through a new builtin, __builtin_dispatch. The __builtin_dispatch call is converted into a simple if-then construct to call the appropriate version at run-time. There are no indirect calls so inlining is not affected. I am not sure that combining the function choice and its invocation to a single builtin is good. First, you use variadic arguments for the actual function arguments which will cause vararg promotions to apply and thus it will be impossible to dispatch to functions taking single precision float values (and dependent on ABI details many more similar issues). Second, you restrict yourself to only two versions of a function - that looks quite limited and this restriction is not easy to lift with your proposed interface. Thus, I would have kept regular (but indirect) calls in the source program and only exposed the dispatch via a builtin, like ... This patch also supports a function unswitching optimization via cloning of functions, only along hot paths, that call multi-versioned functions so that the hot multi-versioned paths can be independently optimized for performance. The cloning optimization is switched on via a flag, -fclone-hot-version-paths. Only two versions are supported in this patch. Example use : int popcnt_sse4(unsigned int x) __attribute__((__target__(popcnt))); int popcnt_sse4(unsigned int x) { int count = __builtin_popcount(x); return count; } int popcnt(unsigned int x) __attribute__((__target__(no-popcnt))); int popcnt(unsigned int x) { int count = __builtin_popcount(x); return count; } int testsse() __attribute__((version_selector)); int main () { ... /* The multi-versioned call. */ ret = __builtin_dispatch (testsse, (void*)popcnt_sse4, (void*)popcnt, 25); ... } int main() { int (*ppcnt)(unsigned int) = __builtin_dispatch (testsse, popcnt_sse4, popcnt); ret = (*ppcnt) (25); } which would allow the testsse function to return the argument number of the function to select. [snip] When to use the version_selector attribute ? --- Functions are marked with attribute version_selector only if they are run-time constants. Example of such functions would be those that test if a specific feature is available on a particular architecture. Such functions must return a positive integer. For two-way functions, those that test if a feature is present or not must return 1 or 0 respectively. This patch will make constructors that call these function once and assign the outcome to a global variable. From then on, only the value of the global will be used to decide which version to execute. That's a nice idea, but why not simply process all functions with a const attribute and no arguments this way? IMHO int testsse (void) __attribute__((const)); is as good and avoids the new attribute completely. The pass would also benefit other uses of this idiom (giving a way to have global dynamic initializers in C). The functions may not be strictly 'const' in a way the compiler autodetects this attribute but it presents exactly the promises to the compiler that allow this optimization. Since, const functions cannot, not according to the definition atleast , read files or memory, I was thinking along the lines of making it a pure function (For example, The version_selector function could be reading /proc/cpuinfo and looking for certain features). The reason I invented a new attribute was because I assumed it is not legal to hoist the call site of a pure function into a constructor and substitute the value computed simply because it may not be valid to call the pure function ahead of the intended time in the execution. Basically, the semantics of the version_selector is that the value returned is a run-time constant and that it can be