Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

2011-05-07 Thread Richard Guenther
On Fri, May 6, 2011 at 7:57 PM, Xinliang David Li davi...@google.com wrote:
 I want propose a more general solution.

 1) Generic Annotation Support for gcc IR -- it is used attach to
 application/optimization specific annotation to gimple statements and
 annotations can be passed around across passes. In gcc, I only see
 HISTOGRAM annotation for value profiling, which is not general enough
 2) Support of CallInfo for each callsite. This is an annotation, but
 more standardized. The callinfo can be used to record information such
 as call attributes, call side effects, mod-ref information etc ---
 current gimple_call_flags can be folded into this Info structure.

 I don't like generic annotation facilities.  What should passes to with
 annotated stmts that are a) transformed, b) removed?  See RTL notes
 and all the interesting issues they cause.


 Then how do you store information that needs to be passed across
 optimization passes -- you can not possibly dump all of them into the
 core IR. In fact, anything that is derived from (via analysis) but not
 part of the core IR need to worry about update and maintenance. In
 current GIMPLE, we can find many such instances -- DU chains, Memory
 SSA, control flow information, as well as flags like visited,
 no_warning, PLF (?), etc. Have a unified way of representing them is a
 good thing so that 1) make the IR lean and mean; 2) avoid too many
 different side data structures.  The important thing is to have a good
 verifier to catch insanity and inconsistency of the annotation after
 each pass.

Well, as soon as you have a verifier it's no longer generic but _is_
part of the core IL.  Similar to how EH info is part of the core IL,
or alias info, or loop information.

Richard.


Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

2011-05-07 Thread Xinliang David Li
On Sat, May 7, 2011 at 5:46 AM, Richard Guenther
richard.guent...@gmail.com wrote:
 On Fri, May 6, 2011 at 7:57 PM, Xinliang David Li davi...@google.com wrote:
 I want propose a more general solution.

 1) Generic Annotation Support for gcc IR -- it is used attach to
 application/optimization specific annotation to gimple statements and
 annotations can be passed around across passes. In gcc, I only see
 HISTOGRAM annotation for value profiling, which is not general enough
 2) Support of CallInfo for each callsite. This is an annotation, but
 more standardized. The callinfo can be used to record information such
 as call attributes, call side effects, mod-ref information etc ---
 current gimple_call_flags can be folded into this Info structure.

 I don't like generic annotation facilities.  What should passes to with
 annotated stmts that are a) transformed, b) removed?  See RTL notes
 and all the interesting issues they cause.


 Then how do you store information that needs to be passed across
 optimization passes -- you can not possibly dump all of them into the
 core IR. In fact, anything that is derived from (via analysis) but not
 part of the core IR need to worry about update and maintenance. In
 current GIMPLE, we can find many such instances -- DU chains, Memory
 SSA, control flow information, as well as flags like visited,
 no_warning, PLF (?), etc. Have a unified way of representing them is a
 good thing so that 1) make the IR lean and mean; 2) avoid too many
 different side data structures.  The important thing is to have a good
 verifier to catch insanity and inconsistency of the annotation after
 each pass.

 Well, as soon as you have a verifier it's no longer generic but _is_
 part of the core IL.  Similar to how EH info is part of the core IL,
 or alias info, or loop information.

The difference is that no-core IL can be thrown away anytime and be
recomputed if needed.

David



 Richard.



Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

2011-05-06 Thread Richard Guenther
On Thu, May 5, 2011 at 7:02 PM, Xinliang David Li davi...@google.com wrote:
 On Thu, May 5, 2011 at 2:16 AM, Richard Guenther
 richard.guent...@gmail.com wrote:
 On Thu, May 5, 2011 at 12:19 AM, Xinliang David Li davi...@google.com 
 wrote:

 I can think of some more-or-less obvious high-level forms, one would
 for example simply stick a new DISPATCH tree into gimple_call_fn
 (similar to how we can have OBJ_TYPE_REF there), the DISPATCH
 tree would be of variable length, first operand the selector function
 and further operands function addresses.  That would keep the
 actual call visible (instead of a fake __builtin_dispatch call), something
 I'd really like to see.

 This sounds like a good long term solution.

 Thinking about it again maybe, similar to OBJ_TYPE_REF, have the
 selection itself lowered and only keep the set of functions as
 additional info.  Thus instead of having the selector function as
 first operand have a pointer to the selected function there (that also
 avoids too much knowledge about the return value of the selector).
 Thus,

  sel = selector ();
  switch (sel)
   {
   case A: fn = bar;
   case B: fn = foo;
   }
  val = (*DISPATCH (fn, bar, foo)) (...);

 that way regular optimizations can apply to the selection, eventually
 discard the dispatch if fn becomes a known direct function (similar
 to devirtualization).  At expansion time the call address is simply
 taken from the first operand and an indirect call is assembled.

 Does the above still provide enough knowledge for the IPA path isolation?


 I like your original proposal (extending call) better because related
 information are tied together and is easier to hoist and clean up.

 I want propose a more general solution.

 1) Generic Annotation Support for gcc IR -- it is used attach to
 application/optimization specific annotation to gimple statements and
 annotations can be passed around across passes. In gcc, I only see
 HISTOGRAM annotation for value profiling, which is not general enough
 2) Support of CallInfo for each callsite. This is an annotation, but
 more standardized. The callinfo can be used to record information such
 as call attributes, call side effects, mod-ref information etc ---
 current gimple_call_flags can be folded into this Info structure.

I don't like generic annotation facilities.  What should passes to with
annotated stmts that are a) transformed, b) removed?  See RTL notes
and all the interesting issues they cause.

 Similarly (not related to this discussion), LoopInfo structure can be
 introduced to annotate loop back edge jumps to allow FE to pass useful
 information at loop level. For floating pointer operations, things
 like the precision constraint, sensitivity to floating environment etc
 can be recorded in FPInfo.

Yes, the idea is to keep the loop structures live throughout the whole
compilation.  Just somebody needs to do the last 1% of work.

Richard.

 T


 Restricting ourselves to use the existing target attribute at the
 beginning (with a single, compiler-generated selector function)
 is probably good enough to get a prototype up and running.
 Extending it to arbitrary selector-function, value pairs using a
 new attribute is then probably easy (I don't see the exact use-case
 for that yet, but I suppose it exists if you say so).

 For the use cases, CPU model will be looked at instead of just the
 core architecture -- this will give use more information about the
 numbrer of cores, size of caches etc. Intel's runtime library does
 this checkiing at start up time so that the multi-versioned code can
 look at those and make the appropriate decisions.

 It will be even more complicated for arm processors -- which can have
 the same processor cores but configured differently w.r.t VFP, NEON
 etc.

 Ah, indeed.  I hadn't thought about the tuning for different variants
 as opposed to enabling HW features.  So the interface for overloading
 would be sth like

 enum X { Foo = 0, Bar = 5 };

 enum X select () { return Bar; }

 void foo (void) __attribute__((dispatch(select, Bar)));


 Yes, for overloading -- something like this looks good.

 Thanks,

 David



Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

2011-05-06 Thread Diego Novillo
On Fri, May 6, 2011 at 04:55, Richard Guenther
richard.guent...@gmail.com wrote:
 On Thu, May 5, 2011 at 7:02 PM, Xinliang David Li davi...@google.com wrote:

 2) Support of CallInfo for each callsite. This is an annotation, but
 more standardized. The callinfo can be used to record information such
 as call attributes, call side effects, mod-ref information etc ---
 current gimple_call_flags can be folded into this Info structure.

 I don't like generic annotation facilities.  What should passes to with
 annotated stmts that are a) transformed, b) removed?  See RTL notes
 and all the interesting issues they cause.

Likewise.  We kind of tried having them in the early days of gimple
and tree-ssa, but quickly removed them.  Anything that is not a
first-class IL member, makes life difficult.  We have some examples in
PHI nodes and EH regions.  They're a bit to the side, and require
extra code to manage.


Diego.


Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

2011-05-06 Thread Xinliang David Li
 I want propose a more general solution.

 1) Generic Annotation Support for gcc IR -- it is used attach to
 application/optimization specific annotation to gimple statements and
 annotations can be passed around across passes. In gcc, I only see
 HISTOGRAM annotation for value profiling, which is not general enough
 2) Support of CallInfo for each callsite. This is an annotation, but
 more standardized. The callinfo can be used to record information such
 as call attributes, call side effects, mod-ref information etc ---
 current gimple_call_flags can be folded into this Info structure.

 I don't like generic annotation facilities.  What should passes to with
 annotated stmts that are a) transformed, b) removed?  See RTL notes
 and all the interesting issues they cause.


Then how do you store information that needs to be passed across
optimization passes -- you can not possibly dump all of them into the
core IR. In fact, anything that is derived from (via analysis) but not
part of the core IR need to worry about update and maintenance. In
current GIMPLE, we can find many such instances -- DU chains, Memory
SSA, control flow information, as well as flags like visited,
no_warning, PLF (?), etc. Have a unified way of representing them is a
good thing so that 1) make the IR lean and mean; 2) avoid too many
different side data structures.  The important thing is to have a good
verifier to catch insanity and inconsistency of the annotation after
each pass.

Thanks,

David



 Similarly (not related to this discussion), LoopInfo structure can be
 introduced to annotate loop back edge jumps to allow FE to pass useful
 information at loop level. For floating pointer operations, things
 like the precision constraint, sensitivity to floating environment etc
 can be recorded in FPInfo.

 Yes, the idea is to keep the loop structures live throughout the whole
 compilation.  Just somebody needs to do the last 1% of work.

 Richard.

 T


 Restricting ourselves to use the existing target attribute at the
 beginning (with a single, compiler-generated selector function)
 is probably good enough to get a prototype up and running.
 Extending it to arbitrary selector-function, value pairs using a
 new attribute is then probably easy (I don't see the exact use-case
 for that yet, but I suppose it exists if you say so).

 For the use cases, CPU model will be looked at instead of just the
 core architecture -- this will give use more information about the
 numbrer of cores, size of caches etc. Intel's runtime library does
 this checkiing at start up time so that the multi-versioned code can
 look at those and make the appropriate decisions.

 It will be even more complicated for arm processors -- which can have
 the same processor cores but configured differently w.r.t VFP, NEON
 etc.

 Ah, indeed.  I hadn't thought about the tuning for different variants
 as opposed to enabling HW features.  So the interface for overloading
 would be sth like

 enum X { Foo = 0, Bar = 5 };

 enum X select () { return Bar; }

 void foo (void) __attribute__((dispatch(select, Bar)));


 Yes, for overloading -- something like this looks good.

 Thanks,

 David




Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

2011-05-05 Thread Richard Guenther
On Thu, May 5, 2011 at 12:19 AM, Xinliang David Li davi...@google.com wrote:

 I can think of some more-or-less obvious high-level forms, one would
 for example simply stick a new DISPATCH tree into gimple_call_fn
 (similar to how we can have OBJ_TYPE_REF there), the DISPATCH
 tree would be of variable length, first operand the selector function
 and further operands function addresses.  That would keep the
 actual call visible (instead of a fake __builtin_dispatch call), something
 I'd really like to see.

 This sounds like a good long term solution.

Thinking about it again maybe, similar to OBJ_TYPE_REF, have the
selection itself lowered and only keep the set of functions as
additional info.  Thus instead of having the selector function as
first operand have a pointer to the selected function there (that also
avoids too much knowledge about the return value of the selector).
Thus,

  sel = selector ();
  switch (sel)
   {
   case A: fn = bar;
   case B: fn = foo;
   }
  val = (*DISPATCH (fn, bar, foo)) (...);

that way regular optimizations can apply to the selection, eventually
discard the dispatch if fn becomes a known direct function (similar
to devirtualization).  At expansion time the call address is simply
taken from the first operand and an indirect call is assembled.

Does the above still provide enough knowledge for the IPA path isolation?

 Restricting ourselves to use the existing target attribute at the
 beginning (with a single, compiler-generated selector function)
 is probably good enough to get a prototype up and running.
 Extending it to arbitrary selector-function, value pairs using a
 new attribute is then probably easy (I don't see the exact use-case
 for that yet, but I suppose it exists if you say so).

 For the use cases, CPU model will be looked at instead of just the
 core architecture -- this will give use more information about the
 numbrer of cores, size of caches etc. Intel's runtime library does
 this checkiing at start up time so that the multi-versioned code can
 look at those and make the appropriate decisions.

 It will be even more complicated for arm processors -- which can have
 the same processor cores but configured differently w.r.t VFP, NEON
 etc.

Ah, indeed.  I hadn't thought about the tuning for different variants
as opposed to enabling HW features.  So the interface for overloading
would be sth like

enum X { Foo = 0, Bar = 5 };

enum X select () { return Bar; }

void foo (void) __attribute__((dispatch(select, Bar)));

which either means having pairs of function / select return value in
the DISPATCH operands or having it partly lowered as I outlined
above.

 For the overloading to work we probably have to force that the
 functions are local (so we can mangle them arbitrarily) and that
 if the function should be visible externally people add an
 externally visible dispatcher (foo in the above example would be one).


 For most of the cases, probably only the primary/default version needs
 to be publicly visible ..

Yeah.  And that one we eventually can auto-transform to use IFUNC
relocations.

Richard.


Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

2011-05-04 Thread Richard Guenther
On Tue, May 3, 2011 at 11:57 PM, Xinliang David Li davi...@google.com wrote:
 On Tue, May 3, 2011 at 3:00 AM, Richard Guenther
 richard.guent...@gmail.com wrote:
 On Tue, May 3, 2011 at 1:07 AM, Xinliang David Li davi...@google.com wrote:
 On Mon, May 2, 2011 at 2:33 PM, Richard Guenther
 richard.guent...@gmail.com wrote:
 On Mon, May 2, 2011 at 6:41 PM, Xinliang David Li davi...@google.com 
 wrote:
 On Mon, May 2, 2011 at 2:11 AM, Richard Guenther
 richard.guent...@gmail.com wrote:
 On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li davi...@google.com 
 wrote:
 Here is the background for this feature:

 1) People relies on function multi-version to explore hw features and
 squeeze performance, but there is no standard ways of doing so, either
 a) using indirect function calls with function pointers set at program
 initialization; b) using manual dispatch at each callsite; b) using
 features like IFUNC.  The dispatch mechanism needs to be promoted to
 the language level and becomes the first class citizen;

 You are not doing that, you are inventing a new (crude) GCC extension.

 To capture the high level semantics and prevent user from lowering the
 dispatch calls into forms compiler can not recognize, language
 extension is the way to go.

 I don't think so.  With your patch only two passes understand the new
 high-level form, the rest of the gimple passes are just confused.

 There is no need for other passes to understand it -- just treat it as
 opaque calls. This is goodness otherwise other passes need to be
 modified. This is true (only some passes understand it) for things
 like __builtin_expect.

 Certainly __builtin_dispatch has to be understood by alias analysis and
 all other passes that care about calls (like all IPA passes).  You can
 of course treat it conservatively (may call any function, even those
 which have their address not taken, clobber and read all memory, even
 that which doesn't escape the TU).

 Why obfuscate things when it is not necessary?

 MVed functions are usually non-trivial, so I doubt anything will be
 lost due to the obfuscation. It won't be too difficult to teach
 aliaser to 'merge' the attributes from target functions either.


 No that is not my argument. What I tried to say is it will be harder
 to achieve without high level semantics -- it requires more
 handshaking between compiler passes.

 Sure - that's life.


 We are looking at improving the life ..

 Which nobody will see benefit
 from unless they rewrite their code?

 The target users for the builtin include compiler itself -- it can
 synthesize dispatch calls.

 Hum.  I'm not at all sure the dispatch calls are the best representation
 for the IL.


 The intension is to provide an interface at both C level (for
 programmers) and IL level.  It does not have to be a builtin (both
 internally and externally)  -- but it needs to map to some language
 construct.


 Well, I say if we can improve
 _some_ of the existing usages that's better than never doing wrong
 on a new language extension.

 This is independent.

 It is not.

 One that I'm not convinced is the way
 to go (you didn't address at all the inability to use float arguments
 and the ABI issues with using variadic arguments - after all you
 did a poor-mans language extension by using GCC builtins instead
 of inventing a true one).

 This is an independent issue that either needs to be addressed or
 marked as limitation. The key of the debate is whether source/IR
 annotation using construct with high level semantics helps optimizer.
 In fact this is common. Would it make any difference (in terms of
 acceptance) if the builtin is only used internally by the compiler and
 not exposed to the user?

 No.  I don't see at all why having everything in a single stmt is so much
 more convenient.  And I don't see why existing IL features cannot be
 used to make things a little more convenient.

 Why not? The high level construct is simpler to deal with. It is all
 about doing the right optimization at the right level of abstraction.
 Set aside the question whether using builtin for MV dispatch is the
 right high level construct, looking at gcc, we can find that gcc's IR
 is pretty low level resulting in missing optimizations.

 For instance, there is no high level doloop representation -- Fortran
 do-loop needs to be lowered and raised back again -- the consequence
 is that you may not raise the loop nest into the way it was originally
 written -- perfect nested loop become non-perfect loop nest --
 blocking certain loop transformations.  Not only that, I am not sure
 it is even possible to record any loop level information anywhere --
 is it possible to have per loop attribute such as unroll factor?

 Assuming gcc can do full math function inlining (for common math
 routines) -- in this case, do we still want to do sin/cos optimization
 or rely on the scalar optimizer to optimize the inlined copies of sin
 and cos?

 Not sure about gcc, I remember that dead 

Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

2011-05-04 Thread Diego Novillo
On Wed, May 4, 2011 at 15:35, Sriraman Tallam tmsri...@google.com wrote:

        * tree-pass.h (pass_tree_convert_builtin_dispatch): New pass.
        (pass_ipa_multiversion_dispatch): New pass.
        * builtin-types.def (BT_PTR_FN_INT): New pointer type.
        (BT_FN_INT_PTR_FN_INT_PTR_PTR_VAR): New function type for 
 __builtin_dispatch.
        * builtins.def (BUILT_IN_DISPATCH): New builtin to
        support multi-version calls.
        * mversn-dispatch.c: New file.
        * timevar.de (TV_MVERSN_DISPATCH): New time var.
        * common.opt (fclone-hot-version-paths): New flag.
        * Makefile.in (mversn-dispatch.o): New rule.
        * passes.c (init_optimization_passes): Add the new
        multi-version and dispatch passes to the pass list.
        * params.def (PARAM_NUMBER_OF_MVERSN_CLONES): Define.
        (PARAM_MVERSN_CLONE_CGRAPH_DEPTH): Define.
        * doc/invoke.texi (mversn-clone-depth): Document.
        (num-mversn-clones): Document.
        (fclone-hot-version-paths): Document.
        * testsuite/gcc.dg/mversn7.c: New test.
        * testsuite/gcc.dg/mversn4.c: New test.
        * testsuite/gcc.dg/mversn4.h: New test.
        * testsuite/gcc.dg/mversn4a.c: New test.
        * testsuite/gcc.dg/torture/mversn1.c: New test.
        * testsuite/gcc.dg/mversn2.c: New test.
        * testsuite/gcc.dg/mversn6.c: New test.
        * testsuite/gcc.dg/mversn3.c: New test.
        * testsuite/g++.dg/mversn8.C: New test.
        * testsuite/g++.dg/mversn10a.C: New test.
        * testsuite/g++.dg/mversn14a.C: New test.
        * testsuite/g++.dg/tree-prof/mversn13.C: New test.
        * testsuite/g++.dg/tree-prof/mversn15.C: New test.
        * testsuite/g++.dg/tree-prof/mversn15a.C: New test.
        * testsuite/g++.dg/mversn9.C: New test.
        * testsuite/g++.dg/mversn10.C: New test.
        * testsuite/g++.dg/mversn12.C: New test.
        * testsuite/g++.dg/mversn14.C: New test.
        * testsuite/g++.dg/mversn16.C: New test.
        * testsuite/g++.dg/torture/mversn11.C: New test.
        * testsuite/g++.dg/torture/mversn5.C: New test.
        * testsuite/g++.dg/torture/mversn5.h: New test.
        * testsuite/g++.dg/torture/mversn5a.C: New test.
        * c-family/c-common.c (handle_version_selector_attribute): New 
 function.
        (c_common_attribute_table): New attribute version_selector.

OK.  Thanks for the quick fix!


Diego.


Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

2011-05-04 Thread Sriraman Tallam
I submitted the patch.

Thanks,
-Sri.

On Wed, May 4, 2011 at 3:13 PM, Diego Novillo dnovi...@google.com wrote:
 On Wed, May 4, 2011 at 15:35, Sriraman Tallam tmsri...@google.com wrote:

        * tree-pass.h (pass_tree_convert_builtin_dispatch): New pass.
        (pass_ipa_multiversion_dispatch): New pass.
        * builtin-types.def (BT_PTR_FN_INT): New pointer type.
        (BT_FN_INT_PTR_FN_INT_PTR_PTR_VAR): New function type for 
 __builtin_dispatch.
        * builtins.def (BUILT_IN_DISPATCH): New builtin to
        support multi-version calls.
        * mversn-dispatch.c: New file.
        * timevar.de (TV_MVERSN_DISPATCH): New time var.
        * common.opt (fclone-hot-version-paths): New flag.
        * Makefile.in (mversn-dispatch.o): New rule.
        * passes.c (init_optimization_passes): Add the new
        multi-version and dispatch passes to the pass list.
        * params.def (PARAM_NUMBER_OF_MVERSN_CLONES): Define.
        (PARAM_MVERSN_CLONE_CGRAPH_DEPTH): Define.
        * doc/invoke.texi (mversn-clone-depth): Document.
        (num-mversn-clones): Document.
        (fclone-hot-version-paths): Document.
        * testsuite/gcc.dg/mversn7.c: New test.
        * testsuite/gcc.dg/mversn4.c: New test.
        * testsuite/gcc.dg/mversn4.h: New test.
        * testsuite/gcc.dg/mversn4a.c: New test.
        * testsuite/gcc.dg/torture/mversn1.c: New test.
        * testsuite/gcc.dg/mversn2.c: New test.
        * testsuite/gcc.dg/mversn6.c: New test.
        * testsuite/gcc.dg/mversn3.c: New test.
        * testsuite/g++.dg/mversn8.C: New test.
        * testsuite/g++.dg/mversn10a.C: New test.
        * testsuite/g++.dg/mversn14a.C: New test.
        * testsuite/g++.dg/tree-prof/mversn13.C: New test.
        * testsuite/g++.dg/tree-prof/mversn15.C: New test.
        * testsuite/g++.dg/tree-prof/mversn15a.C: New test.
        * testsuite/g++.dg/mversn9.C: New test.
        * testsuite/g++.dg/mversn10.C: New test.
        * testsuite/g++.dg/mversn12.C: New test.
        * testsuite/g++.dg/mversn14.C: New test.
        * testsuite/g++.dg/mversn16.C: New test.
        * testsuite/g++.dg/torture/mversn11.C: New test.
        * testsuite/g++.dg/torture/mversn5.C: New test.
        * testsuite/g++.dg/torture/mversn5.h: New test.
        * testsuite/g++.dg/torture/mversn5a.C: New test.
        * c-family/c-common.c (handle_version_selector_attribute): New 
 function.
        (c_common_attribute_table): New attribute version_selector.

 OK.  Thanks for the quick fix!


 Diego.



Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

2011-05-04 Thread Xinliang David Li

 I can think of some more-or-less obvious high-level forms, one would
 for example simply stick a new DISPATCH tree into gimple_call_fn
 (similar to how we can have OBJ_TYPE_REF there), the DISPATCH
 tree would be of variable length, first operand the selector function
 and further operands function addresses.  That would keep the
 actual call visible (instead of a fake __builtin_dispatch call), something
 I'd really like to see.

This sounds like a good long term solution.




 Restricting ourselves to use the existing target attribute at the
 beginning (with a single, compiler-generated selector function)
 is probably good enough to get a prototype up and running.
 Extending it to arbitrary selector-function, value pairs using a
 new attribute is then probably easy (I don't see the exact use-case
 for that yet, but I suppose it exists if you say so).

For the use cases, CPU model will be looked at instead of just the
core architecture -- this will give use more information about the
numbrer of cores, size of caches etc. Intel's runtime library does
this checkiing at start up time so that the multi-versioned code can
look at those and make the appropriate decisions.

It will be even more complicated for arm processors -- which can have
the same processor cores but configured differently w.r.t VFP, NEON
etc.


 For the overloading to work we probably have to force that the
 functions are local (so we can mangle them arbitrarily) and that
 if the function should be visible externally people add an
 externally visible dispatcher (foo in the above example would be one).


For most of the cases, probably only the primary/default version needs
to be publicly visible ..

Thanks,

David


 Now, a language extension to support multi-versioning should be
 completely independent on any IL representation - with using
 a builtin you are tying them together with the only convenient
 mechanism we have - a mechanism that isn't optimal for either
 side IMNSHO.


 Yes -- they don't have to be tied -- they just happen to suite the
 needs of both ends -- but I see the value of the latest proposal
 (overloading) above.

 I did realize that using builtins was convenient (been there and done
 the same for some experiments ...).

 Richard.



Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

2011-05-03 Thread Richard Guenther
On Tue, May 3, 2011 at 1:07 AM, Xinliang David Li davi...@google.com wrote:
 On Mon, May 2, 2011 at 2:33 PM, Richard Guenther
 richard.guent...@gmail.com wrote:
 On Mon, May 2, 2011 at 6:41 PM, Xinliang David Li davi...@google.com wrote:
 On Mon, May 2, 2011 at 2:11 AM, Richard Guenther
 richard.guent...@gmail.com wrote:
 On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li davi...@google.com 
 wrote:
 Here is the background for this feature:

 1) People relies on function multi-version to explore hw features and
 squeeze performance, but there is no standard ways of doing so, either
 a) using indirect function calls with function pointers set at program
 initialization; b) using manual dispatch at each callsite; b) using
 features like IFUNC.  The dispatch mechanism needs to be promoted to
 the language level and becomes the first class citizen;

 You are not doing that, you are inventing a new (crude) GCC extension.

 To capture the high level semantics and prevent user from lowering the
 dispatch calls into forms compiler can not recognize, language
 extension is the way to go.

 I don't think so.  With your patch only two passes understand the new
 high-level form, the rest of the gimple passes are just confused.

 There is no need for other passes to understand it -- just treat it as
 opaque calls. This is goodness otherwise other passes need to be
 modified. This is true (only some passes understand it) for things
 like __builtin_expect.

Certainly __builtin_dispatch has to be understood by alias analysis and
all other passes that care about calls (like all IPA passes).  You can
of course treat it conservatively (may call any function, even those
which have their address not taken, clobber and read all memory, even
that which doesn't escape the TU).

Why obfuscate things when it is not necessary?


 1) the desired optimization may not happen subject to many compiler
 heuristic changes;
 2) it has other side effects such as wrong estimation of function size
 which may impact inlining

 May, may ... so you say all this can't happen under any circumstance
 with your special code and passes?

 No that is not my argument. What I tried to say is it will be harder
 to achieve without high level semantics -- it requires more
 handshaking between compiler passes.

Sure - that's life.

 Which nobody will see benefit
 from unless they rewrite their code?

 The target users for the builtin include compiler itself -- it can
 synthesize dispatch calls.

Hum.  I'm not at all sure the dispatch calls are the best representation
for the IL.

 Well, I say if we can improve
 _some_ of the existing usages that's better than never doing wrong
 on a new language extension.

 This is independent.

It is not.

 One that I'm not convinced is the way
 to go (you didn't address at all the inability to use float arguments
 and the ABI issues with using variadic arguments - after all you
 did a poor-mans language extension by using GCC builtins instead
 of inventing a true one).

 This is an independent issue that either needs to be addressed or
 marked as limitation. The key of the debate is whether source/IR
 annotation using construct with high level semantics helps optimizer.
 In fact this is common. Would it make any difference (in terms of
 acceptance) if the builtin is only used internally by the compiler and
 not exposed to the user?

No.  I don't see at all why having everything in a single stmt is so much
more convenient.  And I don't see why existing IL features cannot be
used to make things a little more convenient.

 3) it limits the lowering into one form which may not be ideal  --
 with builtin_dispatch, after hoisting optimization, the lowering can
 use more efficient IFUNC scheme, for instance.

 I see no reason why we cannot transform a switch-indirect-call
 pattern into an IFUNC call.


 It is possible -- but it is like asking user to lower the dispatch and
 tell compiler to raise it again ..

There is no possibility for a high-level dispatch at the source level.
And if I'd have to design one I would use function overloading, like

float compute_sth (float) __attribute__((version(sse4)))
{
  ... sse4 code ...
}

float compute_sth (float)
{
  ... fallback ...
}

float foo (float f)
{
  return compute_sth (f);
}

and if you not only want to dispatch for target features you could
specify a selector function and value in the attribute.  You might
notice that the above eventually matches the target attribute
directly, just the frontends need to be taught to emit dispatch
code whenever overload resolution results in ambiguities involving
target attribute differences.

Now, a language extension to support multi-versioning should be
completely independent on any IL representation - with using
a builtin you are tying them together with the only convenient
mechanism we have - a mechanism that isn't optimal for either
side IMNSHO.

 My point is that such optimization is completely independent of
 that dispatch thing.  The 

Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

2011-05-03 Thread Richard Guenther
On Tue, May 3, 2011 at 12:00 PM, Richard Guenther
richard.guent...@gmail.com wrote:
 3) it limits the lowering into one form which may not be ideal  --
 with builtin_dispatch, after hoisting optimization, the lowering can
 use more efficient IFUNC scheme, for instance.

 I see no reason why we cannot transform a switch-indirect-call
 pattern into an IFUNC call.


 It is possible -- but it is like asking user to lower the dispatch and
 tell compiler to raise it again ..

 There is no possibility for a high-level dispatch at the source level.
 And if I'd have to design one I would use function overloading, like

 float compute_sth (float) __attribute__((version(sse4)))
 {
  ... sse4 code ...
 }

 float compute_sth (float)
 {
  ... fallback ...
 }

 float foo (float f)
 {
  return compute_sth (f);
 }

 and if you not only want to dispatch for target features you could
 specify a selector function and value in the attribute.  You might
 notice that the above eventually matches the target attribute
 directly, just the frontends need to be taught to emit dispatch
 code whenever overload resolution results in ambiguities involving
 target attribute differences.

Which also would allow the frontend to directly call the sse4 variant
if you compile with -msse4, avoiding the need for any fancy processing.

Richard.


Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

2011-05-03 Thread Joseph S. Myers
On Tue, 3 May 2011, Mike Stump wrote:

 And to go one step further, if we had this, we could use this to define 
 all data manipulation machine built-ins as generic functions, available 
 to all compiles as normal c code, so portable code could use them 
 everywhere, and on platforms that had instructions for any of them, they 
 could define them as normal built-ins.

I don't think the forms in which some of the machine-specific built-in 
functions exist are particularly good for being available everywhere - 
they are typically defined to correspond exactly to the defined semantics 
of a particular instruction, complete with e.g. the peculiarities of how 
overflow works for that instruction.  But:

* I think it does make sense to make a range of vector operations, 
saturating operations (see the discussion in PR 48580) etc. available as 
generic GNU C on all platforms, with generically defined semantics 
(consistent with how C generally handles things such as overflow), where 
instruction sets commonly have relevant instructions but it isn't readily 
possible to represent the operations with generic C and pattern-match 
them.  (Just as we provide operations such as __builtin_popcount and 
__builtin_clz, for example - and note that __builtin_clz has undefined 
value at 0 rather than being defined to match a machine instruction.)

* Once the operations have generic GNU C, GIMPLE and RTL descriptions, 
intrinsic headers should best use the GNU C representations if possible 
instead of calling built-in functions, and any built-in functions should 
expand to the generic GIMPLE and RTL instead of using UNSPECs.

* It may also make sense for a target architecture to define its 
intrinsics in such a way that they are available on that architecture even 
when the relevant instructions aren't available in a particular 
subarchitecture.  This may be defining the intrinsics in a header if those 
are the public interface, instead of the built-in functions.  See 
a thread starting at http://gcc.gnu.org/ml/gcc/2010-11/msg00546.html for 
some discussion of doing this for SSE, for example.

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

2011-05-03 Thread Xinliang David Li
On Tue, May 3, 2011 at 3:00 AM, Richard Guenther
richard.guent...@gmail.com wrote:
 On Tue, May 3, 2011 at 1:07 AM, Xinliang David Li davi...@google.com wrote:
 On Mon, May 2, 2011 at 2:33 PM, Richard Guenther
 richard.guent...@gmail.com wrote:
 On Mon, May 2, 2011 at 6:41 PM, Xinliang David Li davi...@google.com 
 wrote:
 On Mon, May 2, 2011 at 2:11 AM, Richard Guenther
 richard.guent...@gmail.com wrote:
 On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li davi...@google.com 
 wrote:
 Here is the background for this feature:

 1) People relies on function multi-version to explore hw features and
 squeeze performance, but there is no standard ways of doing so, either
 a) using indirect function calls with function pointers set at program
 initialization; b) using manual dispatch at each callsite; b) using
 features like IFUNC.  The dispatch mechanism needs to be promoted to
 the language level and becomes the first class citizen;

 You are not doing that, you are inventing a new (crude) GCC extension.

 To capture the high level semantics and prevent user from lowering the
 dispatch calls into forms compiler can not recognize, language
 extension is the way to go.

 I don't think so.  With your patch only two passes understand the new
 high-level form, the rest of the gimple passes are just confused.

 There is no need for other passes to understand it -- just treat it as
 opaque calls. This is goodness otherwise other passes need to be
 modified. This is true (only some passes understand it) for things
 like __builtin_expect.

 Certainly __builtin_dispatch has to be understood by alias analysis and
 all other passes that care about calls (like all IPA passes).  You can
 of course treat it conservatively (may call any function, even those
 which have their address not taken, clobber and read all memory, even
 that which doesn't escape the TU).

 Why obfuscate things when it is not necessary?

MVed functions are usually non-trivial, so I doubt anything will be
lost due to the obfuscation. It won't be too difficult to teach
aliaser to 'merge' the attributes from target functions either.


 No that is not my argument. What I tried to say is it will be harder
 to achieve without high level semantics -- it requires more
 handshaking between compiler passes.

 Sure - that's life.


We are looking at improving the life ..

 Which nobody will see benefit
 from unless they rewrite their code?

 The target users for the builtin include compiler itself -- it can
 synthesize dispatch calls.

 Hum.  I'm not at all sure the dispatch calls are the best representation
 for the IL.


The intension is to provide an interface at both C level (for
programmers) and IL level.  It does not have to be a builtin (both
internally and externally)  -- but it needs to map to some language
construct.


 Well, I say if we can improve
 _some_ of the existing usages that's better than never doing wrong
 on a new language extension.

 This is independent.

 It is not.

 One that I'm not convinced is the way
 to go (you didn't address at all the inability to use float arguments
 and the ABI issues with using variadic arguments - after all you
 did a poor-mans language extension by using GCC builtins instead
 of inventing a true one).

 This is an independent issue that either needs to be addressed or
 marked as limitation. The key of the debate is whether source/IR
 annotation using construct with high level semantics helps optimizer.
 In fact this is common. Would it make any difference (in terms of
 acceptance) if the builtin is only used internally by the compiler and
 not exposed to the user?

 No.  I don't see at all why having everything in a single stmt is so much
 more convenient.  And I don't see why existing IL features cannot be
 used to make things a little more convenient.

Why not? The high level construct is simpler to deal with. It is all
about doing the right optimization at the right level of abstraction.
Set aside the question whether using builtin for MV dispatch is the
right high level construct, looking at gcc, we can find that gcc's IR
is pretty low level resulting in missing optimizations.

For instance, there is no high level doloop representation -- Fortran
do-loop needs to be lowered and raised back again -- the consequence
is that you may not raise the loop nest into the way it was originally
written -- perfect nested loop become non-perfect loop nest --
blocking certain loop transformations.  Not only that, I am not sure
it is even possible to record any loop level information anywhere --
is it possible to have per loop attribute such as unroll factor?

Assuming gcc can do full math function inlining (for common math
routines) -- in this case, do we still want to do sin/cos optimization
or rely on the scalar optimizer to optimize the inlined copies of sin
and cos?

Not sure about gcc, I remember that dead temporary variable removal
can be very hard to do if some intrinsic gets lowered too early
introducing 

Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

2011-05-02 Thread Richard Guenther
On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li davi...@google.com wrote:
 Here is the background for this feature:

 1) People relies on function multi-version to explore hw features and
 squeeze performance, but there is no standard ways of doing so, either
 a) using indirect function calls with function pointers set at program
 initialization; b) using manual dispatch at each callsite; b) using
 features like IFUNC.  The dispatch mechanism needs to be promoted to
 the language level and becomes the first class citizen;

You are not doing that, you are inventing a new (crude) GCC extension.

 2) Most importantly, the above non standard approaches block
 interprocedural optimizations such as inlining. With the introduction
 of buitlin_dispatch and the clear semantics, compiler can do more
 aggressive optimization.

I don't think so, but the previous mail lacked detail on why the new
scheme would be better.

 Multi-way dispatch will be good, but most of the use cases we have
 seen is 2-way -- a fast path + a default path.

 Who are the targeted consumer of the feature?

 1) For developers who like to MV function manually;
 2) For user directed function cloning

   e.g,
    int foo (...) __attribute__((clone (sse)):

 3) Automatic function cloning determined by compiler.



 I am not sure that combining the function choice and its invocation
 to a single builtin is good.  First, you use variadic arguments for
 the actual function arguments which will cause vararg promotions
 to apply and thus it will be impossible to dispatch to functions
 taking single precision float values (and dependent on ABI details
 many more similar issues).  Second, you restrict yourself to
 only two versions of a function - that looks quite limited and this
 restriction is not easy to lift with your proposed interface.

 See above. Multi-way dispatch can be added similarly.

Not with the specified syntax.  So you'd end up with _two_
language extensions.  That's bad (and unacceptable, IMHO).



 That's a nice idea, but why not simply process all functions with
 a const attribute and no arguments this way?  IMHO

 int testsse (void) __attribute__((const));

 is as good and avoids the new attribute completely.  The pass would
 also benefit other uses of this idiom (giving a way to have global
 dynamic initializers in C).  The functions may not be strictly 'const'
 in a way the compiler autodetects this attribute but it presents
 exactly the promises to the compiler that allow this optimization.

 Thus, please split this piece out into a separate patch and use
 the const attribute.


 Sounds good.


  What happens with cloning, -fclone-hot-version-paths ?
  -

 Now, here you lost me somewhat, because I didn't look into the
 patch details and I am missing an example on how the lowered
 IL looks before that cloning.  So for now I suppose this
 -fclone-hot-version-paths
 is to expose direct calls to the IL.  If you would lower __builtin_dispatch
 to a style like

  int sel = selector ();
  switch (sel)
     {
     case 0:
       fn = popcnt;
      break;
     case 1:
       fn = popcnt_sse4;
       break;
     }
   res = (*fn) (25);

 then rather than a new pass specialized for __builtin_dispatch existing
 optimization passes that are able to do tracing like VRP and DOM
 via jump-threading or the tracer pass should be able to do this
 optimization for you.  If they do not use FDO in a good way it is better
 to improve them for this case instead of writing a new pass.

 What you describe may not work

 1) the function selection may happen in a different function;

Well, sure.  I propose to have the above as lowered form.  If people
deliberately obfuscate code it's their fault.  Your scheme simply
makes that obfuscation impossible (or less likely possible).  So
1) is not a valid argument.

 2) Compiler will need to hoist the selection into the program
 initializer to avoid overhead

?  Isn't that the point of the const function call optimization which
I said was a good thing anyway?  So, after that it would be

 int sel = some_global_static;
 ...

 As an example of why dispatch hoisting and call chain cloning is needed:

 void foo();
 void bar();

 void doit_v1();
 void doit_v2();
 bool check_v () __attribute__((const));

 int test();


 void bar()
 {
    
    for (.)
     {
         foo();
         
     }
 }

 void foo()
 {
   ...
   for (...)
   {
      __builtin_dispatch(check_v, doit_v1, doit_v2,...);
     ...
   }
 }


 int test ()
 {
    ..
   bar ();
 }


 The feature testing and dispatch is embedded in a 2-deep loop nest
 crossing function boundaries. The call paths test -- bar -- foo
 needs to be cloned. This is done by hoisting dispatch up the call
 chain -- it ends up :


 void bar_v1()
 {
   
   for (..)
    {
      foo_v1 ();
    }
  ..
 }

 void bar_v2 ()
 {
    ...
    for (..)
    {
      foo_v2();
    }
   ..
 }

 void foo_v1 ()
 {
   ..
   for ()
    

Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

2011-05-02 Thread Xinliang David Li
On Mon, May 2, 2011 at 2:11 AM, Richard Guenther
richard.guent...@gmail.com wrote:
 On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li davi...@google.com wrote:
 Here is the background for this feature:

 1) People relies on function multi-version to explore hw features and
 squeeze performance, but there is no standard ways of doing so, either
 a) using indirect function calls with function pointers set at program
 initialization; b) using manual dispatch at each callsite; b) using
 features like IFUNC.  The dispatch mechanism needs to be promoted to
 the language level and becomes the first class citizen;

 You are not doing that, you are inventing a new (crude) GCC extension.

To capture the high level semantics and prevent user from lowering the
dispatch calls into forms compiler can not recognize, language
extension is the way to go.



 See above. Multi-way dispatch can be added similarly.

 Not with the specified syntax.  So you'd end up with _two_
 language extensions.  That's bad (and unacceptable, IMHO).

This part can be improved.




 That's a nice idea, but why not simply process all functions with
 a const attribute and no arguments this way?  IMHO

 int testsse (void) __attribute__((const));

 is as good and avoids the new attribute completely.  The pass would
 also benefit other uses of this idiom (giving a way to have global
 dynamic initializers in C).  The functions may not be strictly 'const'
 in a way the compiler autodetects this attribute but it presents
 exactly the promises to the compiler that allow this optimization.

 Thus, please split this piece out into a separate patch and use
 the const attribute.


 Sounds good.



 1) the function selection may happen in a different function;

 Well, sure.  I propose to have the above as lowered form.  If people
 deliberately obfuscate code it's their fault.  Your scheme simply
 makes that obfuscation impossible (or less likely possible).  So
 1) is not a valid argument.

Lowering too early may defeat the purpose because

1) the desired optimization may not happen subject to many compiler
heuristic changes;
2) it has other side effects such as wrong estimation of function size
which may impact inlining
3) it limits the lowering into one form which may not be ideal  --
with builtin_dispatch, after hoisting optimization, the lowering can
use more efficient IFUNC scheme, for instance.



 My point is that such optimization is completely independent of
 that dispatch thing.  The above could as well be a selection to
 use different input arrays, one causing alias analysis issues
 and one not.

 Thus, a __builtin_dispatch centric optimization pass is the wrong
 way to go.

I agree that many things can implemented in different ways, but a high
level standard builtin_dispatch mechanism doing hoisting
interprocedcurally is cleaner and simpler and targeted for function
multi-versioning -- it does not depend on/rely on later phase's
heuristic tunings to make the right things to happen. Function MV
deserves this level of treatment as it will become more and more
important for some users (e.g., Google).


 Now, with FDO I'd expect the foo is inlined into bar (because foo
 is deemed hot),

That is a myth -- the truth is that there are other heuristics which
can prevent this from happening.

 then you only need to deal with loop unswitching,
 which should be easy to drive from FDO.

Same here -- the loop body may not be well formed/recognized. The loop
nests may not be perfectly nested, or other unswitching heuristics may
block it from happening.  This is the common problem form many other
things that get lowered too early. It is cleaner to make the high
level transformation first in IPA, and let unswitching dealing with
intra-procedural optimization.

Thanks,

David


 Richard.



 Thanks,

 David




 Note that we already have special FDO support for indirect to direct
 call promotion, so that might work already.

 Thus, with all this, __builtin_dispatch would be more like syntactic
 sugar to avoid writing above switch statement yourself.  If you restrict
 that sugar to a constant number of candidates it can be replaced by
 a macro (or several ones, specialized for the number of candidates).

 Richard.

  For the call graph that looks like this before cloning :

 func_cold  func_hot  findOnes  IsPopCntSupported  
 popcnt
                                                |
                                                - no_popcnt

  where popcnt and no_popcnt are the multi-versioned functions, then after 
 cloning :

 func_cold IsPopCntSupported  func_hot_clone0  
 findOnes_clone0  popcnt
                              |
                                func_hot_clone1  findOnes_clone1 
  no_popcnt




  Flags : -fclone-hot-version-paths does function unswitching via cloning.
          --param=num-mversn-clones=num allows to specify the number of
          functions that should be cloned.
          

Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

2011-05-02 Thread Sriraman Tallam
Hi,

  I want to submit this patch to google/main to make sure it is
available for our internal use at Google in order to materialize some
optimization opportunities. Let us continue this dicussion as I make
changes and submit this for review for trunk.

Thanks,
-Sri.


On Mon, May 2, 2011 at 9:41 AM, Xinliang David Li davi...@google.com wrote:
 On Mon, May 2, 2011 at 2:11 AM, Richard Guenther
 richard.guent...@gmail.com wrote:
 On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li davi...@google.com 
 wrote:
 Here is the background for this feature:

 1) People relies on function multi-version to explore hw features and
 squeeze performance, but there is no standard ways of doing so, either
 a) using indirect function calls with function pointers set at program
 initialization; b) using manual dispatch at each callsite; b) using
 features like IFUNC.  The dispatch mechanism needs to be promoted to
 the language level and becomes the first class citizen;

 You are not doing that, you are inventing a new (crude) GCC extension.

 To capture the high level semantics and prevent user from lowering the
 dispatch calls into forms compiler can not recognize, language
 extension is the way to go.



 See above. Multi-way dispatch can be added similarly.

 Not with the specified syntax.  So you'd end up with _two_
 language extensions.  That's bad (and unacceptable, IMHO).

 This part can be improved.




 That's a nice idea, but why not simply process all functions with
 a const attribute and no arguments this way?  IMHO

 int testsse (void) __attribute__((const));

 is as good and avoids the new attribute completely.  The pass would
 also benefit other uses of this idiom (giving a way to have global
 dynamic initializers in C).  The functions may not be strictly 'const'
 in a way the compiler autodetects this attribute but it presents
 exactly the promises to the compiler that allow this optimization.

 Thus, please split this piece out into a separate patch and use
 the const attribute.


 Sounds good.



 1) the function selection may happen in a different function;

 Well, sure.  I propose to have the above as lowered form.  If people
 deliberately obfuscate code it's their fault.  Your scheme simply
 makes that obfuscation impossible (or less likely possible).  So
 1) is not a valid argument.

 Lowering too early may defeat the purpose because

 1) the desired optimization may not happen subject to many compiler
 heuristic changes;
 2) it has other side effects such as wrong estimation of function size
 which may impact inlining
 3) it limits the lowering into one form which may not be ideal  --
 with builtin_dispatch, after hoisting optimization, the lowering can
 use more efficient IFUNC scheme, for instance.



 My point is that such optimization is completely independent of
 that dispatch thing.  The above could as well be a selection to
 use different input arrays, one causing alias analysis issues
 and one not.

 Thus, a __builtin_dispatch centric optimization pass is the wrong
 way to go.

 I agree that many things can implemented in different ways, but a high
 level standard builtin_dispatch mechanism doing hoisting
 interprocedcurally is cleaner and simpler and targeted for function
 multi-versioning -- it does not depend on/rely on later phase's
 heuristic tunings to make the right things to happen. Function MV
 deserves this level of treatment as it will become more and more
 important for some users (e.g., Google).


 Now, with FDO I'd expect the foo is inlined into bar (because foo
 is deemed hot),

 That is a myth -- the truth is that there are other heuristics which
 can prevent this from happening.

 then you only need to deal with loop unswitching,
 which should be easy to drive from FDO.

 Same here -- the loop body may not be well formed/recognized. The loop
 nests may not be perfectly nested, or other unswitching heuristics may
 block it from happening.  This is the common problem form many other
 things that get lowered too early. It is cleaner to make the high
 level transformation first in IPA, and let unswitching dealing with
 intra-procedural optimization.

 Thanks,

 David


 Richard.



 Thanks,

 David




 Note that we already have special FDO support for indirect to direct
 call promotion, so that might work already.

 Thus, with all this, __builtin_dispatch would be more like syntactic
 sugar to avoid writing above switch statement yourself.  If you restrict
 that sugar to a constant number of candidates it can be replaced by
 a macro (or several ones, specialized for the number of candidates).

 Richard.

  For the call graph that looks like this before cloning :

 func_cold  func_hot  findOnes  IsPopCntSupported  
 popcnt
                                                |
                                                - no_popcnt

  where popcnt and no_popcnt are the multi-versioned functions, then after 
 cloning :

 func_cold IsPopCntSupported  

Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

2011-05-02 Thread Xinliang David Li
Ok. There may be more correctness related comments -- but those can be
addressed when available. For trunk, you need to address issues such
as multi-way dispatch.

Thanks,

David

On Mon, May 2, 2011 at 12:11 PM, Sriraman Tallam tmsri...@google.com wrote:
 Hi,

  I want to submit this patch to google/main to make sure it is
 available for our internal use at Google in order to materialize some
 optimization opportunities. Let us continue this dicussion as I make
 changes and submit this for review for trunk.

 Thanks,
 -Sri.


 On Mon, May 2, 2011 at 9:41 AM, Xinliang David Li davi...@google.com wrote:
 On Mon, May 2, 2011 at 2:11 AM, Richard Guenther
 richard.guent...@gmail.com wrote:
 On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li davi...@google.com 
 wrote:
 Here is the background for this feature:

 1) People relies on function multi-version to explore hw features and
 squeeze performance, but there is no standard ways of doing so, either
 a) using indirect function calls with function pointers set at program
 initialization; b) using manual dispatch at each callsite; b) using
 features like IFUNC.  The dispatch mechanism needs to be promoted to
 the language level and becomes the first class citizen;

 You are not doing that, you are inventing a new (crude) GCC extension.

 To capture the high level semantics and prevent user from lowering the
 dispatch calls into forms compiler can not recognize, language
 extension is the way to go.



 See above. Multi-way dispatch can be added similarly.

 Not with the specified syntax.  So you'd end up with _two_
 language extensions.  That's bad (and unacceptable, IMHO).

 This part can be improved.




 That's a nice idea, but why not simply process all functions with
 a const attribute and no arguments this way?  IMHO

 int testsse (void) __attribute__((const));

 is as good and avoids the new attribute completely.  The pass would
 also benefit other uses of this idiom (giving a way to have global
 dynamic initializers in C).  The functions may not be strictly 'const'
 in a way the compiler autodetects this attribute but it presents
 exactly the promises to the compiler that allow this optimization.

 Thus, please split this piece out into a separate patch and use
 the const attribute.


 Sounds good.



 1) the function selection may happen in a different function;

 Well, sure.  I propose to have the above as lowered form.  If people
 deliberately obfuscate code it's their fault.  Your scheme simply
 makes that obfuscation impossible (or less likely possible).  So
 1) is not a valid argument.

 Lowering too early may defeat the purpose because

 1) the desired optimization may not happen subject to many compiler
 heuristic changes;
 2) it has other side effects such as wrong estimation of function size
 which may impact inlining
 3) it limits the lowering into one form which may not be ideal  --
 with builtin_dispatch, after hoisting optimization, the lowering can
 use more efficient IFUNC scheme, for instance.



 My point is that such optimization is completely independent of
 that dispatch thing.  The above could as well be a selection to
 use different input arrays, one causing alias analysis issues
 and one not.

 Thus, a __builtin_dispatch centric optimization pass is the wrong
 way to go.

 I agree that many things can implemented in different ways, but a high
 level standard builtin_dispatch mechanism doing hoisting
 interprocedcurally is cleaner and simpler and targeted for function
 multi-versioning -- it does not depend on/rely on later phase's
 heuristic tunings to make the right things to happen. Function MV
 deserves this level of treatment as it will become more and more
 important for some users (e.g., Google).


 Now, with FDO I'd expect the foo is inlined into bar (because foo
 is deemed hot),

 That is a myth -- the truth is that there are other heuristics which
 can prevent this from happening.

 then you only need to deal with loop unswitching,
 which should be easy to drive from FDO.

 Same here -- the loop body may not be well formed/recognized. The loop
 nests may not be perfectly nested, or other unswitching heuristics may
 block it from happening.  This is the common problem form many other
 things that get lowered too early. It is cleaner to make the high
 level transformation first in IPA, and let unswitching dealing with
 intra-procedural optimization.

 Thanks,

 David


 Richard.



 Thanks,

 David




 Note that we already have special FDO support for indirect to direct
 call promotion, so that might work already.

 Thus, with all this, __builtin_dispatch would be more like syntactic
 sugar to avoid writing above switch statement yourself.  If you restrict
 that sugar to a constant number of candidates it can be replaced by
 a macro (or several ones, specialized for the number of candidates).

 Richard.

  For the call graph that looks like this before cloning :

 func_cold  func_hot  findOnes  

Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

2011-05-02 Thread Richard Guenther
On Mon, May 2, 2011 at 6:41 PM, Xinliang David Li davi...@google.com wrote:
 On Mon, May 2, 2011 at 2:11 AM, Richard Guenther
 richard.guent...@gmail.com wrote:
 On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li davi...@google.com 
 wrote:
 Here is the background for this feature:

 1) People relies on function multi-version to explore hw features and
 squeeze performance, but there is no standard ways of doing so, either
 a) using indirect function calls with function pointers set at program
 initialization; b) using manual dispatch at each callsite; b) using
 features like IFUNC.  The dispatch mechanism needs to be promoted to
 the language level and becomes the first class citizen;

 You are not doing that, you are inventing a new (crude) GCC extension.

 To capture the high level semantics and prevent user from lowering the
 dispatch calls into forms compiler can not recognize, language
 extension is the way to go.

I don't think so.  With your patch only two passes understand the new
high-level form, the rest of the gimple passes are just confused.



 See above. Multi-way dispatch can be added similarly.

 Not with the specified syntax.  So you'd end up with _two_
 language extensions.  That's bad (and unacceptable, IMHO).

 This part can be improved.




 That's a nice idea, but why not simply process all functions with
 a const attribute and no arguments this way?  IMHO

 int testsse (void) __attribute__((const));

 is as good and avoids the new attribute completely.  The pass would
 also benefit other uses of this idiom (giving a way to have global
 dynamic initializers in C).  The functions may not be strictly 'const'
 in a way the compiler autodetects this attribute but it presents
 exactly the promises to the compiler that allow this optimization.

 Thus, please split this piece out into a separate patch and use
 the const attribute.


 Sounds good.



 1) the function selection may happen in a different function;

 Well, sure.  I propose to have the above as lowered form.  If people
 deliberately obfuscate code it's their fault.  Your scheme simply
 makes that obfuscation impossible (or less likely possible).  So
 1) is not a valid argument.

 Lowering too early may defeat the purpose because

 1) the desired optimization may not happen subject to many compiler
 heuristic changes;
 2) it has other side effects such as wrong estimation of function size
 which may impact inlining

May, may ... so you say all this can't happen under any circumstance
with your special code and passes?  Which nobody will see benefit
from unless they rewrite their code?  Well, I say if we can improve
_some_ of the existing usages that's better than never doing wrong
on a new language extension.  One that I'm not convinced is the way
to go (you didn't address at all the inability to use float arguments
and the ABI issues with using variadic arguments - after all you
did a poor-mans language extension by using GCC builtins instead
of inventing a true one).

 3) it limits the lowering into one form which may not be ideal  --
 with builtin_dispatch, after hoisting optimization, the lowering can
 use more efficient IFUNC scheme, for instance.

I see no reason why we cannot transform a switch-indirect-call
pattern into an IFUNC call.

 My point is that such optimization is completely independent of
 that dispatch thing.  The above could as well be a selection to
 use different input arrays, one causing alias analysis issues
 and one not.

 Thus, a __builtin_dispatch centric optimization pass is the wrong
 way to go.

 I agree that many things can implemented in different ways, but a high
 level standard builtin_dispatch mechanism doing hoisting
 interprocedcurally is cleaner and simpler and targeted for function
 multi-versioning -- it does not depend on/rely on later phase's
 heuristic tunings to make the right things to happen. Function MV
 deserves this level of treatment as it will become more and more
 important for some users (e.g., Google).

But inventing a new language extension to benefit from whatever
improvements we implement isn't the obviously best choice.

 Now, with FDO I'd expect the foo is inlined into bar (because foo
 is deemed hot),

 That is a myth -- the truth is that there are other heuristics which
 can prevent this from happening.

Huh, sure.  That doesn't make my expectation a myth.

 then you only need to deal with loop unswitching,
 which should be easy to drive from FDO.

 Same here -- the loop body may not be well formed/recognized. The loop
 nests may not be perfectly nested, or other unswitching heuristics may
 block it from happening.  This is the common problem form many other
 things that get lowered too early. It is cleaner to make the high
 level transformation first in IPA, and let unswitching dealing with
 intra-procedural optimization.

If it's not well-formed inlining the call does not make it well-formed and
thus it won't be optimized well anyway.  Btw, for the usual cases I

Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

2011-05-02 Thread Xinliang David Li
On Mon, May 2, 2011 at 2:33 PM, Richard Guenther
richard.guent...@gmail.com wrote:
 On Mon, May 2, 2011 at 6:41 PM, Xinliang David Li davi...@google.com wrote:
 On Mon, May 2, 2011 at 2:11 AM, Richard Guenther
 richard.guent...@gmail.com wrote:
 On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li davi...@google.com 
 wrote:
 Here is the background for this feature:

 1) People relies on function multi-version to explore hw features and
 squeeze performance, but there is no standard ways of doing so, either
 a) using indirect function calls with function pointers set at program
 initialization; b) using manual dispatch at each callsite; b) using
 features like IFUNC.  The dispatch mechanism needs to be promoted to
 the language level and becomes the first class citizen;

 You are not doing that, you are inventing a new (crude) GCC extension.

 To capture the high level semantics and prevent user from lowering the
 dispatch calls into forms compiler can not recognize, language
 extension is the way to go.

 I don't think so.  With your patch only two passes understand the new
 high-level form, the rest of the gimple passes are just confused.

There is no need for other passes to understand it -- just treat it as
opaque calls. This is goodness otherwise other passes need to be
modified. This is true (only some passes understand it) for things
like __builtin_expect.



 1) the desired optimization may not happen subject to many compiler
 heuristic changes;
 2) it has other side effects such as wrong estimation of function size
 which may impact inlining

 May, may ... so you say all this can't happen under any circumstance
 with your special code and passes?

No that is not my argument. What I tried to say is it will be harder
to achieve without high level semantics -- it requires more
handshaking between compiler passes.

 Which nobody will see benefit
 from unless they rewrite their code?

The target users for the builtin include compiler itself -- it can
synthesize dispatch calls.

 Well, I say if we can improve
 _some_ of the existing usages that's better than never doing wrong
 on a new language extension.

This is independent.

 One that I'm not convinced is the way
 to go (you didn't address at all the inability to use float arguments
 and the ABI issues with using variadic arguments - after all you
 did a poor-mans language extension by using GCC builtins instead
 of inventing a true one).

This is an independent issue that either needs to be addressed or
marked as limitation. The key of the debate is whether source/IR
annotation using construct with high level semantics helps optimizer.
In fact this is common. Would it make any difference (in terms of
acceptance) if the builtin is only used internally by the compiler and
not exposed to the user?


 3) it limits the lowering into one form which may not be ideal  --
 with builtin_dispatch, after hoisting optimization, the lowering can
 use more efficient IFUNC scheme, for instance.

 I see no reason why we cannot transform a switch-indirect-call
 pattern into an IFUNC call.


It is possible -- but it is like asking user to lower the dispatch and
tell compiler to raise it again ..

 My point is that such optimization is completely independent of
 that dispatch thing.  The above could as well be a selection to
 use different input arrays, one causing alias analysis issues
 and one not.

 Thus, a __builtin_dispatch centric optimization pass is the wrong
 way to go.

 I agree that many things can implemented in different ways, but a high
 level standard builtin_dispatch mechanism doing hoisting
 interprocedcurally is cleaner and simpler and targeted for function
 multi-versioning -- it does not depend on/rely on later phase's
 heuristic tunings to make the right things to happen. Function MV
 deserves this level of treatment as it will become more and more
 important for some users (e.g., Google).

 But inventing a new language extension to benefit from whatever
 improvements we implement isn't the obviously best choice.

It is not for any improvement. I mentioned the potential for function
MV and want to have a compiler infrastructure to deal with it.



 Now, with FDO I'd expect the foo is inlined into bar (because foo
 is deemed hot),

 That is a myth -- the truth is that there are other heuristics which
 can prevent this from happening.

 Huh, sure.  That doesn't make my expectation a myth.

 then you only need to deal with loop unswitching,
 which should be easy to drive from FDO.

 Same here -- the loop body may not be well formed/recognized. The loop
 nests may not be perfectly nested, or other unswitching heuristics may
 block it from happening.  This is the common problem form many other
 things that get lowered too early. It is cleaner to make the high
 level transformation first in IPA, and let unswitching dealing with
 intra-procedural optimization.

 If it's not well-formed inlining the call does not make it well-formed and
 thus it won't be 

Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

2011-04-29 Thread Richard Guenther
On Fri, Apr 29, 2011 at 4:52 AM, Sriraman Tallam tmsri...@google.com wrote:
 I want this patch to be considered for google/main now. This is intended to 
 be submitted to trunk for review soon.
 This patch has been tested with crosstool bootstrap using buildit and by 
 running all tests.


 Patch Description :
 ==

 Frequently executed functions in programs are some times compiled
 into different versions to take advantage of some specific
 advanced features present in some of the types of platforms on
 which the program is executed. For instance, SSE4 versus no-SSE4.
 Existing techniques to do multi-versioning, for example using
 indirect calls, block compiler optimizations, mainly inlining.

The general idea of supporting versioning is good, thanks for working on it.
Comments below.

 This patch adds a new GCC pass to support multi-versioned calls
 through a new builtin, __builtin_dispatch.  The __builtin_dispatch
 call is converted into a simple if-then construct to call the
 appropriate version at run-time. There are no indirect calls so
 inlining is not affected.

I am not sure that combining the function choice and its invocation
to a single builtin is good.  First, you use variadic arguments for
the actual function arguments which will cause vararg promotions
to apply and thus it will be impossible to dispatch to functions
taking single precision float values (and dependent on ABI details
many more similar issues).  Second, you restrict yourself to
only two versions of a function - that looks quite limited and this
restriction is not easy to lift with your proposed interface.

Thus, I would have kept regular (but indirect) calls in the source
program and only exposed the dispatch via a builtin, like ...

 This patch also supports a function unswitching optimization via
 cloning of functions, only along hot paths, that call multi-versioned
 functions so that the hot multi-versioned paths can be independently
 optimized for performance. The cloning optimization is switched on
 via a flag, -fclone-hot-version-paths.

 Only two versions are supported in this patch. Example use :

  int popcnt_sse4(unsigned int x) __attribute__((__target__(popcnt)));
  int popcnt_sse4(unsigned int x)
  {
    int count = __builtin_popcount(x);
    return count;
  }

  int popcnt(unsigned int x) __attribute__((__target__(no-popcnt)));
  int popcnt(unsigned int x)
  {
    int count = __builtin_popcount(x);
    return count;
  }

  int testsse() __attribute__((version_selector));
  int main ()
  {
    ...
    /* The multi-versioned call. */
    ret = __builtin_dispatch (testsse,  (void*)popcnt_sse4, (void*)popcnt, 25);
    ...
  }

int main()
{
  int (*ppcnt)(unsigned int) = __builtin_dispatch (testsse,
popcnt_sse4, popcnt);
  ret = (*ppcnt) (25);
}

which would allow the testsse function to return the argument number of
the function to select.

[snip]

  When to use the version_selector attribute ?
  ---

  Functions are marked with attribute version_selector only if
  they are run-time constants.  Example of such functions would
  be those that test if a specific feature is available on a
  particular architecture.  Such functions must return a positive
  integer. For two-way functions, those that test if a feature
  is present or not must return 1 or 0 respectively.

  This patch will make constructors that call these function once
  and assign the outcome to a global variable. From then on, only
  the value of the global will be used to decide which version to
  execute.

That's a nice idea, but why not simply process all functions with
a const attribute and no arguments this way?  IMHO

int testsse (void) __attribute__((const));

is as good and avoids the new attribute completely.  The pass would
also benefit other uses of this idiom (giving a way to have global
dynamic initializers in C).  The functions may not be strictly 'const'
in a way the compiler autodetects this attribute but it presents
exactly the promises to the compiler that allow this optimization.

Thus, please split this piece out into a separate patch and use
the const attribute.

  What happens with cloning, -fclone-hot-version-paths ?
  -

Now, here you lost me somewhat, because I didn't look into the
patch details and I am missing an example on how the lowered
IL looks before that cloning.  So for now I suppose this
-fclone-hot-version-paths
is to expose direct calls to the IL.  If you would lower __builtin_dispatch
to a style like

  int sel = selector ();
  switch (sel)
 {
 case 0:
   fn = popcnt;
  break;
 case 1:
   fn = popcnt_sse4;
   break;
 }
   res = (*fn) (25);

then rather than a new pass specialized for __builtin_dispatch existing
optimization passes that are able to do tracing like VRP and DOM
via jump-threading or the tracer pass should be able to do this
optimization for you.  If they do not 

Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

2011-04-29 Thread Xinliang David Li
Here is the background for this feature:

1) People relies on function multi-version to explore hw features and
squeeze performance, but there is no standard ways of doing so, either
a) using indirect function calls with function pointers set at program
initialization; b) using manual dispatch at each callsite; b) using
features like IFUNC.  The dispatch mechanism needs to be promoted to
the language level and becomes the first class citizen;
2) Most importantly, the above non standard approaches block
interprocedural optimizations such as inlining. With the introduction
of buitlin_dispatch and the clear semantics, compiler can do more
aggressive optimization.

Multi-way dispatch will be good, but most of the use cases we have
seen is 2-way -- a fast path + a default path.

Who are the targeted consumer of the feature?

1) For developers who like to MV function manually;
2) For user directed function cloning

   e.g,
int foo (...) __attribute__((clone (sse)):

3) Automatic function cloning determined by compiler.



 I am not sure that combining the function choice and its invocation
 to a single builtin is good.  First, you use variadic arguments for
 the actual function arguments which will cause vararg promotions
 to apply and thus it will be impossible to dispatch to functions
 taking single precision float values (and dependent on ABI details
 many more similar issues).  Second, you restrict yourself to
 only two versions of a function - that looks quite limited and this
 restriction is not easy to lift with your proposed interface.

See above. Multi-way dispatch can be added similarly.



 That's a nice idea, but why not simply process all functions with
 a const attribute and no arguments this way?  IMHO

 int testsse (void) __attribute__((const));

 is as good and avoids the new attribute completely.  The pass would
 also benefit other uses of this idiom (giving a way to have global
 dynamic initializers in C).  The functions may not be strictly 'const'
 in a way the compiler autodetects this attribute but it presents
 exactly the promises to the compiler that allow this optimization.

 Thus, please split this piece out into a separate patch and use
 the const attribute.


Sounds good.


  What happens with cloning, -fclone-hot-version-paths ?
  -

 Now, here you lost me somewhat, because I didn't look into the
 patch details and I am missing an example on how the lowered
 IL looks before that cloning.  So for now I suppose this
 -fclone-hot-version-paths
 is to expose direct calls to the IL.  If you would lower __builtin_dispatch
 to a style like

  int sel = selector ();
  switch (sel)
     {
     case 0:
       fn = popcnt;
      break;
     case 1:
       fn = popcnt_sse4;
       break;
     }
   res = (*fn) (25);

 then rather than a new pass specialized for __builtin_dispatch existing
 optimization passes that are able to do tracing like VRP and DOM
 via jump-threading or the tracer pass should be able to do this
 optimization for you.  If they do not use FDO in a good way it is better
 to improve them for this case instead of writing a new pass.

What you describe may not work

1) the function selection may happen in a different function;
2) Compiler will need to hoist the selection into the program
initializer to avoid overhead

As an example of why dispatch hoisting and call chain cloning is needed:

void foo();
void bar();

void doit_v1();
void doit_v2();
bool check_v () __attribute__((const));

int test();


void bar()
{

for (.)
 {
 foo();
 
 }
}

void foo()
{
   ...
   for (...)
   {
  __builtin_dispatch(check_v, doit_v1, doit_v2,...);
 ...
   }
}


int test ()
{
..
   bar ();
}


The feature testing and dispatch is embedded in a 2-deep loop nest
crossing function boundaries. The call paths test -- bar -- foo
needs to be cloned. This is done by hoisting dispatch up the call
chain -- it ends up :


void bar_v1()
{
   
   for (..)
{
  foo_v1 ();
}
 ..
}

void bar_v2 ()
{
...
for (..)
{
  foo_v2();
}
   ..
}

void foo_v1 ()
{
   ..
   for ()
{
  doit_v1()
}
}

void foo_v2 ()
{
..
for ()
{
   doit_v2()
}
}

int test ()
{
 __builtin_dispatch(check_v, bar_v1, bar_v2);
..

}



Thanks,

David




 Note that we already have special FDO support for indirect to direct
 call promotion, so that might work already.

 Thus, with all this, __builtin_dispatch would be more like syntactic
 sugar to avoid writing above switch statement yourself.  If you restrict
 that sugar to a constant number of candidates it can be replaced by
 a macro (or several ones, specialized for the number of candidates).

 Richard.

  For the call graph that looks like this before cloning :

 func_cold  func_hot  findOnes  IsPopCntSupported  popcnt
                                                |
                            

Re: [google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

2011-04-29 Thread Sriraman Tallam
Hi Richard,

  Thanks for the comments. Please find inline responses.

On Fri, Apr 29, 2011 at 1:56 AM, Richard Guenther
richard.guent...@gmail.com wrote:

 On Fri, Apr 29, 2011 at 4:52 AM, Sriraman Tallam tmsri...@google.com wrote:
  I want this patch to be considered for google/main now. This is intended to 
  be submitted to trunk for review soon.
  This patch has been tested with crosstool bootstrap using buildit and by 
  running all tests.
 
 
  Patch Description :
  ==
 
  Frequently executed functions in programs are some times compiled
  into different versions to take advantage of some specific
  advanced features present in some of the types of platforms on
  which the program is executed. For instance, SSE4 versus no-SSE4.
  Existing techniques to do multi-versioning, for example using
  indirect calls, block compiler optimizations, mainly inlining.

 The general idea of supporting versioning is good, thanks for working on it.
 Comments below.

  This patch adds a new GCC pass to support multi-versioned calls
  through a new builtin, __builtin_dispatch.  The __builtin_dispatch
  call is converted into a simple if-then construct to call the
  appropriate version at run-time. There are no indirect calls so
  inlining is not affected.

 I am not sure that combining the function choice and its invocation
 to a single builtin is good.  First, you use variadic arguments for
 the actual function arguments which will cause vararg promotions
 to apply and thus it will be impossible to dispatch to functions
 taking single precision float values (and dependent on ABI details
 many more similar issues).  Second, you restrict yourself to
 only two versions of a function - that looks quite limited and this
 restriction is not easy to lift with your proposed interface.

 Thus, I would have kept regular (but indirect) calls in the source
 program and only exposed the dispatch via a builtin, like ...

  This patch also supports a function unswitching optimization via
  cloning of functions, only along hot paths, that call multi-versioned
  functions so that the hot multi-versioned paths can be independently
  optimized for performance. The cloning optimization is switched on
  via a flag, -fclone-hot-version-paths.
 
  Only two versions are supported in this patch. Example use :
 
   int popcnt_sse4(unsigned int x) __attribute__((__target__(popcnt)));
   int popcnt_sse4(unsigned int x)
   {
     int count = __builtin_popcount(x);
     return count;
   }
 
   int popcnt(unsigned int x) __attribute__((__target__(no-popcnt)));
   int popcnt(unsigned int x)
   {
     int count = __builtin_popcount(x);
     return count;
   }
 
   int testsse() __attribute__((version_selector));
   int main ()
   {
     ...
     /* The multi-versioned call. */
     ret = __builtin_dispatch (testsse,  (void*)popcnt_sse4, (void*)popcnt, 
  25);
     ...
   }

 int main()
 {
  int (*ppcnt)(unsigned int) = __builtin_dispatch (testsse,
 popcnt_sse4, popcnt);
  ret = (*ppcnt) (25);
 }

 which would allow the testsse function to return the argument number of
 the function to select.

 [snip]

   When to use the version_selector attribute ?
   ---
 
   Functions are marked with attribute version_selector only if
   they are run-time constants.  Example of such functions would
   be those that test if a specific feature is available on a
   particular architecture.  Such functions must return a positive
   integer. For two-way functions, those that test if a feature
   is present or not must return 1 or 0 respectively.
 
   This patch will make constructors that call these function once
   and assign the outcome to a global variable. From then on, only
   the value of the global will be used to decide which version to
   execute.

 That's a nice idea, but why not simply process all functions with
 a const attribute and no arguments this way?  IMHO

 int testsse (void) __attribute__((const));

 is as good and avoids the new attribute completely.  The pass would
 also benefit other uses of this idiom (giving a way to have global
 dynamic initializers in C).  The functions may not be strictly 'const'
 in a way the compiler autodetects this attribute but it presents
 exactly the promises to the compiler that allow this optimization.

Since, const functions cannot, not according to the definition atleast
,  read files or memory, I was thinking along the lines of making it a
 pure function (For example, The version_selector function could be
reading /proc/cpuinfo and looking for certain features). The reason
I invented a new attribute was because I assumed it is not legal to
hoist the call site of a pure function into a constructor and
substitute the value computed simply because it may not be valid to
call the pure function ahead of the intended time in the execution.

Basically, the semantics of the version_selector is that the value
returned is a run-time constant and that it can be