Hi,

Ricardo Wurmus <rek...@elephly.net> skribis:

> I was wondering how we should go about optionally building software for
> more advanced CPU features.  Currently, we build software for the lowest
> common feature set among x86_64 CPUs.  That’s good for portability, but
> not so good for performance.
>
> Enabling CPU features often happens through configure flags, but
> expressing support at that level in our package definitions seems bad.
> How can we make it possible for users to build their software for
> different CPUs?

To some extent, I think this is a compiler/OS/upstream issue.  By that I
mean that the best way to achieve use of extra CPU features is by using
the “IFUNC” feature of GNU ld.so, which is what libc does (it has
variants of strcmp etc. tweaked for various CPU extensions like SSE, and
the right one gets picked up at load time.)

Software like GMP, Nettle, or MPlayer also does this kind of selection
at run time, but using custom mechanisms.

GCC now has a ‘target_clones’ function attribute, which instructs it to
generate several variants of a function and use IFUNC to pick up the
right one (info "(gcc) Common Function Attributes").  Ideally, upstream
would use this.

When upstream does that, we have portable-yet-efficient “fat” binaries,
and there’s nothing to do on our side.  :-)

> We can cross-compile for other architectures on the command line with
> “--target” and “--system”; can we allow for compilation with special CPU
> features across the graph with “--features”?  Build system abstractions
> or package definitions would then be changed to recognize these features
> and modify the corresponding flags as needed.

I’ve considered this, but designing this would be tricky, and not quite
right IMO.

There’s probably scientific software out there that can benefit from
using the latest SSE/AVX/whatever extension, and yet doesn’t use any of
the tricks above.
When we find such a piece of software, I think we should investigate and
(1) see whether it actually benefits from those ISA extensions, and (2)
see whether it would be feasible to just use ‘target_clones’ or similar
on the hot spots.

If it turns out that this approach doesn’t scale or isn’t suitable, then
we can think more about what you suggest.  But before starting such an
endeavor, I would really like to get a better understanding of the
software we’re talking about and the options that we have.

WDYT?

Thanks,
Ludo’.
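
PS: For concreteness, here is a minimal sketch of what using
‘target_clones’ on a hot spot might look like.  The function ‘dot’, the
‘MULTIVERSIONED’ macro, and the choice of AVX2 are made up for
illustration and not taken from any particular package; I’m also
assuming GCC (or a compatible compiler) on x86_64 GNU/Linux, where IFUNC
is available.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: a GCC-compatible compiler on x86_64 GNU/Linux, where IFUNC
   is available.  Elsewhere the macro expands to nothing and we simply
   get a single baseline version of the function.  */
#if defined (__GNUC__) && defined (__x86_64__) && defined (__linux__)
# define MULTIVERSIONED __attribute__ ((target_clones ("avx2", "default")))
#else
# define MULTIVERSIONED
#endif

/* Hypothetical hot spot: a plain dot product.  With 'target_clones', the
   compiler emits one clone built for AVX2 and one "default" clone built
   for the baseline ISA, plus a resolver that chooses between them via
   IFUNC based on the CPU the program actually runs on.  */
MULTIVERSIONED
double
dot (const double *a, const double *b, size_t n)
{
  assert (n == 0 || (a != NULL && b != NULL));

  double sum = 0.0;
  for (size_t i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}
```

Callers just write ‘dot (a, b, n)’ as usual; no configure flag or
package-level switch is involved, and the same binary still runs on CPUs
without AVX2.  Whether the AVX2 clone is actually faster for a given
loop is exactly the kind of thing point (1) above asks us to measure.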