On 2013-09-08, at 16:04 , Pekka Jääskeläinen <[email protected]> wrote:
> On 09/08/2013 10:13 PM, Erik Schnetter wrote:
>> Yes, we can use the same kernel library for all x86-64 architectures.
>> However, this would require disabling many performance features.
>
> Which ones do you mean? The kernel libs are LLVM bitcode, which is not
> yet very target-specific as such, so the target-specific features are
> basically inline asm blocks selected using #ifdefs when building the
> bc?
>
> IMO inline asm blocks should be avoided in the longer term anyways (try to use
> intrinsics instead) as we want to vectorize WGs as efficiently as possible
> and it's easier to vectorize intrinsics calls than inline asm blocks.

The asm inline statements are gone (in Vecmathlib); it's all done via intrinsics and Clang extensions.

> Same goes to vector datatypes: we might want to "scalarize" them for
> more efficient WG vectorization, so not always we want to use hand coded
> vectorized versions of functions dealing with them.
>
> So, what about producing a "generic" bitcode lib without CPU feature
> specific inline asm blocks (perhaps only intrinsics calls), and then let
> the llc do its magic based on autodetection? The very final call to
> the llc from the fully linked work group function bitcode should be of
> the most importance here, right?

Some CPU attributes influence the ABI. These need to be set correctly at all times, otherwise the executable won't work. This influences e.g. the calling conventions for functions, which is explicitly represented in bytecode. That is, a fully generic bytecode library is not possible, but we may be able to get away with using just a few per architecture.

One would probably also need to make sure that earlier optimizations don't already expand builtins, since a different CPU may offer a more efficient implementation in terms of a CPU instruction that exists only on some CPUs (e.g. popcount, clz).

Apart from this -- implementing the kernel library purely with scalar functions and builtins is possible.

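To illustrate the point about builtins, here is a minimal sketch in C. It is not code from pocl or Vecmathlib; the function names popcount_expanded and popcount_builtin are made up for this example, and __builtin_popcount is the GCC/Clang builtin. The first function is the kind of portable bit-twiddling sequence that a builtin might already have been expanded into; the second keeps the builtin call, so the backend may still be able to select a native instruction when the final target CPU has one.

```c
#include <stdint.h>

/* Hypothetical name: popcount already expanded into portable bit operations.
   Once the code looks like this, a later pass is less likely to map it back
   to a single CPU instruction. */
static inline uint32_t popcount_expanded(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit partial sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit partial sums */
    x = (x + (x >> 4)) & 0x0f0f0f0fu;                 /* 8-bit partial sums */
    return (x * 0x01010101u) >> 24;                   /* add the four bytes */
}

/* Hypothetical name: the operation kept as a compiler builtin, leaving the
   choice of implementation to the code generator for the final target. */
static inline uint32_t popcount_builtin(uint32_t x)
{
    return (uint32_t)__builtin_popcount(x);
}
```

Both functions compute the same value; the difference is only in how much freedom is left to the final code generation step.
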
We would have to experiment with how to present this to the vectorizer to make things as easy as possible. Currently, we split e.g. int16 into two int8 operations; this is a nicely recursive implementation, but the vectorizer may prefer a loop instead. I should introduce an option to Vecmathlib to do this. This would easily allow comparing performance, and could give hints to shortcomings of the vectorizer (and conversely, of Vecmathlib) that could then be addressed.

-erik

-- 
Erik Schnetter <[email protected]>
http://www.perimeterinstitute.ca/personal/eschnetter/

My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://pgp.mit.edu/.
_______________________________________________
pocl-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pocl-devel
