Hi Sasha,

Thanks for the advice. I didn't quite catch your point, though. Would you explain 
the purpose of this proposal a bit more?

We do prefer compiler auto-vectorization to explicit SIMD code, even if the C++ 
code is somewhat slower than its SIMD counterpart (a 20% gap is acceptable, IMO). 
And we do already support runtime dispatch of kernels based on the target machine 
architecture.

What is left to discuss, then, is how to deal with code that is not 
auto-vectorizable but can be manually optimized with SIMD instructions. It looks 
like your proposal is to do nothing more than add the appropriate compiler flags 
and wait for compilers to become smarter in the future. I think this is a 
reasonable approach, probably in many cases. But when we do want to manually tune 
the code, I believe a SIMD library is the best way.

To me there is no "replacing" xsimd with auto-vectorization; they each do 
their own jobs.

Yibo

-----Original Message-----
From: Sasha Krassovsky <krassovskysa...@gmail.com>
Sent: Wednesday, March 30, 2022 6:58 AM
To: dev@arrow.apache.org; emkornfi...@gmail.com
Subject: Re: [C++] Replacing xsimd with compiler autovectorization

xsimd has three problems I can think of right now:
1) xsimd code looks like normal SIMD code: you have to do loads and stores 
explicitly, explicitly unroll and stride through your loop, and explicitly 
process the tail of the loop. This makes writing a large number of kernels 
extremely tedious and error-prone. In comparison, a single three-line scalar 
for loop is easier to both read and write.
2) xsimd limits the freedom an optimizer has to select instructions and perform 
other optimizations, as it is just a thin wrapper over normal intrinsics.
One concrete example: if we wanted to take advantage of the dynamic 
instruction-set dispatch xsimd offers, the loop strides would no longer be 
compile-time constants, which might prevent the compiler from unrolling the 
loop (how would it know the stride isn't just 1?).
3) Lastly, if we ever want to support a new architecture (like Power9 or 
RISC-V), we'd have to wait for an xsimd backend to become available. On the 
other hand, if SiFive came out with a hot new chip supporting RV64V, all we'd 
have to do to support it is to add the appropriate compiler flag into the 
CMakeLists.

As for using an external build system, I'm not sure how much complexity this 
would add, but at the very least I suspect it would work out of the box if you 
only wanted to support scalar kernels. Otherwise, I don't think it would add 
much more complexity than we currently have for detecting architectures at 
build time.

Sasha

On Tue, Mar 29, 2022 at 3:26 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Hi Sasha,
> Could you elaborate on the problems of the XSIMD dependency?  What you
> describe sounds a lot like what XSIMD provides in a prepackaged form
> and without the extra CMake magic.
>
> I occasionally have to build Arrow with an external build system, and
> it sounds like this type of logic could add complexity there.
>
> Thanks,
> Micah
>
> On Tue, Mar 29, 2022 at 3:14 PM Sasha Krassovsky <
> krassovskysa...@gmail.com>
> wrote:
>
> > Hi everyone,
> > I've noticed that we include xsimd as an abstraction over all of the
> > simd architectures. I'd like to propose a different solution which
> > would
> result
> > in fewer lines of code, while being more readable.
> >
> > My thinking is that anything simple enough to abstract with xsimd
> > can be autovectorized by the compiler. Any more interesting SIMD
> > algorithm
> usually
> > is tailored to the target instruction set and can't be abstracted
> > away
> with
> > xsimd anyway.
> >
> > With that in mind, I'd like to propose the following strategy:
> > 1. Write a single source file with simple, element-at-a-time for
> > loop implementations of each function.
> > 2. Compile this same source file several times with different
> > compile
> flags
> > for different vectorization (e.g. if we're on an x86 machine that
> supports
> > AVX2 and AVX512, we'd compile once with -mavx2 and once with -mavx512vl).
> > 3. Functions compiled with different instruction sets can be
> differentiated
> > by a namespace, which gets defined during the compiler invocation.
> > For example, for AVX2 we'd invoke the compiler with -DNAMESPACE=AVX2
> > and then for something like elementwise addition of two arrays, we'd
> > call arrow::AVX2::VectorAdd.
> >
> > I believe this would let us remove xsimd as a dependency while also
> giving
> > us lots of vectorized kernels at the cost of some extra cmake magic.
> After
> > that, it would just be a matter of making the function registry
> > point to these new functions.
> >
> > Please let me know your thoughts!
> >
> > Thanks,
> > Sasha Krassovsky
> >
>
