On Tue, May 12, 2020 at 9:47 PM Yibo Cai <yibo....@arm.com> wrote:
>
> Thanks Wes, I'm glad to see this feature coming.
>
> From past discussions, the main concern is that a runtime dispatcher may 
> cause performance issues.
> Personally, I don't think it's a big problem. If we're using SIMD, it must be 
> targeting time-consuming code.
>
> But we do need to take care of some issues. E.g., I see code like this:
> for (int i = 0; i < n; ++i) {
>    simd_code();
> }
> With a runtime dispatcher, this becomes an indirect function call on each 
> iteration.
> We should change the code to move the loop inside simd_code().
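
Right, and to make that cheap, the hot loop should live inside the
dispatched function so the dispatch cost is paid once per array/batch
rather than once per element. A minimal sketch of that shape (hypothetical
names, not actual Arrow APIs):

    #include <cstdint>

    using AddOneKernel = void (*)(const int32_t*, int32_t*, int64_t);

    // The kernel owns the whole loop; an AVX2/NEON variant would have the
    // same signature with a vectorized body, compiled in its own file.
    void AddOneScalar(const int32_t* in, int32_t* out, int64_t n) {
      for (int64_t i = 0; i < n; ++i) {
        out[i] = in[i] + 1;
      }
    }

    void AddOne(const int32_t* in, int32_t* out, int64_t n,
                AddOneKernel kernel) {
      kernel(in, out, n);  // one indirect call for the whole batch
    }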

To be clear, I'm referring to SIMD-optimized code that operates on
batches of data. The overhead of choosing an implementation based on a
global settings object should not be meaningful. If there is
performance-sensitive code at inline call sites then I agree that it
is an issue. I don't think that characterizes most of the anticipated
work in Arrow, though, since functions will generally process a
chunk/array of data at a time (see, e.g., the recent Parquet
encoding/decoding work).
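
For the selection itself, I have something along these lines in mind -- a
hedged sketch only, where the AVX2 kernel is a placeholder and
__builtin_cpu_supports is the GCC/Clang x86 builtin, standing in for
whatever CpuInfo-style check we end up using:

    #include <cstdint>

    using SumKernel = int64_t (*)(const int32_t*, int64_t);

    int64_t SumScalar(const int32_t* values, int64_t n) {
      int64_t sum = 0;
      for (int64_t i = 0; i < n; ++i) sum += values[i];
      return sum;
    }

    // Placeholder for a variant built with AVX2 enabled in a separate
    // translation unit.
    int64_t SumAvx2Placeholder(const int32_t* values, int64_t n) {
      return SumScalar(values, n);
    }

    SumKernel SelectSum() {
    #if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
      __builtin_cpu_init();  // populate the CPU feature flags
      if (__builtin_cpu_supports("avx2")) {
        return SumAvx2Placeholder;
      }
    #endif
      return SumScalar;  // portable fallback (what non-x86 builds would use)
    }

    int64_t Sum(const int32_t* values, int64_t n) {
      static const SumKernel kernel = SelectSum();  // resolved once per process
      return kernel(values, n);
    }

The point is that the choice is made once (or cached), so the per-batch
overhead is a single indirect call regardless of how many SIMD levels we
compile in.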

> It would be better if you could also consider architectures other than x86 
> (at the framework level).
> Ignore this if it costs too much effort; we can always improve later.
>
> Yibo
>
> On 5/13/20 9:46 AM, Wes McKinney wrote:
> > hi,
> >
> > We've started to receive a number of patches providing SIMD operations
> > for both x86 and ARM architectures. Most of these patches make use of
> > compiler definitions to toggle between code paths at compile time.
> >
> > This is problematic for a few reasons:
> >
> > * Binaries that are shipped (e.g. in Python) must generally be
> > compiled for a broad set of supported processors. That means that AVX2
> > / AVX512 optimizations won't be available in these builds even on
> > processors that have them
> > * Poses a maintainability and testing problem (hard to test every
> > combination, and it is not practical for local development to compile
> > every combination, which may cause drawn-out test/CI/fix cycles)
> >
> > Other projects (e.g. NumPy) have taken the approach of building
> > binaries that contain multiple variants of a function with different
> > levels of SIMD, and then choosing at runtime which one to execute
> > based on what features the CPU supports. This seems like what we
> > ultimately need to do in Apache Arrow, and if we continue to accept
> > patches that do not do this, it will be much more work later when we
> > have to refactor things to runtime dispatching.
> >
> > We have some PRs in the queue related to SIMD. Without taking a
> > heavy-handed approach like starting to veto PRs, how would everyone like to
> > begin to address the runtime dispatching problem?
> >
> > Note that the Kernels revamp project I am working on right now will
> > also facilitate runtime SIMD kernel dispatching for array expression
> > evaluation.
> >
> > Thanks,
> > Wes
> >
