On Tue, Feb 11, 2020 at 12:03 PM Devulapalli, Raghuveer <raghuveer.devulapa...@intel.com> wrote:
> >> I think this doesn't quite answer the question. If I understand > correctly, it's about a single instruction (e.g. one needs "VEXP2PD" and > it's missing from the supported AVX512 instructions in master). I think > the answer is yes, it needs to be added for other architectures as well. > > > > That adds a lot of overhead to write SIMD based optimizations which can > discourage contributors. > Keep in mind that a new universal intrinsics instruction is just a bunch of defines. That is way less work than writing a ufunc that uses that instruction. We can also ping a platform expert in case it's not obvious what the corresponding arch-specific instruction is - that's a bit of a chicken-and-egg problem; once we get going we hopefully get more interested people that can help each other out. > It’s also an unreasonable expectation that a developer be familiar with > SIMD of all the architectures. On top of that the performance implications > aren’t clear. Software implementations of hardware instructions might > perform worse and might not even produce the same result. > I think you are worrying about writing ufuncs here, not about adding an instruction. If the same result is not produced, we have CI that should fail - and if it does, we can deal with that by (if it's not easy to figure out) making that platform fall back to the generic non-SIMD version of the ufunc. 
Cheers,
Ralf

> *From:* NumPy-Discussion <numpy-discussion-bounces+raghuveer.devulapalli=intel....@python.org> *On Behalf Of* Ralf Gommers
> *Sent:* Monday, February 10, 2020 9:17 PM
> *To:* Discussion of Numerical Python <numpy-discussion@python.org>
> *Subject:* Re: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics
>
> On Tue, Feb 4, 2020 at 2:00 PM Hameer Abbasi <einstein.edi...@gmail.com> wrote:
>
>> —snip—
>>
>>> 1) Once NumPy adds the framework and initial set of Universal
>>> Intrinsics, if contributors want to leverage a new architecture-specific
>>> SIMD instruction, will they be expected to add a software implementation
>>> of this instruction for all other architectures too?
>>
>> In my opinion, if the instructions are lower, then yes. For example, one
>> cannot add AVX-512 without also adding AVX-256, AVX-128 and SSE*.
>> However, I would not expect one person or team to be an expert in all
>> assemblies, so intrinsics for one architecture can be developed
>> independently of another.
>
> I think this doesn't quite answer the question. If I understand correctly,
> it's about a single instruction (e.g. one needs "VEXP2PD" and it's
> missing from the supported AVX512 instructions in master). I think the
> answer is yes, it needs to be added for other architectures as well.
> Otherwise, if universal intrinsics are added ad hoc and there's no
> guarantee that a universal instruction is available for all main supported
> platforms, then over time there won't be much that's "universal" about the
> framework.
>
> This is a different question, though, from adding a new ufunc
> implementation. I would expect accelerating ufuncs via intrinsics that are
> already supported to be much more common than having to add new
> intrinsics. Does that sound right?
>
>>> 2) On whom does the burden lie to ensure that new implementations are
>>> benchmarked and show benefits on every architecture?
>>> What happens if optimizing a ufunc leads to improving performance on
>>> one architecture and worsens performance on another?
>
> This is slightly hard to provide a recipe for. I suspect it may take a
> while before this becomes an issue, since we don't have much SIMD code to
> begin with. So adding new code with benchmarks will likely show
> improvements on all architectures (we should ensure benchmarks can be run
> via CI, otherwise it's too onerous). And if not, and it's not easily
> fixable, the problematic platform could be skipped so performance there is
> unchanged.
>
> Only once there are existing universal intrinsics and they're tweaked
> will we have to be much more careful, I'd think.
>
> Cheers,
>
> Ralf
>
>> I would look at this from a maintainability point of view. If we are
>> increasing the code size by 20% for a certain ufunc, there must be a
>> demonstrable 20% increase in performance on any CPU. That is to say,
>> micro-optimisation will be unwelcome, and code readability will be
>> preferable. Usually we ask the submitter of the PR to test the PR with a
>> machine they have on hand, and I would be inclined to keep this trend of
>> self-reporting. Of course, if someone else came along and reported a
>> performance regression of, say, 10%, then we have increased code by 20%,
>> with only a net 5% gain in performance, and the PR will have to be
>> reverted.
>>
>> —snip—
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion