I did a quick benchmark of accessing a long buffer that is not 8-byte aligned. Under the right conditions, it does show that unaligned access has some penalty over aligned access. But I don't think this is an issue in practice.

Please be very skeptical of this benchmark. It's hard to get it right given the complexity of the hardware, compiler, benchmark tool and environment.

https://quick-bench.com/q/GmyqRk6saGfRu8XnMUyoSXs4SCk
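
For readers who want to try the same kind of comparison locally, here is a minimal Rust sketch. It is not the code behind the link above. The helper names (`as_bytes`, `sum_u64_at`) and the buffer size are assumptions made for illustration, and the single `Instant` measurement is far cruder than a real benchmark harness. It reads 8-byte integers from a byte buffer starting either at an 8-byte-aligned offset (0) or at a misaligned one (1).

```rust
use std::time::Instant;

/// View a slice of u64 as raw bytes; the start of the view is 8-byte aligned.
fn as_bytes(words: &[u64]) -> &[u8] {
    // SAFETY: every u64 is 8 initialized bytes, and u8 has alignment 1.
    unsafe { std::slice::from_raw_parts(words.as_ptr() as *const u8, words.len() * 8) }
}

/// Sum `count` native-endian u64 values read from `bytes`, starting at byte `offset`.
/// When `offset` is not a multiple of 8, every 8-byte load is unaligned.
fn sum_u64_at(bytes: &[u8], offset: usize, count: usize) -> u64 {
    bytes[offset..offset + count * 8]
        .chunks_exact(8)
        .map(|chunk| {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(chunk);
            u64::from_ne_bytes(buf)
        })
        .fold(0u64, |acc, v| acc.wrapping_add(v))
}

fn main() {
    // 2^20 values (8 MiB); an arbitrary size chosen for this sketch.
    let words: Vec<u64> = (0..(1u64 << 20)).collect();
    let bytes = as_bytes(&words);
    // Leave one word of slack so the misaligned read stays in bounds.
    let count = words.len() - 1;

    for &offset in [0usize, 1].iter() {
        let start = Instant::now();
        let total = sum_u64_at(bytes, offset, count);
        println!("offset {}: sum = {}, elapsed = {:?}", offset, total, start.elapsed());
    }
}
```

Build with optimizations (`--release`) and repeat the runs before reading anything into the timings.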


On 9/7/21 7:55 AM, Micah Kornfield wrote:

My own impression is that the emphasis may be slightly exaggerated. But
perhaps some other benchmarks would prove differently.


This is probably true.  [1] is the original mailing list discussion.  I
think the lack of measurable differences and the high overhead of 64-byte
alignment were the reasons for relaxing to 8-byte alignment.

Specifically, I performed two types of tests: a "random sum", where we
compute the sum of the values taken at random indices, and a "sum", where we
sum all values of the array (buffer[1] of the primitive array), both for
arrays ranging from 2^10 to 2^25 elements. I was expecting that, at least in
the latter, prefetching would help, but I do not observe any difference.
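
The two access patterns described above can be sketched in Rust as follows. This is only an illustration, not the benchmark that was actually run: the function names, the `i32` element type, the LCG index generator (`lcg_indices`) and the sizes used in `main` are all assumptions.

```rust
/// "sum": add up every value in order (sequential access).
fn sum(values: &[i32]) -> i64 {
    values.iter().map(|&v| v as i64).sum()
}

/// "random sum": add up the values found at the given indices (random access).
fn random_sum(values: &[i32], indices: &[usize]) -> i64 {
    indices.iter().map(|&i| values[i] as i64).sum()
}

/// Hypothetical helper: `count` pseudo-random indices below `len`,
/// produced by a simple linear congruential generator (std only).
fn lcg_indices(len: usize, count: usize, seed: u64) -> Vec<usize> {
    assert!(len > 0);
    let mut state = seed;
    (0..count)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            ((state >> 33) as usize) % len
        })
        .collect()
}

fn main() {
    // Only 2^10 and 2^20 elements here to keep memory use small;
    // the tests described above went up to 2^25.
    for &exp in [10u32, 20].iter() {
        let len = 1usize << exp;
        let values: Vec<i32> = (0..len as i32).collect();
        let indices = lcg_indices(len, len, 42);
        println!(
            "2^{}: sum = {}, random sum = {}",
            exp,
            sum(&values),
            random_sum(&values, &indices)
        );
    }
}
```

To compare alignments, the same two functions would be run over buffers allocated with different alignments and timed with a proper benchmark harness.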


The most likely place I think where this could make a difference would be
for operations on wider types (Decimal128 and Decimal256).   Another place
where I think alignment could help is when adding two primitive arrays (it
sounds like this was summing a single array?).

[1]
https://lists.apache.org/thread.html/945b65fb4bc8bcdab695b572f9e9c2dca4cd89012fdbd896a6f2d886%401460092304%40%3Cdev.arrow.apache.org%3E

On Mon, Sep 6, 2021 at 3:05 PM Antoine Pitrou <anto...@python.org> wrote:


On 06/09/2021 at 23:20, Jorge Cardoso Leitão wrote:
Thanks a lot Antoine for the pointers. Much appreciated!

Generally, it should not hurt to align allocations to 64 bytes anyway,
since you are generally dealing with large enough data that the
(small) memory overhead doesn't matter.

Not for performance. However, 64-byte alignment in Rust requires
maintaining a custom container and a custom allocator, and it rules out
interoperating with `std::Vec` and the ecosystem that is based on it, since
std::Vec allocates with the alignment of T (e.g. int32), not 64 bytes. For
anyone interested, the background for this is this old PR [1] and this in
arrow2 [2].
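
To make the alignment point concrete, here is a small Rust sketch using only the standard library. The function name `alloc_is_64_byte_aligned` is hypothetical and this is not code from arrow2. It prints the alignment a `Vec<i32>` allocation is guaranteed to have, then makes a 64-byte-aligned allocation by hand with `std::alloc`.

```rust
use std::alloc::{alloc, dealloc, handle_alloc_error, Layout};
use std::mem::{align_of, size_of};

/// Allocate room for `len` i32 values with 64-byte alignment, check the
/// address, and free the memory again. Returns whether the pointer was
/// 64-byte aligned.
fn alloc_is_64_byte_aligned(len: usize) -> bool {
    assert!(len > 0, "zero-sized allocations are not allowed with `alloc`");
    let layout = Layout::from_size_align(len * size_of::<i32>(), 64)
        .expect("size and alignment form a valid layout");
    unsafe {
        let ptr = alloc(layout);
        if ptr.is_null() {
            handle_alloc_error(layout);
        }
        let aligned = (ptr as usize) % 64 == 0;
        // The memory must be released with the same layout it was allocated
        // with. `Vec::from_raw_parts` requires the allocation to use the
        // alignment of T, so this pointer cannot be handed to a Vec<i32>.
        dealloc(ptr, layout);
        aligned
    }
}

fn main() {
    // A Vec<i32> allocation is only guaranteed to have i32's alignment.
    println!("align_of::<i32>() = {}", align_of::<i32>());
    println!(
        "manual allocation 64-byte aligned: {}",
        alloc_is_64_byte_aligned(1024)
    );
}
```

The layout mismatch noted in the comment is what forces a custom container around such an allocation.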

I see. In the C++ implementation, we are not compatible with the default
allocator either (but C++ allocators as defined by the standard library
don't support resizing, which doesn't make them terribly useful for
Arrow anyway).

Neither I in micro-benchmarks nor Ritchie from polars (query engine) in
large-scale benchmarks observed any difference on the architectures we have
available.
This is not consistent with the emphasis we put on the memory alignments
discussion [3], and I am trying to understand the root cause for this
inconsistency.

My own impression is that the emphasis may be slightly exaggerated. But
perhaps some other benchmarks would prove differently.

By prefetching I mean implicit; no intrinsics involved.

Well, I'm not aware that implicit prefetching depends on alignment.

Regards

Antoine.

