[
https://issues.apache.org/jira/browse/ARROW-17783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17622376#comment-17622376
]
Weston Pace commented on ARROW-17783:
-------------------------------------
FWIW, the particular check failing here isn't even looking for 64-byte
alignment. It is simply looking for type alignment (which is guaranteed by
`malloc`). The algorithm is not using an SIMD intrinsics. It is comparing
values between two byte buffers (of fixed length arrays), given random indices,
and doing so by casting the uint8_t* to uint64_t* (or whatever is appropriate).
Roughly speaking...
{noformat}
*match = *(reinterpret_cast<const uint64_t*>(left.data()) + row_num) ==
*(reinterpret_cast<const uint64_t*>(right.data()) + row_num);
{noformat}
A safe alternative, that does not rely on alignment, would be:
{noformat}
*match = memcmp(left.data() + (row_num * 8), right.data() + (row_num * 8), 8)
== 0;
{noformat}
I created a hasty benchmark:
{noformat}
void CompareWithMemcmp(const uint8_t* left, const uint8_t* right, uint8_t* out,
const int* indices, int length) {
const int* idx_it = indices;
const int* idx_end = indices + length;
uint8_t* out_it = out;
while (idx_it != idx_end) {
*out_it = memcmp(left + (*idx_it * 8), right + (*idx_it * 8), 8) == 0;
out_it++;
idx_it++;
}
}
void CompareWithCast(const uint8_t* left, const uint8_t* right, uint8_t* out,
const int* indices, int length) {
const int* idx_it = indices;
const int* idx_end = indices + length;
uint8_t* out_it = out;
while (idx_it != idx_end) {
*out_it = *(reinterpret_cast<const uint64_t*>(left) + *idx_it) ==
*(reinterpret_cast<const uint64_t*>(right) + *idx_it);
out_it++;
idx_it++;
}
}
template <bool kUseCast, bool kAligned>
static void RandomCompare(benchmark::State& state) { // NOLINT non-const
reference
constexpr int kNumElements = 10000;
const std::vector<int64_t> left = MakeIntegers<int64_t>(kNumElements + 1);
const std::vector<int64_t> right = MakeIntegers<int64_t>(kNumElements + 1);
std::vector<int> indices(kNumElements);
std::vector<uint8_t> matches(kNumElements);
std::iota(indices.begin(), indices.end(), 0);
std::default_random_engine gen(42);
std::shuffle(indices.begin(), indices.end(), gen);
const uint8_t* left_start = reinterpret_cast<const uint8_t*>(left.data());
const uint8_t* right_start = reinterpret_cast<const uint8_t*>(right.data());
if (!kAligned) {
left_start += 4;
right_start += 4;
}
for (auto _ : state) {
if (kUseCast) {
CompareWithCast(left_start, right_start, matches.data(), indices.data(),
kNumElements);
} else {
CompareWithMemcmp(left_start, right_start, matches.data(), indices.data(),
kNumElements);
}
}
}
static void RandomCompareMemcmpAligned(benchmark::State& state) {
RandomCompare<false, true>(state);
}
static void RandomCompareMemcmpUnaligned(benchmark::State& state) {
RandomCompare<false, false>(state);
}
static void RandomCompareCast(benchmark::State& state) {
RandomCompare<true, true>(state);
}
{noformat}
Results:
{noformat}
-----------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------
RandomCompareMemcmpAligned 9593 ns 9592 ns 67730
RandomCompareMemcmpUnaligned 10189 ns 10187 ns 65334
RandomCompareCast 9103 ns 9102 ns 81895
{noformat}
Now, I am sure there is some alternative that can be devised, that is both safe
and performant. However, this seems a burden. It's a burden to develop safe
algorithms. It's a burden to ensure that we unit test thoroughly enough (e.g.
fuzz with random unaligned buffers) to avoid segmentation fault. If someone is
willing to doing this work then I'm happy to consider an alternative proposal.
However, I'd rather see more development, in a safe fashion, by manually
aligning buffers, than spend time supporting a use case that it is not clear
has a solid need (e.g. it seems we can fix this particular case and, even if we
can't, does the performance hit of forced-alignment matter in this case?).
> [C++] Aggregate kernel should not mandate alignment
> ---------------------------------------------------
>
> Key: ARROW-17783
> URL: https://issues.apache.org/jira/browse/ARROW-17783
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 6.0.0, 8.0.0
> Reporter: Yifei Yang
> Assignee: Weston Pace
> Priority: Major
> Attachments: flight-alignment-test.zip
>
>
> When using arrow's aggregate kernel with table transferred from arrow flight
> (DoGet), it may crash at arrow::util::CheckAlignment(). However using
> original data it works well, also if I first serialize the transferred table
> into bytes then recreate an arrow table using the bytes, it works well.
> "flight-alignment-test" attached is the minimal test that can produce the
> issue, which basically does "sum(total_revenue) group by l_suppkey" using the
> table from "DoGet()". ("DummyNode" is just used to be the producer of the
> aggregate node as the producer is required to create the aggregate node)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)