[ https://issues.apache.org/jira/browse/ARROW-17783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17622376#comment-17622376 ]

Weston Pace commented on ARROW-17783:
-------------------------------------

FWIW, the particular check failing here isn't even looking for 64-byte 
alignment.  It is simply looking for type alignment (which is guaranteed by 
`malloc`).  The algorithm is not using any SIMD intrinsics.  It is comparing 
values between two byte buffers (of fixed-length arrays), given random indices, 
and doing so by casting the uint8_t* to uint64_t* (or whatever type is 
appropriate).  Roughly speaking...

{noformat}
*match = *(reinterpret_cast<const uint64_t*>(left.data()) + row_num) ==
         *(reinterpret_cast<const uint64_t*>(right.data()) + row_num);
{noformat}
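
For context, the type-alignment check that fails here boils down to a 
pointer-modulus test.  A minimal sketch (not Arrow's actual CheckAlignment 
implementation):

{noformat}
#include <cstddef>
#include <cstdint>

// Sketch only: a pointer is suitably aligned for a type when its address
// is a multiple of the type's alignment, e.g. alignof(uint64_t) == 8.
bool IsTypeAligned(const uint8_t* ptr, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}
{noformat}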

A safe alternative that does not rely on alignment would be:

{noformat}
*match = memcmp(left.data() + (row_num * 8), right.data() + (row_num * 8), 8) == 0;
{noformat}

I created a hasty benchmark:

{noformat}
#include <benchmark/benchmark.h>

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <numeric>
#include <random>
#include <vector>

void CompareWithMemcmp(const uint8_t* left, const uint8_t* right, uint8_t* out,
                       const int* indices, int length) {
  const int* idx_it = indices;
  const int* idx_end = indices + length;
  uint8_t* out_it = out;
  while (idx_it != idx_end) {
    *out_it = memcmp(left + (*idx_it * 8), right + (*idx_it * 8), 8) == 0;
    out_it++;
    idx_it++;
  }
}

void CompareWithCast(const uint8_t* left, const uint8_t* right, uint8_t* out,
                     const int* indices, int length) {
  const int* idx_it = indices;
  const int* idx_end = indices + length;
  uint8_t* out_it = out;
  while (idx_it != idx_end) {
    *out_it = *(reinterpret_cast<const uint64_t*>(left) + *idx_it) ==
              *(reinterpret_cast<const uint64_t*>(right) + *idx_it);
    out_it++;
    idx_it++;
  }
}

template <bool kUseCast, bool kAligned>
static void RandomCompare(benchmark::State& state) {  // NOLINT non-const reference
  constexpr int kNumElements = 10000;
  // MakeIntegers: test-data helper (definition not shown).  The extra
  // element keeps the unaligned (+4 bytes) variant in bounds.
  const std::vector<int64_t> left = MakeIntegers<int64_t>(kNumElements + 1);
  const std::vector<int64_t> right = MakeIntegers<int64_t>(kNumElements + 1);
  std::vector<int> indices(kNumElements);
  std::vector<uint8_t> matches(kNumElements);
  std::iota(indices.begin(), indices.end(), 0);
  std::default_random_engine gen(42);
  std::shuffle(indices.begin(), indices.end(), gen);
  const uint8_t* left_start = reinterpret_cast<const uint8_t*>(left.data());
  const uint8_t* right_start = reinterpret_cast<const uint8_t*>(right.data());
  if (!kAligned) {
    left_start += 4;
    right_start += 4;
  }
  for (auto _ : state) {
    if (kUseCast) {
      CompareWithCast(left_start, right_start, matches.data(), indices.data(),
                      kNumElements);
    } else {
      CompareWithMemcmp(left_start, right_start, matches.data(), indices.data(),
                        kNumElements);
    }
  }
}

static void RandomCompareMemcmpAligned(benchmark::State& state) {
  RandomCompare<false, true>(state);
}
static void RandomCompareMemcmpUnaligned(benchmark::State& state) {
  RandomCompare<false, false>(state);
}
static void RandomCompareCast(benchmark::State& state) {
  RandomCompare<true, true>(state);
}
{noformat}
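
For completeness: assuming Google Benchmark, the cases above get registered 
with the BENCHMARK macro, roughly:

{noformat}
BENCHMARK(RandomCompareMemcmpAligned);
BENCHMARK(RandomCompareMemcmpUnaligned);
BENCHMARK(RandomCompareCast);

BENCHMARK_MAIN();
{noformat}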

Results:

{noformat}
-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
RandomCompareMemcmpAligned         9593 ns         9592 ns        67730
RandomCompareMemcmpUnaligned      10189 ns        10187 ns        65334
RandomCompareCast                  9103 ns         9102 ns        81895
{noformat}

Now, I am sure some alternative can be devised that is both safe and 
performant (one candidate is sketched below).  However, this seems a burden.  
It's a burden to develop safe algorithms.  It's a burden to ensure that we 
unit test thoroughly enough (e.g. fuzz with random unaligned buffers) to avoid 
segmentation faults.  If someone is willing to do this work then I'm happy to 
consider an alternative proposal.  However, I'd rather see more development, 
done safely by manually aligning buffers, than spend time supporting a use 
case without a clear, solid need (e.g. it seems we can fix this particular 
case and, even if we can't, does the performance hit of forced alignment 
matter here?).
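
If someone does pick this up, one candidate (a sketch only; not benchmarked 
above) is a memcpy-based load: copying into a local uint64_t is well-defined 
at any alignment, and compilers typically lower a fixed-size memcpy to a 
single (possibly unaligned) load where the ISA supports one.

{noformat}
#include <cstdint>
#include <cstring>

// Sketch: alignment-safe 8-byte load via memcpy.
inline uint64_t LoadU64(const uint8_t* p) {
  uint64_t v;
  std::memcpy(&v, p, sizeof(v));  // defined behavior regardless of alignment
  return v;
}

void CompareWithMemcpyLoad(const uint8_t* left, const uint8_t* right,
                           uint8_t* out, const int* indices, int length) {
  for (int i = 0; i < length; ++i) {
    const int row = indices[i];
    out[i] = LoadU64(left + row * 8) == LoadU64(right + row * 8);
  }
}
{noformat}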

> [C++] Aggregate kernel should not mandate alignment
> ---------------------------------------------------
>
>                 Key: ARROW-17783
>                 URL: https://issues.apache.org/jira/browse/ARROW-17783
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 6.0.0, 8.0.0
>            Reporter: Yifei Yang
>            Assignee: Weston Pace
>            Priority: Major
>         Attachments: flight-alignment-test.zip
>
>
> When using arrow's aggregate kernel with a table transferred via arrow flight 
> (DoGet), it may crash at arrow::util::CheckAlignment(). However, using the 
> original data it works well; also, if I first serialize the transferred table 
> into bytes and then recreate an arrow table from those bytes, it works well.
> The attached "flight-alignment-test" is the minimal test that reproduces the 
> issue; it basically does "sum(total_revenue) group by l_suppkey" on the table 
> from "DoGet()". ("DummyNode" is just there to be the producer of the 
> aggregate node, as a producer is required to create the aggregate node.)


