Re: [PR] GH-38490: [C++][Compute] Support ChunkedArray sorting for dictionary type [arrow]

via GitHub Mon, 06 Nov 2023 05:21:39 -0800


js8544 commented on code in PR #38502:
URL: https://github.com/apache/arrow/pull/38502#discussion_r1383103558



##########
cpp/src/arrow/array/array_dict.h:
##########
@@ -111,6 +112,14 @@ class ARROW_EXPORT DictionaryArray : public Array {
   /// value is null or out-of-bounds.
   int64_t GetValueIndex(int64_t i) const;
 
+  std::string GetView(int64_t i) const {
+    // Obtain the index at position i
+    const auto index = GetValueIndex(i);
+
+    // Assuming the dictionary is a String

Review Comment:
   We can't really assume the dictionary is a string here as Arrow's Dictionary 
type can support any type.



##########
cpp/src/arrow/compute/kernels/codegen_internal.h:
##########
@@ -187,6 +187,14 @@ struct GetOutputType<Decimal256Type> {
   using T = Decimal256;
 };
 
+template <>
+struct GetViewType<DictionaryType> {

Review Comment:
   Again, we can't assume a string dictionary. It's probably necessary to 
dispatch your implementation based on the concrete dictionary type. See the 
[visitor 
pattern](https://arrow.apache.org/docs/cpp/datatypes.html#visitor-pattern) and 
the 
[`GetArraySorter`](https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/vector_array_sort.cc#L665)
 function for an example.



##########
cpp/src/arrow/array/array_dict.h:
##########
@@ -111,6 +112,14 @@ class ARROW_EXPORT DictionaryArray : public Array {
   /// value is null or out-of-bounds.
   int64_t GetValueIndex(int64_t i) const;
 
+  std::string GetView(int64_t i) const {
+    // Obtain the index at position i
+    const auto index = GetValueIndex(i);
+
+    // Assuming the dictionary is a String
+    return dictionary()->GetScalar(index).ValueOrDie()->ToString();

Review Comment:
   We don't use `ValueOrDie` except in tests or benchmarks because we don't 
want users' program to crash when they depend on Arrow. We generally use 
`ARROW_ASSIGN_OR_RAISE` and `ARROW_RETURN_NOT_OK` to propagate the `Status` to 
the caller.



##########
cpp/src/arrow/compute/kernels/vector_sort.cc:
##########
@@ -190,6 +236,67 @@ class ChunkedArraySorter : public TypeVisitor {
   NullPartitionResult* output_;
 };
 
+template <>
+void ChunkedArraySorter::MergeNonNulls<DictionaryType>(
+    uint64_t* range_begin, uint64_t* range_middle, uint64_t* range_end,
+    const std::vector<const Array*>& arrays, uint64_t* temp_indices) {
+  const ChunkedArrayResolver left_resolver(arrays);
+  const ChunkedArrayResolver right_resolver(arrays);
+
+  // concatenate all dictionary arrays to calculate rank
+  ArrayVector dicts_array;
+  for (const auto& array : arrays) {
+    const auto& dicts = checked_cast<const DictionaryArray&>(*array);
+    dicts_array.push_back(dicts.dictionary());
+  }
+  auto concat_dict = Concatenate(dicts_array).ValueOrDie();
+  auto ranks = RanksWithNulls(concat_dict).ValueOrDie();
+
+  // build unified rank map using dicts and rank array
+  std::unordered_map<std::string, int64_t> unified_rank_map;
+  for (int i = 0; i < concat_dict->length(); i++) {
+    auto dict = concat_dict->GetScalar(i).ValueOrDie();

Review Comment:
   We tend to avoid GetScalar because it's expensive (It unwraps the array as 
its real type, gets the value, and wraps it as the scalar class again). Instead 
we often do `checked_cast<ConcreteArrayType&>(*arr)->GetView(i)` to directly 
get its ctype value.



##########
cpp/src/arrow/compute/kernels/vector_sort.cc:
##########
@@ -115,8 +118,13 @@ class ChunkedArraySorter : public TypeVisitor {
       };
       auto merge_non_nulls = [&](uint64_t* range_begin, uint64_t* range_middle,
                                  uint64_t* range_end, uint64_t* temp_indices) {
-        MergeNonNulls<ArrayType>(range_begin, range_middle, range_end, arrays,
-                                 temp_indices);
+        if (is_dictionary(*physical_type_)) {
+          MergeNonNulls<DictionaryType>(range_begin, range_middle, range_end, 
arrays,

Review Comment:
   Can we merge these two branches? Is DictionaryType different from ArrayType?



##########
cpp/src/arrow/compute/kernels/vector_sort.cc:
##########
@@ -179,6 +187,44 @@ class ChunkedArraySorter : public TypeVisitor {
     std::copy(temp_indices, temp_indices + (range_end - range_begin), 
range_begin);
   }
 
+  // Duplicate of ArrayCompareSorter
+  static Result<std::shared_ptr<Array>> RanksWithNulls(
+      const std::shared_ptr<Array>& array) {
+    // Notes:
+    // * The order is always ascending here, since the goal is to produce
+    //   an exactly-equivalent-order of the dictionary values.
+    // * We're going to re-emit nulls in the output, so we can just always 
consider
+    //   them "at the end".  Note that choosing AtStart would merely shift 
other
+    //   ranks by 1 if there are any nulls...
+    RankOptions rank_options(SortOrder::Ascending, NullPlacement::AtEnd,
+                             RankOptions::Dense);
+
+    auto data = array->data();
+    std::shared_ptr<Buffer> null_bitmap;
+    if (array->null_count() > 0) {
+      null_bitmap = array->null_bitmap();
+      data = array->data()->Copy();
+      if (data->offset > 0) {
+        ARROW_ASSIGN_OR_RAISE(
+            null_bitmap,
+            arrow::internal::CopyBitmap(arrow::default_memory_pool(), 
null_bitmap->data(),
+                                        data->offset, data->length));
+      }
+      data->buffers[0] = nullptr;
+      data->null_count = 0;
+    }
+    ARROW_ASSIGN_OR_RAISE(auto rank_datum,
+                          CallFunction("rank", {std::move(data)}, 
&rank_options));

Review Comment:
   I think using rank here is expensive and perhaps unnecessary? IIUC The rank 
is only used to compare two elements during merge sorting. Can we compare them 
by their values directly? Since `rank` internally also sorts the array, we are 
essentially sorting it twice.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] GH-38490: [C++][Compute] Support ChunkedArray sorting for dictionary type [arrow]

Reply via email to