pitrou commented on a change in pull request #8621:
URL: https://github.com/apache/arrow/pull/8621#discussion_r560227216



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -186,16 +172,51 @@ struct UTF8Transform {
   }
 };
 
+#ifdef ARROW_WITH_UTF8PROC
+
+// transforms per codepoint
+template <typename Type, typename Derived>
+struct StringTransformCodepoint : StringTransform<Type, Derived> {
+  using Base = StringTransform<Type, Derived>;
+  using offset_type = typename Base::offset_type;
+
+  bool Transform(const uint8_t* input, offset_type input_string_ncodeunits,
+                 uint8_t* output, offset_type* output_written) {
+    uint8_t* output_start = output;
+    if (ARROW_PREDICT_FALSE(
+            !arrow::util::UTF8Transform(input, input + input_string_ncodeunits, &output,
+                                        Derived::TransformCodepoint))) {
+      return false;
+    }
+    *output_written = static_cast<offset_type>(output - output_start);
+    return true;
+  }
+  static int64_t MaxCodepoints(offset_type input_ncodeunits) {
+    // Section 5.18 of the Unicode spec claim that the number of codepoints for case
+    // mapping can grow by a factor of 3. This means grow by a factor of 3 in bytes
+    // However, since we don't support all casings (SpecialCasing.txt) the growth
+    // is actually only at max 3/2 (as covered by the unittest).
+    // Note that rounding down the 3/2 is ok, since only codepoints encoded by
+    // two code units (even) can grow to 3 code units.
+
+    return static_cast<int64_t>(input_ncodeunits) * 3 / 2;

Review comment:
       Now that I read this again, it strikes me that this function is estimating a number of codepoints, but we're using it to allocate a number of bytes (i.e. UTF-8 code units). Is that ok?
   
   (and/or the function name and its comment are inconsistent with how the result is used)
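
   To make the distinction concrete, here is a minimal sketch, not Arrow code. The helper names `Utf8EncodedLength`, `CountCodepoints` and `MaxOutputCodeunits` are hypothetical and only illustrate the point. It assumes the simple case mapping U+0250 -> U+2C6F as an example of a codepoint that takes two UTF-8 code units before upper-casing and three after: the codepoint count stays at 1 while the byte count grows by 3/2, which suggests the bound in the diff is really a bound on code units.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Number of UTF-8 code units (bytes) needed to encode a single codepoint.
inline int Utf8EncodedLength(uint32_t codepoint) {
  assert(codepoint <= 0x10FFFF);
  if (codepoint < 0x80) return 1;
  if (codepoint < 0x800) return 2;
  if (codepoint < 0x10000) return 3;
  return 4;
}

// Number of codepoints in well-formed UTF-8: every byte that is not a
// continuation byte (10xxxxxx) starts a new codepoint.
inline int64_t CountCodepoints(const std::string& utf8) {
  int64_t count = 0;
  for (unsigned char byte : utf8) {
    if ((byte & 0xC0) != 0x80) ++count;
  }
  return count;
}

// Hypothetical rename: same arithmetic as MaxCodepoints in the diff, but the
// name states that both the argument and the result are UTF-8 code units.
inline int64_t MaxOutputCodeunits(int64_t input_ncodeunits) {
  return input_ncodeunits * 3 / 2;
}
```

   With the input "\xC9\x90" (U+0250), `CountCodepoints` returns 1 and the string occupies 2 bytes, while the upper-cased codepoint U+2C6F needs `Utf8EncodedLength(0x2C6F)` bytes, which fits within `MaxOutputCodeunits(2)`. So the arithmetic looks adequate for a byte allocation; it is the name and comment that talk about codepoints.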





