pitrou commented on a change in pull request #11298:
URL: https://github.com/apache/arrow/pull/11298#discussion_r741984147



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -672,11 +678,64 @@ struct Utf8TitleTransform : public FunctionalCaseMappingTransform {
 template <typename Type>
 using Utf8Title = StringTransformExec<Type, Utf8TitleTransform>;
 
+struct Utf8NormalizeTransform : public FunctionalCaseMappingTransform {
+  using State = OptionsWrapper<Utf8NormalizeOptions>;
+
+  const Utf8NormalizeOptions* options;
+
+  explicit Utf8NormalizeTransform(const Utf8NormalizeOptions& options)
+      : options{&options} {}
+
+  int64_t MaxCodeunits(const uint8_t* input, int64_t ninputs,
+                       int64_t input_ncodeunits) override {
+    const auto option = GenerateUtf8NormalizeOption(options->method);
+    const auto n_chars =
+        utf8proc_decompose_custom(input, input_ncodeunits, NULL, 0, option, NULL, NULL);
+
+    // convert to byte length
+    return n_chars * 4;

Review comment:
       Ok, I don't know how much we want to optimize this, but we're calling the normalization function twice: once here to compute the output size, and once in `Transform` to actually output the data. It seems wasteful (especially as I don't think normalization is particularly fast).
   
   Also, the heuristic of multiplying by 4 to get the number of output bytes is a bit crude.
   
   My preference would be for this kernel to take another approach and simply grow the output buffer as needed (you can presize it with the input data length as a heuristic).
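   
   For illustration, here is a minimal sketch of that buffer-growing approach (not the PR's code; `DecomposeGrowing` is a hypothetical helper name). It relies on utf8proc's documented behavior that `utf8proc_decompose_custom` returns the required codepoint count even when the destination buffer is too small, so a second decomposition is only paid when the presize heuristic loses:
   
   ```cpp
   #include <utf8proc.h>
   
   #include <cstdint>
   #include <vector>
   
   // Decomposes `input` (UTF-8, `input_ncodeunits` bytes) into `codepoints`,
   // growing the buffer only when the presize heuristic is too small.
   // Returns the number of codepoints, or a negative utf8proc error code.
   utf8proc_ssize_t DecomposeGrowing(const uint8_t* input, int64_t input_ncodeunits,
                                     utf8proc_option_t option,
                                     std::vector<utf8proc_int32_t>* codepoints) {
     // Presize heuristic: UTF-8 has at most one codepoint per byte, which is
     // enough unless the normalization form expands the text (e.g. NFKD).
     codepoints->resize(input_ncodeunits);
     utf8proc_ssize_t n_chars = utf8proc_decompose_custom(
         input, input_ncodeunits, codepoints->data(),
         static_cast<utf8proc_ssize_t>(codepoints->size()), option,
         /*custom_func=*/NULL, /*custom_data=*/NULL);
     if (n_chars < 0) return n_chars;  // propagate the utf8proc error
     if (n_chars > static_cast<utf8proc_ssize_t>(codepoints->size())) {
       // The heuristic was too small: grow to the exact size utf8proc
       // reported and decompose once more. Only this case pays twice.
       codepoints->resize(n_chars);
       n_chars = utf8proc_decompose_custom(input, input_ncodeunits,
                                           codepoints->data(), n_chars, option,
                                           NULL, NULL);
     }
     return n_chars;
   }
   ```
   
   `utf8proc_reencode()` can then turn the codepoint buffer back into UTF-8 in place (each `utf8proc_int32_t` slot is 4 bytes, enough for any UTF-8-encoded codepoint) and returns the exact byte length to copy into the kernel's output.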



