pitrou commented on a change in pull request #11298:
URL: https://github.com/apache/arrow/pull/11298#discussion_r721342772
##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -4230,6 +4294,33 @@ const FunctionDoc utf8_reverse_doc(
"composed of multiple codepoints."),
{"strings"});
+const FunctionDoc utf8_nfd_doc(
+ "Normalization Form Canonical Decomposition",
+ ("For each string in `strings`, return an unicode normalized version.\n\n"
+ "Characters are decomposed by canonical equivalence,\n"
+ "and multiple combining characters are arranged in a specific order.\n"),
+ {"strings"});
+
+const FunctionDoc utf8_nfkd_doc(
+ "Normalization Form Compatibility Decomposition",
+ ("For each string in `strings`, return an unicode normalized version.\n\n"
+ "Characters are decomposed by compatibility,\n"
+ "and multiple combining characters are arranged in a specific order.\n"),
+ {"strings"});
+
+const FunctionDoc utf8_nfc_doc(
+ "Normalization Form Canonical Composition",
+ ("For each string in `strings`, return an unicode normalized version.\n\n"
+ "Characters are decomposed and then recomposed by canonical equivalence.\n"),
+ {"strings"});
+
+const FunctionDoc utf8_nfkc_doc(
+ "Normalization Form Compatibility Composition",
+ ("For each string in `strings`, return an unicode normalized version.\n\n"
+ "Characters are decomposed by compatibility,\n"
+ "then recomposed by canonical equivalence.\n"),
+ {"strings"});
Review comment:
Ok... API question: do we really want 4 distinct functions for this, or
would we rather have a single function "utf8_normalize" and a
Utf8NormalizationOptions class with an enum field describing the kind of
normalization?
In Python, there's a single function:
https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize
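
For reference, a minimal sketch of what the single-function shape looks like in Python. `unicodedata.normalize(form, s)` is the real standard-library call; the `normalize_all` helper and the sample strings are only illustrative assumptions, not anything from this PR:

```python
import unicodedata

# "é" as one precomposed code point (U+00E9).
composed = "\u00e9"
# The ligature "ﬁ" (U+FB01) has only a compatibility decomposition.
ligature = "\ufb01"


def normalize_all(strings, form):
    """Hypothetical helper: apply one normalization form, chosen by name."""
    return [unicodedata.normalize(form, s) for s in strings]


# One entry point, four behaviours selected by the `form` argument.
nfd = normalize_all([composed, ligature], "NFD")    # canonical decomposition
nfkd = normalize_all([composed, ligature], "NFKD")  # compatibility decomposition
nfc = normalize_all(["e\u0301"], "NFC")             # decompose, then recompose
nfkc = normalize_all([ligature], "NFKC")            # compat. decompose, recompose

print([s.encode("unicode_escape") for s in nfd + nfkd + nfc + nfkc])
```

Here the `form` string plays the role that the enum field of the proposed options class would play on the C++ side; the canonical forms leave the ligature untouched, while the compatibility forms expand it to "fi".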