edponce commented on a change in pull request #11233:
URL: https://github.com/apache/arrow/pull/11233#discussion_r720633466



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -1132,52 +1132,51 @@ void AddMatchSubstring(FunctionRegistry* registry) {
   {
     auto func = std::make_shared<ScalarFunction>("match_substring", Arity::Unary(),
                                                  &match_substring_doc);
-    auto exec_32 = MatchSubstring<StringType, PlainSubstringMatcher>::Exec;
-    auto exec_64 = MatchSubstring<LargeStringType, PlainSubstringMatcher>::Exec;
-    DCHECK_OK(func->AddKernel({utf8()}, boolean(), exec_32, MatchSubstringState::Init));
-    DCHECK_OK(
-        func->AddKernel({large_utf8()}, boolean(), exec_64, MatchSubstringState::Init));
+    for (const auto& ty : BaseBinaryTypes()) {
+      auto exec =
+          GenerateTypeAgnosticVarBinaryBase<MatchSubstring, PlainSubstringMatcher>(ty);
+      DCHECK_OK(func->AddKernel({ty}, boolean(), exec, MatchSubstringState::Init));
+    }
     DCHECK_OK(registry->AddFunction(std::move(func)));
   }
   {
     auto func = std::make_shared<ScalarFunction>("starts_with", Arity::Unary(),
                                                  &match_substring_doc);
-    auto exec_32 = MatchSubstring<StringType, PlainStartsWithMatcher>::Exec;
-    auto exec_64 = MatchSubstring<LargeStringType, PlainStartsWithMatcher>::Exec;
-    DCHECK_OK(func->AddKernel({utf8()}, boolean(), exec_32, MatchSubstringState::Init));
-    DCHECK_OK(
-        func->AddKernel({large_utf8()}, boolean(), exec_64, MatchSubstringState::Init));
+    for (const auto& ty : BaseBinaryTypes()) {
+      auto exec =
+          GenerateTypeAgnosticVarBinaryBase<MatchSubstring, PlainStartsWithMatcher>(ty);
+      DCHECK_OK(func->AddKernel({ty}, boolean(), exec, MatchSubstringState::Init));
+    }

Review comment:
       *Note: binary = bytes, Binary = data type*
   
   I tested RE2 with its UTF8 and Latin1 encodings on invalid ASCII/Latin1/UTF8
input and found that RE2 does not support finding/matching/regex when the
pattern or the data contain bytes that are not validly encoded. It does not
raise an error in that case; it simply fails to match anything.
   
   [It seems RE2 is limited to decoding representation of binary 
numbers](https://github.com/google/re2/blob/main/re2/re2.h#L202).
   
   Since RE2 is only used when `ignore_case` is enabled (and when all
best-effort attempts fail in `match_like`), I propose raising a
`NotImplemented` error when `ignore_case` is set for a Binary data type. Binary
and String types can be told apart by the
[`Type::is_utf8` field](https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L580).
The issue with this is that several string functions register only Binary
kernels, so the error would be triggered even when the input is a validly
encoded string. To work around this, we would need to register 4 kernels for
each string function ([Large]BinaryType, [Large]StringType) instead of using
_TypeAgnostic_ kernel execs.
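
   To make the proposal concrete, here is a rough, standalone sketch of the
check I have in mind. It does not use the Arrow headers: every name ending in
`Sketch` is a hypothetical stand-in (for the four type classes, for
`arrow::Status`, and for the match options), and `CheckMatchOptions` is an
invented helper name. The only detail taken from Arrow is the static `is_utf8`
flag on the type class.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-ins for the four Arrow type classes. Only the static
// `is_utf8` flag is modeled.
struct BinaryTypeSketch { static constexpr bool is_utf8 = false; };
struct LargeBinaryTypeSketch { static constexpr bool is_utf8 = false; };
struct StringTypeSketch { static constexpr bool is_utf8 = true; };
struct LargeStringTypeSketch { static constexpr bool is_utf8 = true; };

// Hypothetical, minimal replacement for arrow::Status.
struct StatusSketch {
  bool ok;
  std::string message;

  static StatusSketch OK() { return {true, ""}; }
  static StatusSketch NotImplemented(std::string msg) {
    return {false, "NotImplemented: " + std::move(msg)};
  }
};

// Hypothetical options struct carrying the two fields relevant here.
struct MatchOptionsSketch {
  std::string pattern;
  bool ignore_case = false;
};

// Reject case-insensitive matching for Binary types, since that path would go
// through RE2. Instantiated once per registered type, so the decision depends
// on the kernel's declared type and not on the actual bytes.
template <typename Type>
StatusSketch CheckMatchOptions(const MatchOptionsSketch& options) {
  if (options.ignore_case && !Type::is_utf8) {
    return StatusSketch::NotImplemented(
        "ignore_case is not supported for Binary data types");
  }
  return StatusSketch::OK();
}
```

   With this shape, `CheckMatchOptions<BinaryTypeSketch>` fails when
`ignore_case` is set while `CheckMatchOptions<StringTypeSketch>` succeeds. That
is why each of the four types would need its own kernel: a type-agnostic exec
has no `is_utf8` to consult.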





