edponce commented on a change in pull request #11233:
URL: https://github.com/apache/arrow/pull/11233#discussion_r720698511



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -1132,52 +1132,51 @@ void AddMatchSubstring(FunctionRegistry* registry) {
   {
     auto func = std::make_shared<ScalarFunction>("match_substring", Arity::Unary(),
                                                  &match_substring_doc);
-    auto exec_32 = MatchSubstring<StringType, PlainSubstringMatcher>::Exec;
-    auto exec_64 = MatchSubstring<LargeStringType, PlainSubstringMatcher>::Exec;
-    DCHECK_OK(func->AddKernel({utf8()}, boolean(), exec_32, MatchSubstringState::Init));
-    DCHECK_OK(
-        func->AddKernel({large_utf8()}, boolean(), exec_64, MatchSubstringState::Init));
+    for (const auto& ty : BaseBinaryTypes()) {
+      auto exec =
+          GenerateTypeAgnosticVarBinaryBase<MatchSubstring, PlainSubstringMatcher>(ty);
+      DCHECK_OK(func->AddKernel({ty}, boolean(), exec, MatchSubstringState::Init));
+    }
     DCHECK_OK(registry->AddFunction(std::move(func)));
   }
   {
     auto func = std::make_shared<ScalarFunction>("starts_with", Arity::Unary(),
                                                  &match_substring_doc);
-    auto exec_32 = MatchSubstring<StringType, PlainStartsWithMatcher>::Exec;
-    auto exec_64 = MatchSubstring<LargeStringType, PlainStartsWithMatcher>::Exec;
-    DCHECK_OK(func->AddKernel({utf8()}, boolean(), exec_32, MatchSubstringState::Init));
-    DCHECK_OK(
-        func->AddKernel({large_utf8()}, boolean(), exec_64, MatchSubstringState::Init));
+    for (const auto& ty : BaseBinaryTypes()) {
+      auto exec =
+          GenerateTypeAgnosticVarBinaryBase<MatchSubstring, PlainStartsWithMatcher>(ty);
+      DCHECK_OK(func->AddKernel({ty}, boolean(), exec, MatchSubstringState::Init));
+    }

Review comment:
       I will test RE2 standalone to check its support for non-encoded binary data.
   
   AFAIK, null bytes will always be hard to handle because it is ambiguous 
whether a null byte is part of the data or a terminator. For example, 
consider a matching pattern like:
   ```c++
   MatchSubstringOptions(/*pattern=*/"\xff\x00\xaa");
   ```
   `pattern` is typed as `std::string`, but a string literal is converted 
through `const char*`, so construction stops at the first null byte and 
`pattern.length()` will be incorrect (and if an explicit length is used, it 
can result in UB because of calls to `c_str()`). If `pattern` were typed as 
`const char*`, it would require an explicit `length`, which is unsafe because 
there is no way to validate that it matches the actual string. It seems we 
would need a non-encoding container such as an array 
(`uint8_t pattern[] = {...}`) to represent bytes.
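
To make the length issue concrete, here is a minimal standalone sketch in plain C++ (not Arrow code). The helper names `FromCString`, `FromLiteralWithLength`, `FromByteArray`, and `CheckPatternLengths` are hypothetical and exist only for illustration. It builds the same three bytes in three ways and compares the resulting lengths:

```c++
#include <cassert>
#include <cstdint>
#include <string>

// Pattern bytes under test: 0xff, 0x00, 0xaa (note the embedded null byte).

// Built through the const char* constructor: copying stops at the first '\0'.
inline std::string FromCString() { return std::string("\xff\x00\xaa"); }

// Built from the same literal with a compiler-computed length: all bytes kept.
inline std::string FromLiteralWithLength() {
  static const char kBytes[] = "\xff\x00\xaa";
  // sizeof counts the implicit terminator, so subtract one.
  return std::string(kBytes, sizeof(kBytes) - 1);
}

// Built from a raw byte array, as suggested above: no terminator involved.
inline std::string FromByteArray() {
  static const uint8_t kPattern[] = {0xff, 0x00, 0xaa};
  return std::string(reinterpret_cast<const char*>(kPattern), sizeof(kPattern));
}

// Illustrative check (assumption: asserts are enabled, i.e. NDEBUG is not set).
inline void CheckPatternLengths() {
  assert(FromCString().size() == 1);            // truncated at the null byte
  assert(FromLiteralWithLength().size() == 3);  // embedded null preserved
  assert(FromByteArray().size() == 3);          // embedded null preserved
  assert(FromByteArray() == FromLiteralWithLength());
}
```

In this sketch only the first form loses bytes. The other two keep the embedded null because the length comes from `sizeof` rather than from a hand-written value, which may sidestep part of the validation concern, though whether that fits the options API here is a separate question.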




