edponce commented on a change in pull request #11233:
URL: https://github.com/apache/arrow/pull/11233#discussion_r720698511
##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -1132,52 +1132,51 @@ void AddMatchSubstring(FunctionRegistry* registry) {
{
auto func = std::make_shared<ScalarFunction>("match_substring", Arity::Unary(),
&match_substring_doc);
- auto exec_32 = MatchSubstring<StringType, PlainSubstringMatcher>::Exec;
- auto exec_64 = MatchSubstring<LargeStringType, PlainSubstringMatcher>::Exec;
- DCHECK_OK(func->AddKernel({utf8()}, boolean(), exec_32, MatchSubstringState::Init));
- DCHECK_OK(
- func->AddKernel({large_utf8()}, boolean(), exec_64, MatchSubstringState::Init));
+ for (const auto& ty : BaseBinaryTypes()) {
+ auto exec =
+ GenerateTypeAgnosticVarBinaryBase<MatchSubstring, PlainSubstringMatcher>(ty);
+ DCHECK_OK(func->AddKernel({ty}, boolean(), exec, MatchSubstringState::Init));
+ }
DCHECK_OK(registry->AddFunction(std::move(func)));
}
{
auto func = std::make_shared<ScalarFunction>("starts_with", Arity::Unary(),
&match_substring_doc);
- auto exec_32 = MatchSubstring<StringType, PlainStartsWithMatcher>::Exec;
- auto exec_64 = MatchSubstring<LargeStringType, PlainStartsWithMatcher>::Exec;
- DCHECK_OK(func->AddKernel({utf8()}, boolean(), exec_32, MatchSubstringState::Init));
- DCHECK_OK(
- func->AddKernel({large_utf8()}, boolean(), exec_64, MatchSubstringState::Init));
+ for (const auto& ty : BaseBinaryTypes()) {
+ auto exec =
+ GenerateTypeAgnosticVarBinaryBase<MatchSubstring, PlainStartsWithMatcher>(ty);
+ DCHECK_OK(func->AddKernel({ty}, boolean(), exec, MatchSubstringState::Init));
+ }
Review comment:
I will test RE2 standalone to see whether it supports binary (non-encoded) data.
AFAIK, null bytes will always be hard to handle because it is ambiguous whether
a null byte is part of the data or a terminator. For example, consider a
matching pattern like:
```c++
MatchSubstringOptions(/*pattern=*/"\xff\x00\xaa");
```
`pattern` is typed as `std::string`, and constructing it from this literal stops
at the first null byte, so `pattern.length()` will be 1 instead of 3.
If `pattern` were typed as `const char*`, it would require an explicit `length`,
which is unsafe because there is no way to validate that it matches the actual
data. It seems we would need a container that does not rely on null termination,
such as a byte array (`uint8_t pattern[] = {...}`), to represent bytes.
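
To make the length issue concrete, here is a minimal sketch of how the standard
library treats the same three bytes depending on how the `std::string` is built.
The helper names are hypothetical and not part of Arrow; the sketch only
exercises `std::string` and says nothing about how RE2 or the kernels handle
embedded null bytes.

```c++
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helpers for illustration only; none of these exist in Arrow.

// Built from a C string: the constructor scans up to the first null byte,
// so only "\xff" is kept.
inline std::string FromCString() { return std::string("\xff\x00\xaa"); }

// Built from a pointer plus an explicit length: embedded null bytes survive,
// but the caller is responsible for passing the right length.
inline std::string FromPointerAndLength() {
  return std::string("\xff\x00\xaa", 3);
}

// Built with the C++14 `s` literal suffix: the length comes from the literal
// itself, so no manual count is needed.
inline std::string FromStringLiteral() {
  using namespace std::string_literals;
  return "\xff\x00\xaa"s;
}

// Built from a raw byte array: sizeof gives the byte count directly.
inline std::string FromBytes() {
  const std::uint8_t bytes[] = {0xff, 0x00, 0xaa};
  return std::string(reinterpret_cast<const char*>(bytes), sizeof(bytes));
}
```

Only the first helper loses data; the other three produce identical 3-byte
strings. So `std::string` can carry null bytes as long as the pattern is never
round-tripped through a bare `const char*`.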