edponce commented on a change in pull request #11233:
URL: https://github.com/apache/arrow/pull/11233#discussion_r720633466
##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -1132,52 +1132,51 @@ void AddMatchSubstring(FunctionRegistry* registry) {
{
auto func = std::make_shared<ScalarFunction>("match_substring",
Arity::Unary(),
&match_substring_doc);
- auto exec_32 = MatchSubstring<StringType, PlainSubstringMatcher>::Exec;
- auto exec_64 = MatchSubstring<LargeStringType, PlainSubstringMatcher>::Exec;
- DCHECK_OK(func->AddKernel({utf8()}, boolean(), exec_32, MatchSubstringState::Init));
- DCHECK_OK(
- func->AddKernel({large_utf8()}, boolean(), exec_64, MatchSubstringState::Init));
+ for (const auto& ty : BaseBinaryTypes()) {
+ auto exec =
+ GenerateTypeAgnosticVarBinaryBase<MatchSubstring, PlainSubstringMatcher>(ty);
+ DCHECK_OK(func->AddKernel({ty}, boolean(), exec, MatchSubstringState::Init));
+ }
DCHECK_OK(registry->AddFunction(std::move(func)));
}
{
auto func = std::make_shared<ScalarFunction>("starts_with", Arity::Unary(),
&match_substring_doc);
- auto exec_32 = MatchSubstring<StringType, PlainStartsWithMatcher>::Exec;
- auto exec_64 = MatchSubstring<LargeStringType, PlainStartsWithMatcher>::Exec;
- DCHECK_OK(func->AddKernel({utf8()}, boolean(), exec_32, MatchSubstringState::Init));
- DCHECK_OK(
- func->AddKernel({large_utf8()}, boolean(), exec_64, MatchSubstringState::Init));
+ for (const auto& ty : BaseBinaryTypes()) {
+ auto exec =
+ GenerateTypeAgnosticVarBinaryBase<MatchSubstring, PlainStartsWithMatcher>(ty);
+ DCHECK_OK(func->AddKernel({ty}, boolean(), exec, MatchSubstringState::Init));
+ }
Review comment:
*Note: binary = bytes, Binary = data type*

I tested RE2 with the UTF8 and Latin1 encodings using invalid ASCII/Latin1/UTF8
input and found that RE2 does not support finding, matching, or regex
operations when the pattern or the data contains bytes that are not validly
encoded. It does not raise an error in that case; it simply fails to match
anything.
[It seems RE2 is limited to decoding a representation of binary
numbers](https://github.com/google/re2/blob/main/re2/re2.h#L202).

Since RE2 is only used when `ignore_case` is enabled (and when all
best-effort attempts fail in `match_like`), I propose raising a
`NotImplemented` error when `ignore_case` is set for a Binary data type. Binary
and String types are distinguished by the [`Type::is_utf8`
field](https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L580).

The issue with this is that several string functions register only Binary
kernels, so the error would be triggered even when the input is a validly
encoded string. To work around this, we would need to register 4 kernels for
each string function ([Large]BinaryType, [Large]StringType), so no
_TypeAgnostic_ kernel execs.
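
As a rough illustration of the proposed check, here is a minimal standalone sketch. It does not use the Arrow headers: the four type structs, `Status`, `MatchOptions`, and `CheckMatchOptions` are all hypothetical stand-ins, and only the `is_utf8` flag and the `ignore_case` option correspond to names mentioned above.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-ins for the Arrow type classes; only is_utf8 matters here.
struct BinaryType { static constexpr bool is_utf8 = false; };
struct LargeBinaryType { static constexpr bool is_utf8 = false; };
struct StringType { static constexpr bool is_utf8 = true; };
struct LargeStringType { static constexpr bool is_utf8 = true; };

// Hypothetical, simplified status type (not arrow::Status).
struct Status {
  bool ok;
  std::string message;
  static Status OK() { return {true, ""}; }
  static Status NotImplemented(std::string msg) { return {false, std::move(msg)}; }
};

// Hypothetical options struct carrying only the ignore_case flag.
struct MatchOptions {
  bool ignore_case = false;
};

// Reject case-insensitive matching for Binary types, since that path would
// go through RE2, which silently fails to match on non-encoded bytes.
template <typename Type>
Status CheckMatchOptions(const MatchOptions& options) {
  if (options.ignore_case && !Type::is_utf8) {
    return Status::NotImplemented("ignore_case is not supported for Binary types");
  }
  return Status::OK();
}
```

Because the check is a template on the concrete type, it only works if each of the 4 types gets its own kernel instantiation, which is why the _TypeAgnostic_ execs would have to go.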