lidavidm commented on a change in pull request #11233:
URL: https://github.com/apache/arrow/pull/11233#discussion_r720699550
##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -1132,52 +1132,51 @@ void AddMatchSubstring(FunctionRegistry* registry) {
{
auto func = std::make_shared<ScalarFunction>("match_substring", Arity::Unary(),
&match_substring_doc);
- auto exec_32 = MatchSubstring<StringType, PlainSubstringMatcher>::Exec;
- auto exec_64 = MatchSubstring<LargeStringType, PlainSubstringMatcher>::Exec;
- DCHECK_OK(func->AddKernel({utf8()}, boolean(), exec_32, MatchSubstringState::Init));
- DCHECK_OK(
- func->AddKernel({large_utf8()}, boolean(), exec_64, MatchSubstringState::Init));
+ for (const auto& ty : BaseBinaryTypes()) {
+ auto exec =
+ GenerateTypeAgnosticVarBinaryBase<MatchSubstring, PlainSubstringMatcher>(ty);
+ DCHECK_OK(func->AddKernel({ty}, boolean(), exec, MatchSubstringState::Init));
+ }
DCHECK_OK(registry->AddFunction(std::move(func)));
}
{
auto func = std::make_shared<ScalarFunction>("starts_with", Arity::Unary(),
&match_substring_doc);
- auto exec_32 = MatchSubstring<StringType, PlainStartsWithMatcher>::Exec;
- auto exec_64 = MatchSubstring<LargeStringType, PlainStartsWithMatcher>::Exec;
- DCHECK_OK(func->AddKernel({utf8()}, boolean(), exec_32, MatchSubstringState::Init));
- DCHECK_OK(
- func->AddKernel({large_utf8()}, boolean(), exec_64, MatchSubstringState::Init));
+ for (const auto& ty : BaseBinaryTypes()) {
+ auto exec =
+ GenerateTypeAgnosticVarBinaryBase<MatchSubstring, PlainStartsWithMatcher>(ty);
+ DCHECK_OK(func->AddKernel({ty}, boolean(), exec, MatchSubstringState::Init));
+ }
Review comment:
I should also note that a null byte is valid UTF-8 (it encodes code point
zero), so using nulls in tests doesn't actually exercise handling of non-UTF-8
data. For that we should use something like a 3-byte code point that has been
truncated. Separately, we do need to be able to support nulls. For instance:
```python
>>> b"\x00".decode("utf-8")
'\x00'
>>> b"\x80".decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
```
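As a rough sketch of what the "truncated 3-byte code point" test input could look like: the choice of U+20AC and the helper name `is_valid_utf8` below are only illustrative and are not taken from the PR or from Arrow's test suite.

```python
# U+20AC (the euro sign) encodes to three bytes in UTF-8: e2 82 ac.
full = "\u20ac".encode("utf-8")

# Dropping the final continuation byte leaves an incomplete sequence,
# which is not valid UTF-8.
truncated = full[:2]


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(b"\x00"))                   # null byte: valid
print(is_valid_utf8(truncated))                 # truncated sequence: invalid
print(is_valid_utf8(b"a" + truncated + b"b"))   # invalid in the middle of ASCII
```

Bytes like `truncated` are what the binary kernels should be tested with, while `b"\x00"` remains useful for checking that embedded nulls are handled.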