edponce commented on a change in pull request #11233:
URL: https://github.com/apache/arrow/pull/11233#discussion_r717247336
##########
File path: cpp/src/arrow/compute/kernels/codegen_internal.h
##########
@@ -1188,6 +1188,24 @@ ArrayKernelExec
GenerateTypeAgnosticVarBinaryBase(detail::GetTypeId get_id) {
}
}
+// similar to GenerateTypeAgnosticPrimitive, but for variable binary and
string types
+template <template <typename...> class Generator, typename... Args>
+ArrayKernelExec GenerateTypeAgnosticVarBinary(detail::GetTypeId get_id) {
Review comment:
Yes, I think we should standardize better the naming convention for all
GDs (not only for binary/string types). Maybe we can revisit/resolve the naming
convention in [ARROW-14134](https://issues.apache.org/jira/browse/ARROW-14134).
The scalar binary kernel exec generators use `EqualTypes` to refer to same
input types, so we should use another term of input/output type matching. I
think `isomorphic` is the correct terminology to use for the non-type agnostic
case.
Another detail is the use of `Var`, I think it should be removed since
`Binary` and `String` types imply variable-width. Instead the GDs for
`FixedSizeBinary` should use `FixedSize` in its name. This should be resolved
in naming convention PR.
For this PR, what about using the following naming conventions for the
variable-sized binary/string GDs?
**Note: That the only GD name being changed is the one added in this PR.**
* `GenerateTypeAgnosticVarBinaryBase`
* no explicit output type -> output type is binary type
* combines binary/string kernels
* GenerateVarBinaryBase
* use `Type0`
* combines binary/string kernels
* GenerateVarBinary
* use `Type0`
* independent binary/string kernels
* GenerateIsomorphicVarBinary
* no explicit output type -> output type same as input type
* independent binary/string kernels
##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2201,11 +2201,11 @@ void AddSplitWhitespaceAscii(FunctionRegistry*
registry) {
auto func =
std::make_shared<ScalarFunction>("ascii_split_whitespace",
Arity::Unary(),
&ascii_split_whitespace_doc,
&default_options);
- using t32 = SplitWhitespaceAsciiExec<StringType, ListType>;
- using t64 = SplitWhitespaceAsciiExec<LargeStringType, ListType>;
- DCHECK_OK(func->AddKernel({utf8()}, {list(utf8())}, t32::Exec,
t32::State::Init));
- DCHECK_OK(
- func->AddKernel({large_utf8()}, {list(large_utf8())}, t64::Exec,
t64::State::Init));
+
+ for (const auto& ty : StringTypes()) {
+ auto exec = GenerateTypeAgnosticVarBinary<SplitWhitespaceAsciiExec,
ListType>(ty);
+ DCHECK_OK(func->AddKernel({ty}, {list(ty)}, exec, SplitState::Init));
+ }
Review comment:
Based on [Zulip
discussion](https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Stringlike.20kernels.20on.20binary.20data),
we can support binary types for ASCII-named compute functions.
##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2201,11 +2201,11 @@ void AddSplitWhitespaceAscii(FunctionRegistry*
registry) {
auto func =
std::make_shared<ScalarFunction>("ascii_split_whitespace",
Arity::Unary(),
&ascii_split_whitespace_doc,
&default_options);
- using t32 = SplitWhitespaceAsciiExec<StringType, ListType>;
- using t64 = SplitWhitespaceAsciiExec<LargeStringType, ListType>;
- DCHECK_OK(func->AddKernel({utf8()}, {list(utf8())}, t32::Exec,
t32::State::Init));
- DCHECK_OK(
- func->AddKernel({large_utf8()}, {list(large_utf8())}, t64::Exec,
t64::State::Init));
+
+ for (const auto& ty : StringTypes()) {
+ auto exec = GenerateTypeAgnosticVarBinary<SplitWhitespaceAsciiExec,
ListType>(ty);
+ DCHECK_OK(func->AddKernel({ty}, {list(ty)}, exec, SplitState::Init));
+ }
Review comment:
Based on [Zulip
discussion](https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Stringlike.20kernels.20on.20binary.20data),
we may support binary types for ASCII-named compute functions.
##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2201,11 +2201,11 @@ void AddSplitWhitespaceAscii(FunctionRegistry*
registry) {
auto func =
std::make_shared<ScalarFunction>("ascii_split_whitespace",
Arity::Unary(),
&ascii_split_whitespace_doc,
&default_options);
- using t32 = SplitWhitespaceAsciiExec<StringType, ListType>;
- using t64 = SplitWhitespaceAsciiExec<LargeStringType, ListType>;
- DCHECK_OK(func->AddKernel({utf8()}, {list(utf8())}, t32::Exec,
t32::State::Init));
- DCHECK_OK(
- func->AddKernel({large_utf8()}, {list(large_utf8())}, t64::Exec,
t64::State::Init));
+
+ for (const auto& ty : StringTypes()) {
+ auto exec = GenerateTypeAgnosticVarBinary<SplitWhitespaceAsciiExec,
ListType>(ty);
+ DCHECK_OK(func->AddKernel({ty}, {list(ty)}, exec, SplitState::Init));
+ }
Review comment:
Created [ARROW-14151](https://issues.apache.org/jira/browse/ARROW-14151)
for the ASCII functions.
##########
File path: cpp/src/arrow/compute/kernels/codegen_internal.h
##########
@@ -1188,6 +1188,24 @@ ArrayKernelExec
GenerateTypeAgnosticVarBinaryBase(detail::GetTypeId get_id) {
}
}
+// similar to GenerateTypeAgnosticPrimitive, but for variable binary and
string types
+template <template <typename...> class Generator, typename... Args>
+ArrayKernelExec GenerateTypeAgnosticVarBinary(detail::GetTypeId get_id) {
Review comment:
Ok, I will keep it simple for this PR and use the `Returning` variant.
I think the using the suffix of the form `ToType` is good. Better explicit
and verbose than short and confusing.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]