bkietz commented on code in PR #43302:
URL: https://github.com/apache/arrow/pull/43302#discussion_r1722363911


##########
cpp/src/arrow/compute/kernels/scalar_cast_string.cc:
##########
@@ -305,19 +310,225 @@ BinaryToBinaryCastExec(KernelContext* ctx, const ExecSpan& batch, ExecResult* ou
       ctx, input, out->array_data().get());
 }
 
+// String View -> Offset String
+template <typename O, typename I>
+enable_if_t<is_binary_view_like_type<I>::value && is_base_binary_type<O>::value, Status>
+BinaryToBinaryCastExec(KernelContext* ctx, const ExecSpan& batch, ExecResult* out) {
+  using OutputBuilderType = typename TypeTraits<O>::BuilderType;
+  const CastOptions& options = checked_cast<const CastState&>(*ctx->state()).options;
+  const ArraySpan& input = batch[0].array;
+
+  if constexpr (!I::is_utf8 && O::is_utf8) {
+    if (!options.allow_invalid_utf8) {
+      InitializeUTF8();
+      ArraySpanVisitor<I> visitor;
+      Utf8Validator validator;
+      RETURN_NOT_OK(visitor.Visit(input, &validator));

Review Comment:
   Oh, I think I see what you mean: we could similarly assemble larger 
contiguous byte ranges on which we run a single Utf8 validation pass.
   
   For the common case of views whose out-of-line data directly follows the 
previous out-of-line bytes, this would yield one long byte range for Utf8 
validation.
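
A minimal sketch of that coalescing idea, not the PR's implementation: `ViewRange`, `CoalesceRanges`, `IsValidUtf8`, and `ValidateViews` are hypothetical names, and the validator is a plain scalar stand-in for Arrow's own. One caveat the sketch assumes matters: a merged range can be valid even when a multi-byte sequence straddles two views, so it also rejects any view that starts on a continuation byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical stand-in for an out-of-line view: a byte range in a data buffer.
struct ViewRange {
  std::size_t buffer_index;
  std::size_t offset;
  std::size_t size;
};

// Merge each view into the previous run when it starts exactly where that run ends.
std::vector<ViewRange> CoalesceRanges(const std::vector<ViewRange>& views) {
  std::vector<ViewRange> runs;
  for (const ViewRange& v : views) {
    if (!runs.empty() && runs.back().buffer_index == v.buffer_index &&
        runs.back().offset + runs.back().size == v.offset) {
      runs.back().size += v.size;
    } else {
      runs.push_back(v);
    }
  }
  return runs;
}

// Simple scalar UTF-8 check (rejects overlong forms, surrogates, > U+10FFFF).
bool IsValidUtf8(std::string_view s) {
  static constexpr std::uint32_t kMinCodepoint[] = {0, 0, 0x80, 0x800, 0x10000};
  std::size_t i = 0;
  while (i < s.size()) {
    const auto b0 = static_cast<unsigned char>(s[i]);
    if (b0 < 0x80) { ++i; continue; }
    std::size_t len;
    std::uint32_t cp;
    if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else return false;  // stray continuation byte or invalid lead byte
    if (s.size() - i < len) return false;  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
      const auto b = static_cast<unsigned char>(s[i + k]);
      if ((b & 0xC0) != 0x80) return false;
      cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < kMinCodepoint[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
      return false;
    }
    i += len;
  }
  return true;
}

// Validate all views with one validation pass per coalesced run.
bool ValidateViews(const std::vector<std::string_view>& buffers,
                   const std::vector<ViewRange>& views) {
  // Cheap per-view check: a view must not begin in the middle of a sequence.
  for (const ViewRange& v : views) {
    if (v.size == 0) continue;
    const auto first = static_cast<unsigned char>(buffers[v.buffer_index][v.offset]);
    if ((first & 0xC0) == 0x80) return false;
  }
  for (const ViewRange& run : CoalesceRanges(views)) {
    if (!IsValidUtf8(buffers[run.buffer_index].substr(run.offset, run.size))) {
      return false;
    }
  }
  return true;
}
```

When views were appended in order, `CoalesceRanges` returns a single run per data buffer, so the validator is called once on a long range instead of once per string.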
   
   Runs of inline views could be validated the same way: each inline view is a 
4-byte size (3 zero bytes and one small byte) followed by the inline data and 
zero padding bytes, so the raw bytes of a run are valid Utf8 whenever the 
inline data is.
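
To illustrate the byte-layout argument, here is a small sketch that builds a 16-byte inline view and checks that every byte outside the string payload is ASCII. `MakeInlineView` and `OverheadIsAscii` are hypothetical helpers, and the layout (4-byte size, then up to 12 data bytes, zero padded) is written out by hand rather than taken from Arrow's headers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

// Assumed inline view layout: 4-byte size followed by 12 bytes of data.
constexpr std::size_t kViewSize = 16;
constexpr std::size_t kSizeBytes = 4;
constexpr std::size_t kInlineCapacity = kViewSize - kSizeBytes;

using InlineView = std::array<unsigned char, kViewSize>;

// Pack a short string into an inline view; unused data bytes stay zero.
InlineView MakeInlineView(std::string_view s) {
  assert(s.size() <= kInlineCapacity);
  InlineView view{};  // value-initialized, so padding is 0x00
  const auto size = static_cast<std::int32_t>(s.size());
  std::memcpy(view.data(), &size, sizeof(size));
  std::memcpy(view.data() + kSizeBytes, s.data(), s.size());
  return view;
}

// True if every byte outside the payload (size field and padding) is < 0x80.
// Such bytes are single-byte UTF-8 characters, so they can neither start nor
// continue a multi-byte sequence.
bool OverheadIsAscii(const InlineView& view, std::size_t payload_size) {
  for (std::size_t i = 0; i < kViewSize; ++i) {
    const bool is_payload = i >= kSizeBytes && i < kSizeBytes + payload_size;
    if (!is_payload && view[i] >= 0x80) return false;
  }
  return true;
}
```

Because the size is at most 12, its four bytes are three zeros and one value below 0x80 regardless of byte order, which is why the check does not depend on endianness.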



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

