alamb commented on code in PR #17637:
URL: https://github.com/apache/datafusion/pull/17637#discussion_r2364094340
##########
datafusion/common/src/nested_struct.rs:
##########
@@ -162,6 +185,43 @@ pub fn cast_column(
}
}
+/// Sanitizes a `BinaryView` array so that any element containing invalid UTF-8
+/// is converted to null before casting to a UTF-8 string array.
+///
+/// This only transforms the array's values (not any external statistics). Other
+/// binary array representations are returned unchanged because Arrow's safe
+/// casts already convert invalid UTF-8 sequences to null for those types.
+pub fn sanitize_binary_array_for_utf8(array: ArrayRef) -> ArrayRef {
+    match array.data_type() {
+        DataType::BinaryView => {
+            let binary_view = array.as_binary_view();
+
+            // Check if all bytes are already valid UTF-8
Review Comment:
This seems redundant with calling `cast_with_options`
(https://docs.rs/arrow/latest/arrow/compute/fn.cast_with_options.html) with
`CastOptions::safe` set to `true` 🤔
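
For reference, the null-on-invalid behavior under discussion can be sketched with only the standard library. This is a rough illustration of the semantics, not Arrow's implementation and not the code in this PR; the helper name `null_out_invalid_utf8` and the use of `Vec<Option<..>>` in place of an Arrow array are assumptions made for the example.

```rust
/// Hypothetical helper (not part of Arrow or DataFusion): models a nullable
/// binary column as a slice of `Option<&[u8]>` and returns a nullable string
/// column in which every element that is not valid UTF-8 becomes `None`.
fn null_out_invalid_utf8(values: &[Option<&[u8]>]) -> Vec<Option<String>> {
    values
        .iter()
        .map(|value| {
            // Existing nulls stay null; invalid byte sequences become null too.
            value.and_then(|bytes| std::str::from_utf8(bytes).ok().map(str::to_owned))
        })
        .collect()
}
```

If a safe cast already behaves this way for `BinaryView` input, the extra sanitizing pass would not be needed; that is worth confirming against the Arrow version in use.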
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]