zanmato1984 commented on code in PR #43782:
URL: https://github.com/apache/arrow/pull/43782#discussion_r1891412003


##########
cpp/src/arrow/compute/kernels/scalar_cast_nested.cc:
##########
@@ -404,14 +400,20 @@ struct CastStruct {
 
     int out_field_index = 0;
     for (int field_index : fields_to_select) {
-      const auto& target_type = out->type()->field(out_field_index++)->type();
+      const auto& target_field = out->type()->field(out_field_index++);
+      const auto& target_type = target_field->type();
       if (field_index == kFillNullSentinel) {
         ARROW_ASSIGN_OR_RAISE(auto nulls,
                               MakeArrayOfNull(target_type->GetSharedPtr(), 
batch.length));
         out_array->child_data.push_back(nulls->data());
       } else {
         const auto& values = 
(in_array.child_data[field_index].ToArrayData()->Slice(
             in_array.offset, in_array.length));
+        if (!target_field->nullable() && values->null_count > 0) {

Review Comment:
   > I have never heard of kUnknownNullCount. I'm not sure what we should do 
here, I think I need to know a lot more background. Can someone who has 
experience with null counts help?
   
   Sorry I'm not sure if there is a well-documented rule of dealing with 
`kUnknownNullCount`, maybe @pitrou does? But in practice, I (and many other 
developers I suppose) just keep in mind that the null count can be unknown and 
is lazily computed. One can enforce such computation once the null count is 
absolutely needed (there are various APIs helping on this).
    
   > 1. This sounds like we sometimes don't know the null count, I assume 
because it is expensive to compute. Would this be a good reason to actually 
force computation to happen (behavior A), or should we be conservative, and if 
we find kUnkownNullCount, then we assume that there could be nulls and don't 
let the cast happen and therefore error (behavior B). Behavior A seems much 
more correct to me.
   
   Yeah your assumption is right. And I would go A (enforcing the null count 
computation) because IMO this is essential for what this PR is trying to do. 
Plus, null count is unknown by default, so B will make this change almost 
useless.
   
   > 2. I don't think this function should have the responsibility of looking 
at the child arrays null counts. Should we add some higher level API to arrays 
like nullCountOrChildNullCount() that actually does this messy work??
   
   IMO this is where you must make contact with child arrays, given that you 
are in the context of dealing with a `Struct` array - which is essentially a 
container of child arrays. And the original code already does so. Defining an 
API like `nullCountOrChildNullCount()` could be semantically hard/useless 
because: 1) the validity (null or not) of the whole struct and 2) the validity 
of each individual field (and you have to specify which field), mean very 
different things. So the null count of a particular field is a property of this 
particular field only and should be computed/checked individually.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to