bjchambers commented on issue #514:
URL: https://github.com/apache/arrow-rs/issues/514#issuecomment-873427233


   I think the problem here may be more widespread than I originall thought. It 
looks like the equality methods are inconsistent in how they handle offset. For 
instance:
   
   ```
   pub fn equal(lhs: &ArrayData, rhs: &ArrayData) -> bool {
       let lhs_nulls = lhs.null_buffer();
       let rhs_nulls = rhs.null_buffer();
       utils::base_equal(lhs, rhs)
           && lhs.null_count() == rhs.null_count()
           && utils::equal_nulls(lhs, rhs, lhs_nulls, rhs_nulls, 0, 0, 
lhs.len())
           && equal_values(lhs, rhs, lhs_nulls, rhs_nulls, 0, 0, lhs.len())
   }
   ```
   
   This starts has `lhs_start` and `rhs_start` as 0. This would suggest that 
`equal_nulls` and `equal_values` should add the start to the offset.
   
   ```
   #[inline]
   pub(super) fn equal_nulls(
       lhs: &ArrayData,
       rhs: &ArrayData,
       lhs_nulls: Option<&Buffer>,
       rhs_nulls: Option<&Buffer>,
       lhs_start: usize,
       rhs_start: usize,
       len: usize,
   ) -> bool {
       let lhs_null_count = count_nulls(lhs_nulls, lhs_start, len);
       let rhs_null_count = count_nulls(rhs_nulls, rhs_start, len);
       if lhs_null_count > 0 || rhs_null_count > 0 {
           let lhs_values = lhs_nulls.unwrap().as_slice();
           let rhs_values = rhs_nulls.unwrap().as_slice();
           equal_bits(
               lhs_values,
               rhs_values,
               lhs_start + lhs.offset(),
               rhs_start + rhs.offset(),
               len,
           )
       } else {
           true
       }
   }
   ```
   
   This uses the `lhs_start` directly for counting nulls (which doesn't add the 
offset), but then adds the offset in when calling `equal_bits`. Looking at 
other parts of equality (such as boolean, primitives, etc.) it seems that it is 
inconsistent as to whether the offset should be included or not in the start 
values.
   
   I can a stab at fixing it for at least the case identified, but I'm somewhat 
curious what the intended approach is. It seems like there are two options:
   
   1. The `lhs_start` and `rhs_start` already include the offset, so none of 
the helper methods should add it again. This seems like it could include the 
offset just once (up front) and then everything else wouldn't need to, but it 
may run into more problems if some of the helper methods do include the offset 
(eg., running off the end of the array, etc.). Also may make it harder to use 
the `lhs_start` and `rhs_start` with public methods on the array data that 
respect offset.
   2. The `lhs_start` and `rhs_start` are always relative the corresponding 
offset. When indexing into the data using methods that aren't relative the 
offset, it needs to be added.
   
   It seems like option 2 is most consistent with how things are currently 
implemented. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to