[GitHub] [arrow] tyrelr commented on a change in pull request #9215: ARROW-11270: [Rust] Array slice accessors

GitBox Fri, 22 Jan 2021 17:53:15 -0800


tyrelr commented on a change in pull request #9215:
URL: https://github.com/apache/arrow/pull/9215#discussion_r562999845




##########
File path: rust/arrow/src/array/array_primitive.rs
##########
@@ -86,13 +86,9 @@ impl<T: ArrowPrimitiveType> PrimitiveArray<T> {
     }
 
     /// Returns the primitive value at index `i`.
-    ///
-    /// Note this doesn't do any bound checking, for performance reason.
-    /// # Safety
-    /// caller must ensure that the passed in offset is less than the array 
len()
+    #[inline]
     pub fn value(&self, i: usize) -> T::Native {
-        let offset = i + self.offset();
-        unsafe { *self.raw_values.as_ptr().add(offset) }
+        self.values()[i]

Review comment:
       #9291 is good progress towards eliminating use of the function.
   
   And certainly, we could 'split' the macro for primitives as a quick fix to 
get rid of the call to the function.  I've been experimenting with an 
alternative approach that might be a bit more flexible to multiple use cases, 
described at the bottom of this comment.
   
   I am quite torn about whether I think value should or should not be in the 
interface.
   # Reasons to drop value(i) -> T::Native
   I think that even if value(i) was dropped from the PrimitiveArray impl's, 
efficient random access to items without a bounds check can still be achieved 
through ```unsafe{*primitive_array.values().get_unchecked(i)}```  (the extra * 
because get_unchecked() returns a ref to the value).
   
   I'm not sure I have any example code or measurements to demonstrate it on 
hand, but I am certain I saw the silently-unsafe 
implementation```x.values().iter().zip(y.values().iter())``` did (slightly) 
outperform ```(0..x.len()).map(|i|{x.value(i),y.value(i)}```.  I believe it was 
when I was playing with non-simd arithmetic kernels....  So that is the root of 
my hesitancy, is I'm worried it doesn't actually escape any overhead, and 
unintentionally lead people away from a more reliable/performant way.  IF there 
is a context where unsafe{x.value(i)} beats the performance of 
unsafe{*x.values().get_unchecked(i)}
   
   # Reasons to keep value(i) -> T::Native
   All other array implementations have value functions as far as I recall, so 
it is a nice 'consistency'.  
   
   In the back of my mind, the biggest argument to keep value(i) is for api 
consistency... so long term, a 'trait' may be the place where it might fit 
best?  Very roughly, I'm thinking:
   ```
   trait TypedArrowArray : ArrowArray {
      type RefType;
      fn is_valid(i:usize) -> bool; //bounds check
      unsafe fn is_valid(i:usize) -> bool; //no bounds check
      fn value(i:usize) -> RefType;  //bounds check
      unsafe fn value_unchecked(i:usize) -> RefType; //no bounds checked
      fn iter() -> impl Iterator<Option<RefType>>;
      fn iter_values() -> impl Iterator<RefType>;
   }
   impl <T: ArrowPrimitiveType> TypedArrowArray<&T::Native> for 
PrimitiveArray<T> { ... }
   impl TypedArrowArray<ArrayRef> for GenericListArray<T> { ... }
   //and similar for string/binary. ... I am not sure whether struct arrays 
could fit... Dictionary would not give access to 'keys', only to the values 
referenced by each key?  Union would require some kind of RefType that can 
downcast into the actual value?
   ```
   Of course, I am uncertain how much overhead the 'standarization' such a 
trait impl implies would bring...  would any kernels actually benefit from 
using generic implementations against such an api, or will they always go down 
to the concrete type to squeeze little short-cuts out that don't fit in the 
generic interface?  I'm unsure, so very (very, very) slowly experimenting...
   
   # Summary
   
   So in short, my thoughts are:
   * I think that leaving value(i) safety consideration out of this PR makes 
sense.  I've rebased to drop that out - although I did leave the additional 
values() test code.
   * Marking it unsafe in the near future is absolutely better than being 
silently-unsafe.  The argument that adding bounds-checks could silently impact 
external users is reasonable, taking unsafe has the larger 'warning' so that 
the change isn't missed.
   * Longer term, the options of deprecating it, or explicitly moving it into 
an trait impl are both contenders in my mind... but neither option is directly 
relevant to this PR.
   
   Let me know if that seems reasonable.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] tyrelr commented on a change in pull request #9215: ARROW-11270: [Rust] Array slice accessors

Reply via email to