tyrelr commented on a change in pull request #9215:
URL: https://github.com/apache/arrow/pull/9215#discussion_r562999845
##########
File path: rust/arrow/src/array/array_primitive.rs
##########
@@ -86,13 +86,9 @@ impl<T: ArrowPrimitiveType> PrimitiveArray<T> {
}
/// Returns the primitive value at index `i`.
- ///
- /// Note this doesn't do any bound checking, for performance reason.
- /// # Safety
- /// caller must ensure that the passed in offset is less than the array
len()
+ #[inline]
pub fn value(&self, i: usize) -> T::Native {
- let offset = i + self.offset();
- unsafe { *self.raw_values.as_ptr().add(offset) }
+ self.values()[i]
Review comment:
#9291 is good progress towards eliminating use of the function.
And certainly, we could 'split' the macro for primitives as a quick fix to
get rid of the call to the function. I've been experimenting with an
alternative approach that might be a bit more flexible to multiple use cases,
described at the bottom of this comment.
I am quite torn about whether I think value should or should not be in the
interface.
# Reasons to drop value(i) -> T::Native
I think that even if value(i) was dropped from the PrimitiveArray impl's,
efficient random access to items without a bounds check can still be achieved
through ```unsafe{*primitive_array.values().get_unchecked(i)}``` (the extra *
because get_unchecked() returns a ref to the value).
I'm not sure I have any example code or measurements to demonstrate it on
hand, but I am certain I saw the silently-unsafe
implementation```x.values().iter().zip(y.values().iter())``` did (slightly)
outperform ```(0..x.len()).map(|i|{x.value(i),y.value(i)}```. I believe it was
when I was playing with non-simd arithmetic kernels.... So that is the root of
my hesitancy, is I'm worried it doesn't actually escape any overhead, and
unintentionally lead people away from a more reliable/performant way. IF there
is a context where unsafe{x.value(i)} beats the performance of
unsafe{*x.values().get_unchecked(i)}
# Reasons to keep value(i) -> T::Native
All other array implementations have value functions as far as I recall, so
it is a nice 'consistency'.
In the back of my mind, the biggest argument to keep value(i) is for api
consistency... so long term, a 'trait' may be the place where it might fit
best? Very roughly, I'm thinking:
```
trait TypedArrowArray : ArrowArray {
type RefType;
fn is_valid(i:usize) -> bool; //bounds check
unsafe fn is_valid(i:usize) -> bool; //no bounds check
fn value(i:usize) -> RefType; //bounds check
unsafe fn value_unchecked(i:usize) -> RefType; //no bounds checked
fn iter() -> impl Iterator<Option<RefType>>;
fn iter_values() -> impl Iterator<RefType>;
}
impl <T: ArrowPrimitiveType> TypedArrowArray<&T::Native> for
PrimitiveArray<T> { ... }
impl TypedArrowArray<ArrayRef> for GenericListArray<T> { ... }
//and similar for string/binary. ... I am not sure whether struct arrays
could fit... Dictionary would not give access to 'keys', only to the values
referenced by each key? Union would require some kind of RefType that can
downcast into the actual value?
```
Of course, I am uncertain how much overhead the 'standarization' such a
trait impl implies would bring... would any kernels actually benefit from
using generic implementations against such an api, or will they always go down
to the concrete type to squeeze little short-cuts out that don't fit in the
generic interface? I'm unsure, so very (very, very) slowly experimenting...
# Summary
So in short, my thoughts are:
* I think that leaving value(i) safety consideration out of this PR makes
sense. I've rebased to drop that out - although I did leave the additional
values() test code.
* Marking it unsafe in the near future is absolutely better than being
silently-unsafe. The argument that adding bounds-checks could silently impact
external users is reasonable, taking unsafe has the larger 'warning' so that
the change isn't missed.
* Longer term, the options of deprecating it, or explicitly moving it into
an trait impl are both contenders in my mind... but neither option is directly
relevant to this PR.
Let me know if that seems reasonable.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]