brunal opened a new pull request, #9448:
URL: https://github.com/apache/arrow-rs/pull/9448

   This PR implements efficient eq, neq, distinct, not_distinct, gt, lt, ... 
for two RunArrays with the same DataType and length.
   
   The idea is to:
   
   1. Compute all values indices where the comparison must be performed.
   
   This is the union of the run-ends of both arrays.
   
   For example, given two RunArrays with run-end values:
         [3, 4, 10]
     and [2, 5, 10]
   
   The union of their run-ends is
         [2, 3, 4, 5, 10]
   
   The corresponding indices of the values array of each RunArray are:
         [0, 0, 1, 2, 2]
     and [0, 1, 1, 1, 2]
   
   2. Use apply_op_vectored() to perform the operation on the values arrays at 
those indices.
   
   3. Finally take nulls into account.
   
   4. Build a BooleanArray from the result + the null mask.
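
   The steps above can be sketched in plain Rust, without any arrow-rs types. 
This is only an illustration under simplifying assumptions, not the code in 
this PR: run-ends are `&[usize]`, values are `&[Option<i32>]` (with `None` 
standing in for null), and the helper names `merge_run_ends` and `eq_runs` 
are hypothetical. A plain `==` per merged run stands in for 
apply_op_vectored(), and a `Vec<Option<bool>>` stands in for the BooleanArray.

```rust
/// Hypothetical helper: merge two sorted run-end slices that cover the same
/// logical length. Returns (merged run-ends, indices into `a`'s values,
/// indices into `b`'s values).
fn merge_run_ends(a: &[usize], b: &[usize]) -> (Vec<usize>, Vec<usize>, Vec<usize>) {
    let (mut i, mut j) = (0, 0);
    let (mut ends, mut idx_a, mut idx_b) = (Vec::new(), Vec::new(), Vec::new());
    while i < a.len() && j < b.len() {
        // The next merged run ends wherever the earlier of the two runs ends.
        let end = a[i].min(b[j]);
        ends.push(end);
        idx_a.push(i);
        idx_b.push(j);
        // Advance every input whose current run ends exactly here.
        if a[i] == end {
            i += 1;
        }
        if b[j] == end {
            j += 1;
        }
    }
    (ends, idx_a, idx_b)
}

/// Hypothetical helper: equality of two run-encoded columns, expanded to one
/// `Option<bool>` per logical row (`None` means null).
fn eq_runs(
    ends_a: &[usize],
    vals_a: &[Option<i32>],
    ends_b: &[usize],
    vals_b: &[Option<i32>],
) -> Vec<Option<bool>> {
    let (ends, idx_a, idx_b) = merge_run_ends(ends_a, ends_b);
    let mut out = Vec::with_capacity(ends.last().copied().unwrap_or(0));
    let mut start = 0;
    for ((&end, &ia), &ib) in ends.iter().zip(&idx_a).zip(&idx_b) {
        // One comparison per merged run; a null on either side yields null.
        let result = match (vals_a[ia], vals_b[ib]) {
            (Some(x), Some(y)) => Some(x == y),
            _ => None,
        };
        // Repeat the result for every logical row covered by this run.
        out.extend(std::iter::repeat(result).take(end - start));
        start = end;
    }
    out
}
```

   With the run-ends from the example, `merge_run_ends(&[3, 4, 10], &[2, 5, 10])` 
returns the merged run-ends `[2, 3, 4, 5, 10]` and the value indices 
`[0, 0, 1, 2, 2]` and `[0, 1, 1, 1, 2]`, so only five comparisons are needed 
for ten logical rows.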
   
   Implementation thoughts:
   
   A. Returning a RunArray instead of a BooleanArray would be interesting. This 
can be more efficient: a RunArray (with values being a BooleanBuffer) would 
have a length in [1; len(input RunArray) * 2] and can be efficiently 
constructed. This would require introducing new pub functions: 
distinct_run_array, eq_run_array, etc.
   
   B. The operation is performed on all indices before looking at the nulls. 
With sparse (null-heavy) arrays this is wasteful. It might be worth skipping 
the computation when either side is null and then splicing results from 
non-null and null indices.
   
   C. There's a bit of copy-paste for downcast_primitive_array!() usage. I 
could only skip that by introducing a new macro, which didn't seem desirable.
   
   D. I find the lack of a value type for a fully typed run array annoying. 
Array and RunArray<I> are value types, but TypedRunArray<'_, I, V> is a 
reference type. This is frustrating. Some type contracts are only comments, and 
not enforced by the type system.
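
   To make point A more concrete, here is a rough sketch of how a run-encoded 
boolean result could be assembled from the merged run-ends and one comparison 
result per merged run. It uses plain slices rather than arrow-rs types, 
ignores nulls, and the name `coalesce_runs` is hypothetical; it is not part of 
this PR. Adjacent merged runs with the same result are folded into one output 
run.

```rust
/// Hypothetical helper: turn per-merged-run boolean results into a
/// run-encoded result, folding adjacent runs that share the same value.
/// Returns (output run-ends, output values).
fn coalesce_runs(ends: &[usize], results: &[bool]) -> (Vec<usize>, Vec<bool>) {
    assert_eq!(ends.len(), results.len());
    let mut out_ends: Vec<usize> = Vec::new();
    let mut out_vals: Vec<bool> = Vec::new();
    for (&end, &r) in ends.iter().zip(results) {
        if out_vals.last() == Some(&r) {
            // Same result as the previous run: extend that run instead of
            // starting a new one.
            *out_ends.last_mut().unwrap() = end;
        } else {
            out_ends.push(end);
            out_vals.push(r);
        }
    }
    (out_ends, out_vals)
}
```

   For the merged run-ends `[2, 3, 4, 5, 10]` with results 
`[true, false, false, false, true]`, this gives run-ends `[2, 5, 10]` and 
values `[true, false, true]`, so the output never has more runs than the 
merged run-ends.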

