Re: [I] [C++] A more efficient "case_when" specialization for list-view types [arrow]

via GitHub Wed, 05 Jun 2024 11:50:13 -0700


felipecrv commented on issue #41453:
URL: https://github.com/apache/arrow/issues/41453#issuecomment-2150735331


   Expanding. Let's say we have the following relational query formula that 
produces a list-view typed column from list-views `a`, `b`, and `c`:
   
   ```sql
   CASE
       WHEN list_value_length(a) > 1 THEN list_slice(a, 1)
       WHEN list_value_length(b) > 1 THEN list_slice(b, 1)
       ELSE coalesce(c, [0])
   END AS resulting_list
   ```
   
   In a LISP-like intermediate representation, this can be written as:
   
   ```lisp
   (cond
     (> (list_value_length a) 1) (list_slice a 1)
     (> (list_value_length b) 1) (list_slice b 1)
     else     (coalesce b [0]))
   ```
   
   Due to how expressions are currently represented in `arrow::compute` we are 
forced to eagerly eval all the conditions and branches. Even when the condition 
is false for that branch. We need `cond` to be a special-form [1] meaning that 
all expressions are passed "by name" [2].
   
   The evaluation of the expressions for this example should happen in this 
order:
   
   ```cpp
   auto output = pre_allocate_list_view(batch.length);
   
   // builds an array with the lengths of a
   auto length_a = list_value_length(a);  
   // builds the first pair of sel. vectors (for the first condition)
   auto sel0_true, sel0_false = greater_than(length_a, 1);
   
   // evaluate the first expression only on positions in sel0_true.
   // since `output` is a list-view, all these random positions in the
   // output can be written now before the other branches write to
   // their positions.
   // (we pass sel0_true to the list_slice kernel as an optional parameter)
   list_slice[sel0_true](a, 1, &output);
   
   // builds an array with the lengths of b
   // (note that now we pass the sel0_false selection vector because
   // we only have to run `list_value_length` on the positions where the
   // previous conditions were false)
   auto length_b = list_value_length[sel0_false](b);
   // builds the next pair of selection vectors
   // (note that greater_than[sel0_false] only checks the relevant positions,
   // consequently, sel1_true/false are going to be smaller than the first
   // pair of selection vectors)
   auto sel1_true, sel1_false = greater_than[sel0_false](length_a, 1);
   
   // evaluate the second branch's expression. none of the positions
   // in sel1_true were present in sel0_true, so we're only writing on
   // values that weren't written before
   list_slice[sel1_true](b, 1, &output);
   
   // now evaluate the else case with the last negative selection vector
   coalesce[sel1_false](b, [0])
   ```
   
   
   [1] https://github.com/apache/arrow/issues/41094#issuecomment-2087716483
   [2] [Call by 
name](https://en.wikipedia.org/wiki/Evaluation_strategy#Call_by_name) (easily 
and often confused as "lazy evaluation")


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [I] [C++] A more efficient "case_when" specialization for list-view types [arrow]

Reply via email to