[jira] [Commented] (ARROW-12873) [C++][Compute] Support tagging ExecBatches with arbitrary extra information

Weston Pace (Jira) Fri, 06 Aug 2021 15:25:06 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-12873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395006#comment-17395006
 ]


Weston Pace commented on ARROW-12873:
-------------------------------------

I thought the original proposal was tagging record batches with arbitrary void* 
pointers.  It's possible I'm not explaining myself well.

If you'll allow some pseudocode here to avoid the complexity of exec plan...

What we have today is:
{code:python}
exec_batch_with_order_at_back = order_by_node(in_batch)
grouped_output = group_by_node(exec_batch_with_order_at_back,
                               kernel_that_can_use_order)

def group_by_node(batch, agg_kernel):
  group_ids = grouper(batch)
  mashed_together_batch = {group_ids, batch}
  if can_use_order(agg_kernel):
    agg_kernel(mashed_together_batch, mashed_together_batch[-1])
  else:
    agg_kernel(mashed_together_batch)
{code}

I'm proposing (and this may not make any sense at all):
{code:python}
exec_batch, order = order_by_node(in_batch)
grouped_output = group_by_node(exec_batch, kernel_that_can_use_order,
                               extra_inputs=[order])

def group_by_node(batch, agg_kernel, extra_inputs=[]):
  group_ids = grouper(batch)
  mashed_together_batch = {group_ids, batch}
  agg_kernel(mashed_together_batch, *extra_inputs)
{code}

The kernels still need different arities, which I think is OK, but 
group_by_node no longer has to branch on can_use_order.
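
To make that concrete, here is a minimal runnable sketch of the same idea. It 
assumes plain Python lists stand in for ExecBatches, and every name in it 
(grouper, sum_kernel, first_kernel) is a hypothetical stand-in rather than an 
Arrow API. It is only meant to show that kernels of different arities can 
share one group_by_node without a can_use_order branch.

```python
# Toy stand-ins only: lists play the role of ExecBatches, and none of these
# names are Arrow APIs.

def order_by_node(in_batch):
    """Return the batch unchanged plus a separate ordering (argsort indices)."""
    order = sorted(range(len(in_batch)), key=lambda i: in_batch[i])
    return in_batch, order

def grouper(batch):
    """Hypothetical grouper: the group id is the parity of each value."""
    return [value % 2 for value in batch]

def sum_kernel(mashed_together_batch):
    """Arity-1 kernel: sums each group and ignores any ordering."""
    group_ids, values = mashed_together_batch
    totals = {}
    for group_id, value in zip(group_ids, values):
        totals[group_id] = totals.get(group_id, 0) + value
    return totals

def first_kernel(mashed_together_batch, order):
    """Arity-2 kernel: uses the ordering to pick the first value per group."""
    group_ids, values = mashed_together_batch
    firsts = {}
    for i in order:
        firsts.setdefault(group_ids[i], values[i])
    return firsts

def group_by_node(batch, agg_kernel, extra_inputs=()):
    group_ids = grouper(batch)
    mashed_together_batch = (group_ids, batch)
    # No branching here: whatever extras the caller supplies are forwarded.
    return agg_kernel(mashed_together_batch, *extra_inputs)

exec_batch, order = order_by_node([5, 2, 4, 1])
print(group_by_node(exec_batch, sum_kernel))
print(group_by_node(exec_batch, first_kernel, extra_inputs=[order]))
```

The caller, not group_by_node, decides which extra inputs a kernel receives, 
so a mismatch shows up as an ordinary arity error at the call.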

Also, is there a reason (there probably is) we don't require aggregate kernels 
to have arity 2 or more? For example:
{code:python}
def group_by_node(batch, agg_kernel, extra_inputs=[]):
  group_ids = grouper(batch)
  agg_kernel(batch, group_ids, *extra_inputs)
{code}
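
A similarly hedged sketch of the 2+ arity variant, again with lists standing 
in for ExecBatches and with hypothetical kernel names (count_kernel, 
last_kernel) that are not part of Arrow. Every kernel takes the batch and the 
group ids as separate arguments, and any extras follow:

```python
# Toy stand-ins only; none of these names are Arrow APIs.

def grouper(batch):
    """Hypothetical grouper: the group id is the parity of each value."""
    return [value % 2 for value in batch]

def count_kernel(batch, group_ids):
    """Arity-2 kernel: counts the rows in each group."""
    counts = {}
    for group_id in group_ids:
        counts[group_id] = counts.get(group_id, 0) + 1
    return counts

def last_kernel(batch, group_ids, order):
    """Arity-3 kernel: uses the ordering to pick the last value per group."""
    lasts = {}
    for i in order:
        lasts[group_ids[i]] = batch[i]  # later rows in the order overwrite
    return lasts

def group_by_node(batch, agg_kernel, extra_inputs=()):
    group_ids = grouper(batch)
    # The batch and the group ids are never mashed together.
    return agg_kernel(batch, group_ids, *extra_inputs)

batch = [5, 2, 4, 1]
order = sorted(range(len(batch)), key=lambda i: batch[i])
print(group_by_node(batch, count_kernel))
print(group_by_node(batch, last_kernel, extra_inputs=[order]))
```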

> [C++][Compute] Support tagging ExecBatches with arbitrary extra information
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-12873
>                 URL: https://issues.apache.org/jira/browse/ARROW-12873
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Ben Kietzman
>            Priority: Major
>
> Ideally, ExecBatches could be tagged with arbitrary optional objects for 
> tracing purposes and to transmit execution hints from one ExecNode to another.
> These should *not* be explicit members like ExecBatch::selection_vector is, 
> since they may not originate from the arrow library. For an example within 
> the arrow project: {{libarrow_dataset}} will be used to produce ScanNodes and 
> WriteNodes and it's useful to tag scanned batches with their {{Fragment}} 
> of origin. However adding {{ExecBatch::fragment}} would result in a cyclic 
> dependency.
> To facilitate this tagging capability, we would need a type-erased container, 
> something like
> {code}
> struct AnySet {
>   void* Get(tag_t tag);
>   void Set(tag_t tag, void* value, FnOnce<void(void*)> destructor);
> };
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
