Also see this related discussion, which petered out: https://issues.apache.org/jira/browse/ARROW-12873
On Mon, May 9, 2022, at 15:40, Weston Pace wrote:
> Any kind of "batch-level" information is a little tricky in the
> execution engine because nodes are free to chop up and recombine
> batches as they see fit. For example, the output of a join node is
> going to contain data from at least two different input batches. Even
> nodes with a single input and single output could be splitting batches
> into smaller work items or accumulating batches into larger work
> items. A few thoughts come to mind:
>
> Does the existing filter "guarantee" mechanism work for you? An
> expression which is guaranteed to be true can be attached to each
> batch. The filter node uses this expression to simplify the filter it
> needs to apply. For example, if your custom scanner determines that
> `x > 50` is always true, then that can be attached as a guarantee.
> Later, if you need to apply the filter `x < 30`, the filter node knows
> it can exclude the entire batch based on the guarantee. However, the
> guarantee suffers from the "batch-level" problems described above
> (e.g. a join node will not include guarantees in its output).
>
> Can you attach your metadata as an actual column using a scalar? This
> is what we do with the __filename column today.
>
> On Mon, May 9, 2022 at 5:24 AM Yaron Gvili <rt...@hotmail.com> wrote:
>>
>> Hi Yue,
>>
>> From my limited experience with the execution engine, my
>> understanding is that the API allows streaming only an ExecBatch from
>> one node to another. A possible solution is to derive from ExecBatch
>> your own class, say RichExecBatch, that carries any extra metadata
>> you want. If, in your execution plan, each node that expects to
>> receive a RichExecBatch gets it directly from a sending node that
>> makes it (both of which you could implement), then I think this could
>> work and may be enough for your use case.
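Yaron's RichExecBatch idea can be sketched with simplified stand-in types. These are not the real arrow::compute classes; in particular, the real ExecBatch is a plain struct with no virtual methods, so `dynamic_cast` would not work on it as-is, and all names below are invented for the sketch:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-ins for the arrow::compute types, only to illustrate
// the shape of the idea; the real ExecBatch (arrow/compute/exec.h) holds
// a std::vector<Datum> and is not polymorphic, so a real implementation
// would need another way to tag batches.
struct ExecBatch {
  std::vector<int64_t> values;  // placeholder for std::vector<arrow::Datum>
  int64_t length = 0;
  virtual ~ExecBatch() = default;
};

// Derived batch carrying extra per-batch metadata.
struct RichExecBatch : ExecBatch {
  bool additional_filtering_required = false;
};

// A receiving node downcasts to recover the metadata; if an intermediate
// node re-created the batch as a plain ExecBatch, the metadata is lost,
// so fall back to a conservative default.
bool NeedsExtraFiltering(const ExecBatch& batch) {
  if (const auto* rich = dynamic_cast<const RichExecBatch*>(&batch)) {
    return rich->additional_filtering_required;
  }
  return true;  // metadata dropped somewhere upstream: filter anyway
}
```

Note the conservative default: it makes the plan correct (if slower) even when an intermediate node strips the metadata.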
>> However, note that when there are intermediate nodes in between such
>> sending and receiving nodes, this may well break, because an
>> intermediate node could output a fresh ExecBatch even when receiving
>> a RichExecBatch as input, as filter_node does [1], for example.
>>
>> [1] https://github.com/apache/arrow/blob/35119f29b0e0de68b1ccc5f2066e0cc7d27fddd0/cpp/src/arrow/compute/exec/filter_node.cc#L98
>>
>> Yaron.
>>
>> ________________________________
>> From: Yue Ni <niyue....@gmail.com>
>> Sent: Monday, May 9, 2022 10:28 AM
>> To: dev@arrow.apache.org <dev@arrow.apache.org>
>> Subject: ExecBatch in arrow execution engine
>>
>> Hi there,
>>
>> I would like to use the Apache Arrow execution engine for some
>> computation. I found that `ExecBatch`, rather than `RecordBatch`, is
>> used by the execution engine's nodes, and I wonder how I can attach
>> additional information, such as schema/metadata, to an `ExecBatch`
>> during execution so that it can be used by a custom ExecNode.
>>
>> In my first use case, the computation flow looks like this:
>>
>> scanner <===> custom filter node <===> query client
>>
>> 1) The scanner is a custom scanner that loads data from disk. It
>> accepts a pushed-down custom filter expression (not an Arrow filter
>> expression but a homebrewed one) and uses it to avoid loading data
>> from disk as much as possible, but it may return a superset of the
>> matching data to successor nodes because of the limited capability of
>> the pushed-down filter.
>>
>> 2) Its successor is a filter node, which does additional filtering if
>> needed. The scanner knows whether a retrieved result batch needs
>> additional filtering, and I would like the scanner to pass some
>> batch-specific metadata, such as "additional_filtering_required:
>> true/false", along with the batch to the filter node, but I cannot
>> figure out how this could be done with the `ExecBatch`.
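This per-batch flag is exactly the kind of thing Weston's "metadata as a scalar column" suggestion earlier in the thread could carry. A minimal sketch with stand-in types (not the real Arrow Datum/Scalar machinery; the column name and helper functions are illustrative only):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <variant>
#include <vector>

// Simplified model of attaching batch metadata as a scalar column. In
// real Arrow this would be a Datum holding a BooleanScalar placed in
// ExecBatch::values, analogous to the __filename augmented column.
struct Batch {
  int64_t length = 0;
  // Each "column" is either per-row data or one scalar broadcast to all rows.
  std::map<std::string, std::variant<std::vector<double>, bool>> columns;
};

// The scanner attaches the flag as an extra constant column...
void AttachFilterFlag(Batch& b, bool additional_filtering_required) {
  b.columns["__additional_filtering_required"] = additional_filtering_required;
}

// ...and the downstream filter node reads it back, defaulting to true
// ("do the extra filtering") when the column is absent.
bool ReadFilterFlag(const Batch& b) {
  auto it = b.columns.find("__additional_filtering_required");
  if (it == b.columns.end()) return true;
  if (const bool* flag = std::get_if<bool>(&it->second)) return *flag;
  return true;
}
```

Because the flag travels as an ordinary column, it survives nodes that merely pass columns through, though a node that drops unknown columns would still lose it.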
>> In my other use case, I would like to attach a batch-specific schema
>> to each batch returned by some nodes.
>>
>> Basically, I wonder whether, within the current framework, there is
>> any way I could attach additional execution metadata/schema to the
>> `ExecBatch` so that it could be used by a custom exec node. Could you
>> please help? Thanks.
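The batch-level "guarantee" mechanism from Weston's reply can be illustrated with a deliberately simplified, self-contained model. In real Arrow the guarantee is an arrow::compute::Expression carried with the batch and simplified against the filter; here it is reduced to a single lower bound on x, mirroring the `x > 50` / `x < 30` example from the thread, with all names invented for the sketch:

```cpp
#include <cassert>
#include <limits>

struct Guarantee {
  // Every row in the batch is guaranteed to satisfy x > x_min.
  double x_min = -std::numeric_limits<double>::infinity();
};

struct Filter {
  // The filter keeps only rows with x < x_max.
  double x_max = std::numeric_limits<double>::infinity();
};

// If the guarantee says x > 50 and the filter wants x < 30, no row can
// possibly pass, so the filter node can exclude the whole batch without
// reading any of its rows.
bool CanSkipBatch(const Guarantee& g, const Filter& f) {
  return f.x_max <= g.x_min;
}
```

As Weston notes, the limitation is that this is batch-level information: a node that recombines batches (such as a join) will not propagate the guarantee to its output.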