Also see this related discussion, which petered out: https://issues.apache.org/jira/browse/ARROW-12873
On Mon, May 9, 2022, at 15:40, Weston Pace wrote:
> Any kind of "batch-level" information is a little tricky in the
> execution engine because nodes are free to chop up and recombine
> batches as they see fit. For example, the output of a join node is
> going to contain data from at least two different input batches. Even
> nodes with a single input and single output could be splitting batches
> into smaller work items or accumulating batches into larger work
> items. A few thoughts come to mind:
>
> Does the existing filter "guarantee" mechanism work for you? An
> expression which is guaranteed to be true can be attached to each
> batch. The filter node uses this expression to simplify the filter it
> needs to apply. For example, if your custom scanner determines that
> `x > 50` is always true, then that can be attached as a guarantee.
> Later, if you need to apply the filter `x < 30`, the filter node knows
> it can exclude the entire batch based on the guarantee. However, the
> guarantee suffers from the "batch-level" problems described above
> (e.g. a join node will not include guarantees in its output).
>
> Can you attach your metadata as an actual column using a scalar? This
> is what we do with the __filename column today.
>
> On Mon, May 9, 2022 at 5:24 AM Yaron Gvili <rt...@hotmail.com> wrote:
>>
>> Hi Yue,
>>
>> From my limited experience with the execution engine, my
>> understanding is that the API allows streaming only an ExecBatch from
>> one node to another. A possible solution is to derive from ExecBatch
>> your own class, say RichExecBatch, that carries any extra metadata
>> you want. If, in your execution plan, each node that expects to
>> receive a RichExecBatch gets it directly from a sending node that
>> makes it (both of which you could implement), then I think this could
>> work and may be enough for your use case.
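Yaron's RichExecBatch idea can be sketched with simplified stand-in types. These are not the real arrow::compute classes; in particular, the real ExecBatch is a plain struct with no virtual methods, so `dynamic_cast` would not work on it as-is, and all names below are invented for the sketch:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-ins for the arrow::compute types, only to illustrate
// the shape of the idea; the real ExecBatch (arrow/compute/exec.h) holds
// a std::vector<Datum> and is not polymorphic, so a real implementation
// would need another way to tag batches.
struct ExecBatch {
  std::vector<int64_t> values;  // placeholder for std::vector<arrow::Datum>
  int64_t length = 0;
  virtual ~ExecBatch() = default;
};

// Derived batch carrying extra per-batch metadata.
struct RichExecBatch : ExecBatch {
  bool additional_filtering_required = false;
};

// A receiving node downcasts to recover the metadata; if an intermediate
// node re-created the batch as a plain ExecBatch, the metadata is lost,
// so fall back to a conservative default.
bool NeedsExtraFiltering(const ExecBatch& batch) {
  if (const auto* rich = dynamic_cast<const RichExecBatch*>(&batch)) {
    return rich->additional_filtering_required;
  }
  return true;  // metadata dropped somewhere upstream: filter anyway
}
```

Note the conservative default: it makes the plan correct (if slower) even when an intermediate node strips the metadata.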
>> However, note that when there are intermediate nodes in between such
>> sending and receiving nodes, this may well break, because an
>> intermediate node could output a fresh ExecBatch even when receiving
>> a RichExecBatch as input, as filter_node does [1], for example.
>>
>> [1] https://github.com/apache/arrow/blob/35119f29b0e0de68b1ccc5f2066e0cc7d27fddd0/cpp/src/arrow/compute/exec/filter_node.cc#L98
>>
>> Yaron.
>>
>> ________________________________
>> From: Yue Ni <niyue....@gmail.com>
>> Sent: Monday, May 9, 2022 10:28 AM
>> To: dev@arrow.apache.org <dev@arrow.apache.org>
>> Subject: ExecBatch in arrow execution engine
>>
>> Hi there,
>>
>> I would like to use the Apache Arrow execution engine for some
>> computation. I found that `ExecBatch`, rather than `RecordBatch`, is
>> used by the execution engine's nodes, and I wonder how I can attach
>> additional information, such as schema/metadata, to an `ExecBatch`
>> during execution so that it can be used by a custom ExecNode.
>>
>> In my first use case, the computation flow looks like this:
>>
>> scanner <===> custom filter node <===> query client
>>
>> 1) The scanner is a custom scanner that loads data from disk. It
>> accepts a pushed-down custom filter expression (not an Arrow filter
>> expression but a homebrewed one) and uses it to avoid loading data
>> from disk as much as possible, but it may return a superset of the
>> matching data to successor nodes because of the limited capability of
>> the pushed-down filter.
>>
>> 2) Its successor is a filter node, which does additional filtering if
>> needed. The scanner knows whether a retrieved result batch needs
>> additional filtering, and I would like the scanner to pass some
>> batch-specific metadata, such as "additional_filtering_required:
>> true/false", along with the batch to the filter node, but I cannot
>> figure out how this could be done with the `ExecBatch`.
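This per-batch flag is exactly the kind of thing Weston's "metadata as a scalar column" suggestion earlier in the thread could carry. A minimal sketch with stand-in types (not the real Arrow Datum/Scalar machinery; the column name and helper functions are illustrative only):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <variant>
#include <vector>

// Simplified model of attaching batch metadata as a scalar column. In
// real Arrow this would be a Datum holding a BooleanScalar placed in
// ExecBatch::values, analogous to the __filename augmented column.
struct Batch {
  int64_t length = 0;
  // Each "column" is either per-row data or one scalar broadcast to all rows.
  std::map<std::string, std::variant<std::vector<double>, bool>> columns;
};

// The scanner attaches the flag as an extra constant column...
void AttachFilterFlag(Batch& b, bool additional_filtering_required) {
  b.columns["__additional_filtering_required"] = additional_filtering_required;
}

// ...and the downstream filter node reads it back, defaulting to true
// ("do the extra filtering") when the column is absent.
bool ReadFilterFlag(const Batch& b) {
  auto it = b.columns.find("__additional_filtering_required");
  if (it == b.columns.end()) return true;
  if (const bool* flag = std::get_if<bool>(&it->second)) return *flag;
  return true;
}
```

Because the flag travels as an ordinary column, it survives nodes that merely pass columns through, though a node that drops unknown columns would still lose it.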
>> In my other use case, I would like to attach a batch-specific schema
>> to each batch returned by some nodes.
>>
>> Basically, I wonder whether, within the current framework, there is
>> any way I could attach additional execution metadata/schema to the
>> `ExecBatch` so that it could be used by a custom exec node. Could you
>> please help? Thanks.
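The batch-level "guarantee" mechanism from Weston's reply can be illustrated with a deliberately simplified, self-contained model. In real Arrow the guarantee is an arrow::compute::Expression carried with the batch and simplified against the filter; here it is reduced to a single lower bound on x, mirroring the `x > 50` / `x < 30` example from the thread, with all names invented for the sketch:

```cpp
#include <cassert>
#include <limits>

struct Guarantee {
  // Every row in the batch is guaranteed to satisfy x > x_min.
  double x_min = -std::numeric_limits<double>::infinity();
};

struct Filter {
  // The filter keeps only rows with x < x_max.
  double x_max = std::numeric_limits<double>::infinity();
};

// If the guarantee says x > 50 and the filter wants x < 30, no row can
// possibly pass, so the filter node can exclude the whole batch without
// reading any of its rows.
bool CanSkipBatch(const Guarantee& g, const Filter& f) {
  return f.x_max <= g.x_min;
}
```

As Weston notes, the limitation is that this is batch-level information: a node that recombines batches (such as a join) will not propagate the guarantee to its output.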