Hi there,

I am evaluating the Apache Arrow C++ compute engine for my project, and I
wonder what schema assumptions the execution operators in the compute
engine make.

In my use case, the record batches flowing through a computation may have
different schemas. I read the Apache Arrow Query Engine for C++ design doc
(https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4),
and for the `Scan` operator it says: "The consumer of a Scan does not need
to know how it is implemented, only that a uniform API is provided to
obtain the next RecordBatch with a known schema." I interpret this as: the
`Scan` operator may produce multiple RecordBatches, each of which has a
known schema, but the next batch's schema could differ from the previous
batch's. Is this understanding correct?

I also read Arrow's source code; in `exec_plan.h`:
```
class ARROW_EXPORT ExecNode {
  ...
  /// The datatypes for batches produced by this node
  const std::shared_ptr<Schema>& output_schema() const { return output_schema_; }
  ...
```
It looks like each `ExecNode` must provide an `output_schema`. Is an
`ExecNode` allowed to return an `output_schema` that changes during its
execution? If I want to implement an execution node that produces multiple
batches with potentially different schemas, is that feasible within the
Arrow C++ compute engine API framework? Thanks.

Regards,
Yue
