jorisvandenbossche opened a new issue, #33976:
URL: https://github.com/apache/arrow/issues/33976

   The Acero query engine is one of the few parts of the Arrow C++ libraries 
that is not directly exposed in the pyarrow bindings. While you can already run 
queries using Acero through a substrait plan (`pyarrow.substrait.run_query`; 
but that requires first building that plan, and limits to what Acero supports 
of Substrait), and Acero is already used under the hood in some Dataset methods 
(`pyarrow.dataset.Dataset.join/sort_by`, but that doesn't currently allow 
building up a full query), there are some benefits of having direct, low-level 
bindings:
   
   * It would be useful for direct users (or downstream packages) of pyarrow 
that don't want / need to go through substrait (yet another thing to learn / 
implement)
   * It would make it easier for developers to test new Acero functionality 
from python
   * You can use features of Acero that are not (yet) available in substrait
   * It could (potentially) give an interface for people developing custom exec 
nodes
   * It would be useful for people that want to do things "right now" (e.g. 
12.0.0) instead of waiting for Substrait (tooling) to mature
   
   The idea would be to add a new `pyarrow.acero` module, with low level 
bindings for the Acero execution engine. The goal would _not_ be to build a 
complete user-friendly query interface (like ibis or dplyr), but provide rather 
direct access to build up a query with the Declaration interface. Minimum 
abilities would be to construct a plan, inspect plans created through other 
means (eg from substrait), and execute a plan (to_table, to_reader, write). An 
initial goal would be to be able to use this interface ourselves in the places 
that currently have custom cython usage of the exec plan such as Dataset.join 
(i.e. refactoring `_exec_plan.pyx`).
   
   Non-exhaustive list of subtasks:
   
   - [ ] Basic API to create a query plan / declaration (including the options 
classes)
   - [ ] Replace the internal usage of execplan in Dataset join/sort_by methods
   - [ ] Improve usage of compute functions with (field) expressions
   - [ ] Ability to use registered UDFs in the query (specify the function 
registry)
   - [ ] Basic bindings for ExecPlan object? (not create it directly, but be 
able to inspect it when created from declaration or through substrait)
   - [ ] (enable using custom exec nodes?)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to