Thanks Yaron - I have figured something out. I have created an internal C++ codebase that exposes an "Initialize" method, and an internal Python codebase that invokes it via Python/C++ bindings, similar to how the dataset module does it.
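
Roughly, the Cython side of my internal module looks like the sketch below. This is only a sketch: the header path, namespace, module name, and the void/except+ signature are placeholders for my internal code, not anything that exists in PyArrow.

# _my_nodes.pyx -- minimal sketch of the internal binding; "my_project" names
# below are placeholders, not PyArrow files.
# distutils: language = c++

cdef extern from "my_project/exec_nodes.h" namespace "my_project" nogil:
    void Initialize() except +

# Mirroring pyarrow/_exec_plan.pyx: calling Initialize() at import time forces
# the shared object to load and registers the custom ExecNode factories once.
Initialize()

Importing this module once from Python is what triggers the registration, which matches the approach described in the quoted thread below.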
On Wed, Sep 28, 2022 at 1:02 PM Yaron Gvili <[email protected]> wrote:

> I agree with Weston about dynamically loading a shared object with
> initialization code for registering node factories. For custom node
> factories, I think this loading would best be done from a separate Python
> module, different than "_exec_plan.pyx", that the user would need to import
> for triggering (once) the registration. This would avoid merging custom
> code into "_exec_plan.pyx" and maintaining it. You would likely want to
> code up files for your module that are analogous to
> "python/pyarrow/includes/libarrow_dataset.pxd",
> "python/pyarrow/_dataset.pxd", and "python/pyarrow/dataset.py". You would
> need to modify the files "python/setup.py" and "python/CMakeLists.txt" in
> order to build your module within PyArrow's build, or alternatively to roll
> your own version of these files to build your Python module separately.
> This is where you would add a build flag for pulling in C++ header files
> for your Python module, under "python/pyarrow/include", and for making it.
>
>
> Yaron.
> ________________________________
> From: Li Jin <[email protected]>
> Sent: Wednesday, September 21, 2022 3:51 PM
> To: [email protected] <[email protected]>
> Subject: Re: Register custom ExecNode factories
>
> Thanks Weston - I have not rewritten Python/C++ bridge so this is also new
> to me and I am hoping to get some information from people that know how to
> do this.
>
> I will leave this open for other people to offer help :) and will ask some
> internal folks as well.
>
> Will circle back on this.
>
> On Tue, Sep 20, 2022 at 8:50 PM Weston Pace <[email protected]> wrote:
>
> > I'm not great at this build stuff but I think the basic idea is that
> > you will need to package your custom nodes into a shared object.
> > You'll need to then somehow trigger that shared object to load from
> > python. This seems like a good place to invoke the initialize method.
> >
> > Currently pyarrow has to do this because the datasets module
> > (libarrow_dataset.so) adds some custom nodes (scan node, dataset write
> > node). The datasets module defines the Initialize method. This
> > method is called in _exec_plan.pyx when the python module is loaded.
> > I don't know cython well enough to know how exactly it triggers the
> > datasets shared object to load.
> >
> > On Tue, Sep 20, 2022 at 11:01 AM Li Jin <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > Recently I am working on adding a custom data source node to Acero and
> > > was pointed to a few examples in the dataset code.
> > >
> > > If I understand this correctly, the registering of dataset exec node is
> > > currently happening when this is loaded:
> > >
> > > https://github.com/apache/arrow/blob/master/python/pyarrow/_exec_plan.pyx#L36
> > >
> > > I wonder if I have a custom "Initialize" method that registers
> > > additional ExecNode, where is the right place to invoke such
> > > initialization? Eventually I want to execute my query via ibis-substrait
> > > and Acero substrait consumer Python API.
> > >
> > > Thanks,
> > > Li
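
To close the loop on the original question, the end-to-end flow I am aiming for is roughly the following. Only pyarrow is real here; "my_nodes" stands in for the internal module described above, and the Substrait consumer call is left commented since its exact API may vary by Arrow version.

# end-to-end sketch; "my_nodes" is a placeholder for the internal module above
import my_nodes                    # import runs Initialize() and registers the custom ExecNode
import pyarrow.substrait as substrait

# A Substrait plan produced by ibis-substrait can then be handed to Acero's
# Substrait consumer, which can resolve the registered custom node, e.g.:
# reader = substrait.run_query(serialized_plan)
# table = reader.read_all()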
