Thanks Yaron - I have figured something out. I have created an internal C++
codebase that exposes the "Initialize" method, and an internal Python
codebase that invokes it via Python/C++ bindings, similar to how dataset
does it.
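
A minimal sketch of the Python side of this pattern, under stated
assumptions: the custom nodes are packaged in a shared object that exports a
zero-argument "Initialize" function with C linkage (extern "C"), and that
library links against the same libarrow that pyarrow loads, so the factories
should end up in the same default registry. The function name
register_custom_nodes, the "loader" parameter, and the library name in the
usage note are hypothetical and not part of PyArrow. It uses ctypes rather
than Cython bindings only to keep the example short.

```python
import ctypes
import threading

_lock = threading.Lock()
_loaded = {}  # lib_path -> library handle, kept alive for the process lifetime


def register_custom_nodes(lib_path, symbol="Initialize", loader=ctypes.CDLL):
    """Load the shared object at lib_path and call its init function once.

    Returns True if the init function was called by this invocation, and
    False if the library had already been initialized through this helper.
    `loader` defaults to ctypes.CDLL and is injectable (an assumption made
    here purely so the helper can be exercised without a real library).
    """
    with _lock:
        if lib_path in _loaded:
            return False  # already registered; do not call Initialize twice
        lib = loader(lib_path)       # triggers loading of the shared object
        init = getattr(lib, symbol)  # look up the exported init function
        init.restype = None          # assumed signature: void Initialize()
        init.argtypes = []
        init()
        _loaded[lib_path] = lib
        return True
```

A small wrapper module could call register_custom_nodes("libmy_nodes.so")
at import time (the library name is a placeholder), so that importing the
module once is enough to trigger the registration.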

On Wed, Sep 28, 2022 at 1:02 PM Yaron Gvili <[email protected]> wrote:

> I agree with Weston about dynamically loading a shared object with
> initialization code for registering node factories. For custom node
> factories, I think this loading would best be done from a separate Python
> module, different than "_exec_plan.pyx", that the user would need to import
> for triggering (once) the registration. This would avoid merging custom
> code into "_exec_plan.pyx" and maintaining it. You would likely want to
> code up files for your module that are analogous to
> "python/pyarrow/includes/libarrow_dataset.pxd",
> "python/pyarrow/_dataset.pxd", and "python/pyarrow/dataset.py". You would
> need to modify the files "python/setup.py" and "python/CMakeLists.txt" in
> order to build your module within PyArrow's build, or alternatively to roll
> your own version of these files to build your Python module separately.
> This is where you would add a build flag for pulling in C++ header files
> for your Python module, under "python/pyarrow/include", and for building it.
>
>
> Yaron.
> ________________________________
> From: Li Jin <[email protected]>
> Sent: Wednesday, September 21, 2022 3:51 PM
> To: [email protected] <[email protected]>
> Subject: Re: Register custom ExecNode factories
>
> Thanks Weston - I have not written a Python/C++ bridge before, so this is
> also new to me, and I am hoping to get some information from people who
> know how to do this.
>
> I will leave this open for other people to offer help :) and will ask some
> internal folks as well.
>
> Will circle back on this.
>
> On Tue, Sep 20, 2022 at 8:50 PM Weston Pace <[email protected]> wrote:
>
> > I'm not great at this build stuff but I think the basic idea is that
> > you will need to package your custom nodes into a shared object.
> > You'll need to then somehow trigger that shared object to load from
> > python.  This seems like a good place to invoke the initialize method.
> >
> > Currently pyarrow has to do this because the datasets module
> > (libarrow_dataset.so) adds some custom nodes (scan node, dataset write
> > node).  The datasets module defines the Initialize method.  This
> > method is called in _exec_plan.pyx when the python module is loaded.
> > I don't know cython well enough to know how exactly it triggers the
> > datasets shared object to load.
> >
> > On Tue, Sep 20, 2022 at 11:01 AM Li Jin <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > I have recently been working on adding a custom data source node to
> > > Acero and was pointed to a few examples in the dataset code.
> > >
> > > If I understand this correctly, the dataset exec nodes are currently
> > > registered when this is loaded:
> > >
> > > https://github.com/apache/arrow/blob/master/python/pyarrow/_exec_plan.pyx#L36
> > >
> > > If I have a custom "Initialize" method that registers additional
> > > ExecNodes, where is the right place to invoke such initialization?
> > > Eventually I want to execute my query via ibis-substrait and Acero
> > > substrait consumer Python API.
> > >
> > > Thanks,
> > > Li
> >
>
