I've opened a JIRA to track potential improvements to address this:
https://issues.apache.org/jira/browse/ARROW-18063
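
For readers following the thread, here is a minimal sketch of the "load a shared object, then call its initializer once" idea discussed in the quoted messages. It is not how PyArrow does it (PyArrow calls the datasets Initialize method from Cython in "_exec_plan.pyx"); it swaps in Python's standard ctypes module instead. The names register_custom_nodes and my_nodes_initialize and the library path are hypothetical, and the sketch assumes the shared object exports an extern "C" wrapper of the form "int f(void)" that returns 0 on success.

```python
import ctypes
import threading

_lock = threading.Lock()
_loaded = {}  # lib_path -> CDLL handle, kept alive for the process lifetime


def register_custom_nodes(lib_path, init_symbol="my_nodes_initialize"):
    """Load a shared object and call its initializer at most once.

    Assumption: `init_symbol` names a C-linkage function `int f(void)`
    that registers the custom ExecNode factories and returns 0 on success.
    Returns True if the initializer ran, False if it had already run.
    """
    with _lock:
        if lib_path in _loaded:
            return False
        # RTLD_GLOBAL makes the library's symbols visible to libraries
        # loaded afterwards.
        lib = ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
        init = getattr(lib, init_symbol)  # AttributeError if not exported
        init.argtypes = []
        init.restype = ctypes.c_int
        status = init()
        if status != 0:
            raise RuntimeError(f"{init_symbol} failed with status {status}")
        _loaded[lib_path] = lib
        return True
```

Calling register_custom_nodes("/path/to/libmy_nodes.so") at the top of a small Python module would make a plain import of that module trigger the registration, which is the arrangement Yaron suggests.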

On Wed, Sep 28, 2022 at 1:42 PM Li Jin <[email protected]> wrote:

> Thanks Yaron - I have figured something out. I have created an internal
> C++ codebase that exposes the "Initialize" method, and an internal Python
> codebase that invokes it via Python/C++ bindings, similar to how dataset
> does it.
>
> On Wed, Sep 28, 2022 at 1:02 PM Yaron Gvili <[email protected]> wrote:
>
> > I agree with Weston about dynamically loading a shared object with
> > initialization code for registering node factories. For custom node
> > factories, I think this loading would best be done from a separate Python
> > module, different from "_exec_plan.pyx", that the user would need to
> > import to trigger the registration (once). This would avoid merging
> > custom code into "_exec_plan.pyx" and maintaining it. You would likely
> > want to code up files for your module that are analogous to
> > "python/pyarrow/includes/libarrow_dataset.pxd",
> > "python/pyarrow/_dataset.pxd", and "python/pyarrow/dataset.py". You would
> > need to modify the files "python/setup.py" and "python/CMakeLists.txt" in
> > order to build your module within PyArrow's build, or alternatively
> > roll your own version of these files to build your Python module
> > separately. This is where you would add a build flag for pulling in C++
> > header files for your Python module, under "python/pyarrow/include", and
> > for building it.
> >
> >
> > Yaron.
> > ________________________________
> > From: Li Jin <[email protected]>
> > Sent: Wednesday, September 21, 2022 3:51 PM
> > To: [email protected] <[email protected]>
> > Subject: Re: Register custom ExecNode factories
> >
> > Thanks Weston - I have not written a Python/C++ bridge before, so this is
> > also new to me, and I am hoping to get some information from people who
> > know how to do this.
> >
> > I will leave this open for other people to offer help :) and will ask
> > some internal folks as well.
> >
> > Will circle back on this.
> >
> > On Tue, Sep 20, 2022 at 8:50 PM Weston Pace <[email protected]> wrote:
> >
> > > I'm not great at this build stuff but I think the basic idea is that
> > > you will need to package your custom nodes into a shared object.
> > > You'll need to then somehow trigger that shared object to load from
> > > Python.  That load seems like a good place to invoke the initialize
> > > method.
> > >
> > > Currently pyarrow has to do this because the datasets module
> > > (libarrow_dataset.so) adds some custom nodes (scan node, dataset write
> > > node).  The datasets module defines the Initialize method.  This
> > > method is called in _exec_plan.pyx when the Python module is loaded.
> > > I don't know Cython well enough to know how exactly it triggers the
> > > datasets shared object to load.
> > >
> > > On Tue, Sep 20, 2022 at 11:01 AM Li Jin <[email protected]> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I have recently been working on adding a custom data source node to
> > > > Acero and was pointed to a few examples in the dataset code.
> > > >
> > > > If I understand this correctly, the dataset exec nodes are currently
> > > > registered when this is loaded:
> > > >
> > > > https://github.com/apache/arrow/blob/master/python/pyarrow/_exec_plan.pyx#L36
> > > >
> > > > If I have a custom "Initialize" method that registers additional
> > > > ExecNode factories, where is the right place to invoke such
> > > > initialization? Eventually I want to execute my query via
> > > > ibis-substrait and the Acero Substrait consumer Python API.
> > > >
> > > > Thanks,
> > > > Li
> > >
> >
>
