> I may be assuming here but I think your problem is more that there is
no way to more flexibly describe a source in python and less that you
need to change the default.

Correct.

> For example, if you could do something like this (in pyarrow) would it
work?
I could try to see if that works. I feel registering the extension in C++
via one initialization seems cleaner to me because there are many other
extension points that we initialize (add registering in the
default_exec_factory_registry
similar to
https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/plan.cc#L30).
Having some of the initialization in Pyarrow and some in Acero is a bit
complicated IMO.

Maybe the proper long term fix is to add something similar to
compute::default_exec_factory_registry for all extension points (named
table, substrait extension message) and the user can register custom stuff
similar to how they can register via default_exec_factory_registry?)

On Thu, Oct 13, 2022 at 12:41 PM Weston Pace <weston.p...@gmail.com> wrote:

> > Does that sound like a reasonable way to do this?
>
> It's not ideal.
>
> I may be assuming here but I think your problem is more that there is
> no way to more flexibly describe a source in python and less that you
> need to change the default.
>
> For example, if you could do something like this (in pyarrow) would it
> work?
>
> ```
> def custom_source(endpoint):
>   return pc.Declaration("my_custom_source", create_my_custom_options())
>
> def table_provider(names):
>   return custom_sources[names[0]]
>
> pa.substrait.run_query(my_plan, table_provider=table_provider)
> ```
>
> On Thu, Oct 13, 2022 at 8:24 AM Li Jin <ice.xell...@gmail.com> wrote:
> >
> > We did some work around this recently and think there needs to be some
> > small change to allow users to override this default provider. I will
> > explain in more details:
> >
> > (1) Since the variable is defined as static in the substrait/options.h
> > file, each translation unit will have a separate copy of the
> > kDefaultNamedTableProvider
> > variable. And therefore, the user cannot really change the default that
> is
> > used here:
> >
> https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125
> >
> > In order to allow user to override the kDefaultNamedTableProvider (and
> > change the behavior of
> >
> https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125
> > to use a custom NamedTableProvider), we need to
> > (1) in substrait/options.hh, change the definition of
> > kDefaultNamedTableProvider to be an extern declaration
> > (2) move the definition of kDefaultNamedTableProvider to an
> > substrait/options.cc file
> >
> > We are still testing this but based on my limited C++ knowledge, I
> > think this would allow users to do
> > """
> > #include "arrow/engine/substrait/options.h"
> >
> > void initialize() {
> >     arrow::engine::kDefaultNamedTableProvider =
> > some_custom_name_table_provider;
> > }
> > """
> > And then calling `pa.substrat.run_query" should pick up the custom name
> > table provider.
> >
> > Does that sound like a reasonable way to do this?
> >
> >
> >
> >
> > On Tue, Sep 27, 2022 at 1:59 PM Li Jin <ice.xell...@gmail.com> wrote:
> >
> > > Thanks both. I think NamedTableProvider is close to what I want, and
> like
> > > Weston said, the tricky bit is how to use a custom NamedTableProvider
> when
> > > calling the pyarrow substrait API.
> > >
> > > It's a little hacky but I *think* I can override the value
> "kDefaultNamedTableProvider"
> > > here and pass "table_provider=None" then it "should" work:
> > >
> > >
> https://github.com/apache/arrow/blob/529f653dfa58887522af06028e5c32e8dd1a14ea/cpp/src/arrow/engine/substrait/options.h#L66
> > >
> > > I am going to give that a shot once I pull/build Arrow default into our
> > > internal build system.
> > >
> > >
> > >
> > >
> > > On Tue, Sep 27, 2022 at 10:50 AM Benjamin Kietzman <
> bengil...@gmail.com>
> > > wrote:
> > >
> > >> It seems to me that your use case could be handled by defining a
> custom
> > >> NamedTableProvider and
> > >> assigning this to ConversionOptions::named_table_provider. This was
> added
> > >> in
> > >> https://github.com/apache/arrow/pull/13613 to provide user
> configurable
> > >> dispatching for named tables;
> > >> if it doesn't address your use case then we might want to create a
> JIRA to
> > >> extend it.
> > >>
> > >> On Tue, Sep 27, 2022 at 10:41 AM Li Jin <ice.xell...@gmail.com>
> wrote:
> > >>
> > >> > I did some more digging into this and have some ideas -
> > >> >
> > >> > Currently, the logic for deserialization named table is:
> > >> >
> > >> >
> > >>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/engine/substrait/relation_internal.cc#L129
> > >> > and it will look up named tables from a user provided dictionary
> from
> > >> > string -> arrow Table.
> > >> >
> > >> > My idea is to make some short term changes to allow named tables to
> be
> > >> > dispatched differently (This logic can be reverted/removed once we
> > >> figure
> > >> > out the proper way to support custom data sources, perhaps via
> substrait
> > >> > Extensions.), specifically:
> > >> >
> > >> > (1) The user creates named table with uris for custom data source,
> i.e.,
> > >> > "my_datasource://tablename?begin=20200101&end=20210101"
> > >> > (2) In the substrait consumer, allowing user to register custom
> dispatch
> > >> > rules based on uri scheme (similar to how exec node registry works),
> > >> i.e.,
> > >> > sth like:
> > >> >
> > >> > substrait_named_table_registry.add("my_datasource",
> deser_my_datasource)
> > >> > and deser_my_datasource is a function that takes the NamedTable
> > >> substrait
> > >> > message and returns a declaration.
> > >> >
> > >> > I know doing this just for named tables might not be a very general
> > >> > solution but seems the easiest path forward, and we can always
> remove
> > >> this
> > >> > later in favor of a more generic solution.
> > >> >
> > >> > Thoughts?
> > >> >
> > >> > Li
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > On Mon, Sep 26, 2022 at 10:58 AM Li Jin <ice.xell...@gmail.com>
> wrote:
> > >> >
> > >> > > Hello!
> > >> > >
> > >> > > I am working on adding a custom data source node in Acero. I have
> a
> > >> few
> > >> > > previous threads related to this topic.
> > >> > >
> > >> > > Currently, I am able to register my custom factory method with
> Acero
> > >> and
> > >> > > create a Custom source node, i.e., I can register and execute this
> > >> with
> > >> > > Acero:
> > >> > >
> > >> > > MySourceNodeOptions source_options = ...
> > >> > > Declaration source{"my_source", source_option}
> > >> > >
> > >> > > The next step I want to do is to pass this through to the Acero
> > >> substrait
> > >> > > consumer. From previous discussions, I am going to use
> "NamedTable ''
> > >> as
> > >> > a
> > >> > > temporary way to define my custom data source in substrait. My
> > >> question
> > >> > is
> > >> > > this:
> > >> > >
> > >> > > What I need to do in substrait in order to register my own
> substrait
> > >> > > consumer rule/function for deserializing my custom named table
> > >> protobuf
> > >> > > message into the declaration above. If this is not supported right
> > >> now,
> > >> > > what is a reasonable/minimal change to make this work?
> > >> > >
> > >> > > Thanks,
> > >> > > Li
> > >> > >
> > >> >
> > >>
> > >
>

Reply via email to