Re: Substrait consumer for custom data sources

Li Jin Thu, 13 Oct 2022 15:03:18 -0700

After some struggling we finally managed to connect our internal data
source to Acero and executed a data load via pyarrow.substrait.run_query() !


We did end up temporarily modifying substrait/options.h source code locally
and made kDefaultNamedTableProvider extern/global.

But since this doesn't sound like the correct way, I am happy to do this
correctly but someone let me know the correct way :)

Li

On Thu, Oct 13, 2022 at 2:01 PM Li Jin <[email protected]> wrote:

> Going back to the default_exec_factory_registry idea, I think ultimately
> maybe we want registration API that looks like:
>
> """
> MetaRegistry* registry = compute::default_meta_registry();
> registry->RegisterNamedTableProvider(...);
> registry->exec_factory_registry()->AddFactory("my_custom_node",
> MakeMyCustomNode)
> ...
> """
>
> On Thu, Oct 13, 2022 at 1:32 PM Li Jin <[email protected]> wrote:
>
>> Weston - was trying the pyarrow approach you suggested:
>>
>> >def custom_source(endpoint):
>>   return pc.Declaration("my_custom_source", create_my_custom_options())
>>
>> (1) I didn't see "Declaration" under pyarrow.compute - which package is
>> this under?
>> (2) What Python object should I return with  create_my_custom_options()?
>> Currently I only have a C++ class for my custom option.
>>
>> On Thu, Oct 13, 2022 at 12:58 PM Li Jin <[email protected]> wrote:
>>
>>> > I may be assuming here but I think your problem is more that there is
>>> no way to more flexibly describe a source in python and less that you
>>> need to change the default.
>>>
>>> Correct.
>>>
>>> > For example, if you could do something like this (in pyarrow) would it
>>> work?
>>> I could try to see if that works. I feel registering the extension in
>>> C++ via one initialization seems cleaner to me because there are many other
>>> extension points that we initialize (add registering in the 
>>> default_exec_factory_registry
>>> similar to
>>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/plan.cc#L30).
>>> Having some of the initialization in Pyarrow and some in Acero is a bit
>>> complicated IMO.
>>>
>>> Maybe the proper long term fix is to add something similar to
>>> compute::default_exec_factory_registry for all extension points (named
>>> table, substrait extension message) and the user can register custom stuff
>>> similar to how they can register via default_exec_factory_registry?)
>>>
>>> On Thu, Oct 13, 2022 at 12:41 PM Weston Pace <[email protected]>
>>> wrote:
>>>
>>>> > Does that sound like a reasonable way to do this?
>>>>
>>>> It's not ideal.
>>>>
>>>> I may be assuming here but I think your problem is more that there is
>>>> no way to more flexibly describe a source in python and less that you
>>>> need to change the default.
>>>>
>>>> For example, if you could do something like this (in pyarrow) would it
>>>> work?
>>>>
>>>> ```
>>>> def custom_source(endpoint):
>>>>   return pc.Declaration("my_custom_source", create_my_custom_options())
>>>>
>>>> def table_provider(names):
>>>>   return custom_sources[names[0]]
>>>>
>>>> pa.substrait.run_query(my_plan, table_provider=table_provider)
>>>> ```
>>>>
>>>> On Thu, Oct 13, 2022 at 8:24 AM Li Jin <[email protected]> wrote:
>>>> >
>>>> > We did some work around this recently and think there needs to be some
>>>> > small change to allow users to override this default provider. I will
>>>> > explain in more details:
>>>> >
>>>> > (1) Since the variable is defined as static in the substrait/options.h
>>>> > file, each translation unit will have a separate copy of the
>>>> > kDefaultNamedTableProvider
>>>> > variable. And therefore, the user cannot really change the default
>>>> that is
>>>> > used here:
>>>> >
>>>> https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125
>>>> >
>>>> > In order to allow user to override the kDefaultNamedTableProvider (and
>>>> > change the behavior of
>>>> >
>>>> https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125
>>>> > to use a custom NamedTableProvider), we need to
>>>> > (1) in substrait/options.hh, change the definition of
>>>> > kDefaultNamedTableProvider to be an extern declaration
>>>> > (2) move the definition of kDefaultNamedTableProvider to an
>>>> > substrait/options.cc file
>>>> >
>>>> > We are still testing this but based on my limited C++ knowledge, I
>>>> > think this would allow users to do
>>>> > """
>>>> > #include "arrow/engine/substrait/options.h"
>>>> >
>>>> > void initialize() {
>>>> >     arrow::engine::kDefaultNamedTableProvider =
>>>> > some_custom_name_table_provider;
>>>> > }
>>>> > """
>>>> > And then calling `pa.substrat.run_query" should pick up the custom
>>>> name
>>>> > table provider.
>>>> >
>>>> > Does that sound like a reasonable way to do this?
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > On Tue, Sep 27, 2022 at 1:59 PM Li Jin <[email protected]> wrote:
>>>> >
>>>> > > Thanks both. I think NamedTableProvider is close to what I want,
>>>> and like
>>>> > > Weston said, the tricky bit is how to use a custom
>>>> NamedTableProvider when
>>>> > > calling the pyarrow substrait API.
>>>> > >
>>>> > > It's a little hacky but I *think* I can override the value
>>>> "kDefaultNamedTableProvider"
>>>> > > here and pass "table_provider=None" then it "should" work:
>>>> > >
>>>> > >
>>>> https://github.com/apache/arrow/blob/529f653dfa58887522af06028e5c32e8dd1a14ea/cpp/src/arrow/engine/substrait/options.h#L66
>>>> > >
>>>> > > I am going to give that a shot once I pull/build Arrow default into
>>>> our
>>>> > > internal build system.
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > > On Tue, Sep 27, 2022 at 10:50 AM Benjamin Kietzman <
>>>> [email protected]>
>>>> > > wrote:
>>>> > >
>>>> > >> It seems to me that your use case could be handled by defining a
>>>> custom
>>>> > >> NamedTableProvider and
>>>> > >> assigning this to ConversionOptions::named_table_provider. This
>>>> was added
>>>> > >> in
>>>> > >> https://github.com/apache/arrow/pull/13613 to provide user
>>>> configurable
>>>> > >> dispatching for named tables;
>>>> > >> if it doesn't address your use case then we might want to create a
>>>> JIRA to
>>>> > >> extend it.
>>>> > >>
>>>> > >> On Tue, Sep 27, 2022 at 10:41 AM Li Jin <[email protected]>
>>>> wrote:
>>>> > >>
>>>> > >> > I did some more digging into this and have some ideas -
>>>> > >> >
>>>> > >> > Currently, the logic for deserialization named table is:
>>>> > >> >
>>>> > >> >
>>>> > >>
>>>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/engine/substrait/relation_internal.cc#L129
>>>> > >> > and it will look up named tables from a user provided dictionary
>>>> from
>>>> > >> > string -> arrow Table.
>>>> > >> >
>>>> > >> > My idea is to make some short term changes to allow named tables
>>>> to be
>>>> > >> > dispatched differently (This logic can be reverted/removed once
>>>> we
>>>> > >> figure
>>>> > >> > out the proper way to support custom data sources, perhaps via
>>>> substrait
>>>> > >> > Extensions.), specifically:
>>>> > >> >
>>>> > >> > (1) The user creates named table with uris for custom data
>>>> source, i.e.,
>>>> > >> > "my_datasource://tablename?begin=20200101&end=20210101"
>>>> > >> > (2) In the substrait consumer, allowing user to register custom
>>>> dispatch
>>>> > >> > rules based on uri scheme (similar to how exec node registry
>>>> works),
>>>> > >> i.e.,
>>>> > >> > sth like:
>>>> > >> >
>>>> > >> > substrait_named_table_registry.add("my_datasource",
>>>> deser_my_datasource)
>>>> > >> > and deser_my_datasource is a function that takes the NamedTable
>>>> > >> substrait
>>>> > >> > message and returns a declaration.
>>>> > >> >
>>>> > >> > I know doing this just for named tables might not be a very
>>>> general
>>>> > >> > solution but seems the easiest path forward, and we can always
>>>> remove
>>>> > >> this
>>>> > >> > later in favor of a more generic solution.
>>>> > >> >
>>>> > >> > Thoughts?
>>>> > >> >
>>>> > >> > Li
>>>> > >> >
>>>> > >> >
>>>> > >> >
>>>> > >> >
>>>> > >> >
>>>> > >> > On Mon, Sep 26, 2022 at 10:58 AM Li Jin <[email protected]>
>>>> wrote:
>>>> > >> >
>>>> > >> > > Hello!
>>>> > >> > >
>>>> > >> > > I am working on adding a custom data source node in Acero. I
>>>> have a
>>>> > >> few
>>>> > >> > > previous threads related to this topic.
>>>> > >> > >
>>>> > >> > > Currently, I am able to register my custom factory method with
>>>> Acero
>>>> > >> and
>>>> > >> > > create a Custom source node, i.e., I can register and execute
>>>> this
>>>> > >> with
>>>> > >> > > Acero:
>>>> > >> > >
>>>> > >> > > MySourceNodeOptions source_options = ...
>>>> > >> > > Declaration source{"my_source", source_option}
>>>> > >> > >
>>>> > >> > > The next step I want to do is to pass this through to the Acero
>>>> > >> substrait
>>>> > >> > > consumer. From previous discussions, I am going to use
>>>> "NamedTable ''
>>>> > >> as
>>>> > >> > a
>>>> > >> > > temporary way to define my custom data source in substrait. My
>>>> > >> question
>>>> > >> > is
>>>> > >> > > this:
>>>> > >> > >
>>>> > >> > > What I need to do in substrait in order to register my own
>>>> substrait
>>>> > >> > > consumer rule/function for deserializing my custom named table
>>>> > >> protobuf
>>>> > >> > > message into the declaration above. If this is not supported
>>>> right
>>>> > >> now,
>>>> > >> > > what is a reasonable/minimal change to make this work?
>>>> > >> > >
>>>> > >> > > Thanks,
>>>> > >> > > Li
>>>> > >> > >
>>>> > >> >
>>>> > >>
>>>> > >
>>>>
>>>

Re: Substrait consumer for custom data sources

Reply via email to