We did some work around this recently and think there needs to be some
small change to allow users to override this default provider. I will
explain in more details:

(1) Since the variable is defined as static in the substrait/options.h
file, each translation unit will have a separate copy of the
kDefaultNamedTableProvider
variable. And therefore, the user cannot really change the default that is
used here:
https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125

In order to allow user to override the kDefaultNamedTableProvider (and
change the behavior of
https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125
to use a custom NamedTableProvider), we need to
(1) in substrait/options.hh, change the definition of
kDefaultNamedTableProvider to be an extern declaration
(2) move the definition of kDefaultNamedTableProvider to an
substrait/options.cc file

We are still testing this but based on my limited C++ knowledge, I
think this would allow users to do
"""
#include "arrow/engine/substrait/options.h"

void initialize() {
    arrow::engine::kDefaultNamedTableProvider =
some_custom_name_table_provider;
}
"""
And then calling `pa.substrat.run_query" should pick up the custom name
table provider.

Does that sound like a reasonable way to do this?




On Tue, Sep 27, 2022 at 1:59 PM Li Jin <ice.xell...@gmail.com> wrote:

> Thanks both. I think NamedTableProvider is close to what I want, and like
> Weston said, the tricky bit is how to use a custom NamedTableProvider when
> calling the pyarrow substrait API.
>
> It's a little hacky but I *think* I can override the value 
> "kDefaultNamedTableProvider"
> here and pass "table_provider=None" then it "should" work:
>
> https://github.com/apache/arrow/blob/529f653dfa58887522af06028e5c32e8dd1a14ea/cpp/src/arrow/engine/substrait/options.h#L66
>
> I am going to give that a shot once I pull/build Arrow default into our
> internal build system.
>
>
>
>
> On Tue, Sep 27, 2022 at 10:50 AM Benjamin Kietzman <bengil...@gmail.com>
> wrote:
>
>> It seems to me that your use case could be handled by defining a custom
>> NamedTableProvider and
>> assigning this to ConversionOptions::named_table_provider. This was added
>> in
>> https://github.com/apache/arrow/pull/13613 to provide user configurable
>> dispatching for named tables;
>> if it doesn't address your use case then we might want to create a JIRA to
>> extend it.
>>
>> On Tue, Sep 27, 2022 at 10:41 AM Li Jin <ice.xell...@gmail.com> wrote:
>>
>> > I did some more digging into this and have some ideas -
>> >
>> > Currently, the logic for deserialization named table is:
>> >
>> >
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/engine/substrait/relation_internal.cc#L129
>> > and it will look up named tables from a user provided dictionary from
>> > string -> arrow Table.
>> >
>> > My idea is to make some short term changes to allow named tables to be
>> > dispatched differently (This logic can be reverted/removed once we
>> figure
>> > out the proper way to support custom data sources, perhaps via substrait
>> > Extensions.), specifically:
>> >
>> > (1) The user creates named table with uris for custom data source, i.e.,
>> > "my_datasource://tablename?begin=20200101&end=20210101"
>> > (2) In the substrait consumer, allowing user to register custom dispatch
>> > rules based on uri scheme (similar to how exec node registry works),
>> i.e.,
>> > sth like:
>> >
>> > substrait_named_table_registry.add("my_datasource", deser_my_datasource)
>> > and deser_my_datasource is a function that takes the NamedTable
>> substrait
>> > message and returns a declaration.
>> >
>> > I know doing this just for named tables might not be a very general
>> > solution but seems the easiest path forward, and we can always remove
>> this
>> > later in favor of a more generic solution.
>> >
>> > Thoughts?
>> >
>> > Li
>> >
>> >
>> >
>> >
>> >
>> > On Mon, Sep 26, 2022 at 10:58 AM Li Jin <ice.xell...@gmail.com> wrote:
>> >
>> > > Hello!
>> > >
>> > > I am working on adding a custom data source node in Acero. I have a
>> few
>> > > previous threads related to this topic.
>> > >
>> > > Currently, I am able to register my custom factory method with Acero
>> and
>> > > create a Custom source node, i.e., I can register and execute this
>> with
>> > > Acero:
>> > >
>> > > MySourceNodeOptions source_options = ...
>> > > Declaration source{"my_source", source_option}
>> > >
>> > > The next step I want to do is to pass this through to the Acero
>> substrait
>> > > consumer. From previous discussions, I am going to use "NamedTable ''
>> as
>> > a
>> > > temporary way to define my custom data source in substrait. My
>> question
>> > is
>> > > this:
>> > >
>> > > What I need to do in substrait in order to register my own substrait
>> > > consumer rule/function for deserializing my custom named table
>> protobuf
>> > > message into the declaration above. If this is not supported right
>> now,
>> > > what is a reasonable/minimal change to make this work?
>> > >
>> > > Thanks,
>> > > Li
>> > >
>> >
>>
>

Reply via email to