We did some work around this recently and think a small change is needed to allow users to override this default provider. I will explain in more detail:
Since the variable is defined as static in substrait/options.h, each translation unit that includes the header gets its own copy of kDefaultNamedTableProvider. Assigning to it from user code therefore only changes the user's copy, not the default that is used here:

https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125

To allow users to override kDefaultNamedTableProvider (so that the code at the link above uses a custom NamedTableProvider), we need to:

(1) in substrait/options.h, change the definition of kDefaultNamedTableProvider to an extern declaration
(2) move the definition of kDefaultNamedTableProvider to a substrait/options.cc file

We are still testing this, but based on my limited C++ knowledge, I think this would allow users to do

"""
#include "arrow/engine/substrait/options.h"

void initialize() {
  arrow::engine::kDefaultNamedTableProvider = some_custom_name_table_provider;
}
"""

And then calling `pa.substrait.run_query` should pick up the custom named table provider.

Does that sound like a reasonable way to do this?

On Tue, Sep 27, 2022 at 1:59 PM Li Jin <ice.xell...@gmail.com> wrote:

> Thanks both. I think NamedTableProvider is close to what I want, and like
> Weston said, the tricky bit is how to use a custom NamedTableProvider when
> calling the pyarrow substrait API.
>
> It's a little hacky but I *think* I can override the value
> "kDefaultNamedTableProvider"
> here and pass "table_provider=None" then it "should" work:
>
> https://github.com/apache/arrow/blob/529f653dfa58887522af06028e5c32e8dd1a14ea/cpp/src/arrow/engine/substrait/options.h#L66
>
> I am going to give that a shot once I pull/build Arrow default into our
> internal build system.
>
> On Tue, Sep 27, 2022 at 10:50 AM Benjamin Kietzman <bengil...@gmail.com>
> wrote:
>
>> It seems to me that your use case could be handled by defining a custom
>> NamedTableProvider and assigning this to
>> ConversionOptions::named_table_provider. This was added in
>> https://github.com/apache/arrow/pull/13613 to provide user configurable
>> dispatching for named tables; if it doesn't address your use case then
>> we might want to create a JIRA to extend it.
>>
>> On Tue, Sep 27, 2022 at 10:41 AM Li Jin <ice.xell...@gmail.com> wrote:
>>
>> > I did some more digging into this and have some ideas -
>> >
>> > Currently, the logic for deserialization named table is:
>> >
>> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/engine/substrait/relation_internal.cc#L129
>> >
>> > and it will look up named tables from a user provided dictionary from
>> > string -> arrow Table.
>> >
>> > My idea is to make some short term changes to allow named tables to be
>> > dispatched differently (This logic can be reverted/removed once we
>> > figure out the proper way to support custom data sources, perhaps via
>> > substrait Extensions.), specifically:
>> >
>> > (1) The user creates named table with uris for custom data source, i.e.,
>> > "my_datasource://tablename?begin=20200101&end=20210101"
>> > (2) In the substrait consumer, allowing user to register custom dispatch
>> > rules based on uri scheme (similar to how exec node registry works),
>> > i.e., sth like:
>> >
>> > substrait_named_table_registry.add("my_datasource", deser_my_datasource)
>> >
>> > and deser_my_datasource is a function that takes the NamedTable
>> > substrait message and returns a declaration.
>> >
>> > I know doing this just for named tables might not be a very general
>> > solution but seems the easiest path forward, and we can always remove
>> > this later in favor of a more generic solution.
>> >
>> > Thoughts?
>> >
>> > Li
>> >
>> > On Mon, Sep 26, 2022 at 10:58 AM Li Jin <ice.xell...@gmail.com> wrote:
>> >
>> > > Hello!
>> > >
>> > > I am working on adding a custom data source node in Acero. I have a
>> > > few previous threads related to this topic.
>> > >
>> > > Currently, I am able to register my custom factory method with Acero
>> > > and create a Custom source node, i.e., I can register and execute
>> > > this with Acero:
>> > >
>> > > MySourceNodeOptions source_options = ...
>> > > Declaration source{"my_source", source_option}
>> > >
>> > > The next step I want to do is to pass this through to the Acero
>> > > substrait consumer. From previous discussions, I am going to use
>> > > "NamedTable" as a temporary way to define my custom data source in
>> > > substrait. My question is this:
>> > >
>> > > What I need to do in substrait in order to register my own substrait
>> > > consumer rule/function for deserializing my custom named table
>> > > protobuf message into the declaration above. If this is not supported
>> > > right now, what is a reasonable/minimal change to make this work?
>> > >
>> > > Thanks,
>> > > Li
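
For reference, here is a minimal, self-contained sketch of the extern approach described at the top of this message. It does not use Arrow at all: the `demo` namespace, the string-returning `NamedTableProvider` alias, and `ResolveTable` are hypothetical stand-ins I made up for illustration, and the header/source split is only indicated by comments because everything sits in one file here. The real Arrow types and call sites will differ.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

namespace demo {

// Hypothetical stand-in for a named table provider: takes the name
// components of a named table and returns a description of the source.
using NamedTableProvider =
    std::function<std::string(const std::vector<std::string>&)>;

// --- what would go in the header (options.h in the proposal) ---
// Declaration only. Every translation unit that includes the header now
// refers to the same object instead of getting its own static copy.
extern NamedTableProvider kDefaultNamedTableProvider;

// --- what would go in the source file (options.cc in the proposal) ---
// The single definition of the shared variable.
NamedTableProvider kDefaultNamedTableProvider =
    [](const std::vector<std::string>&) { return std::string("default"); };

// Stand-in for library code that falls back to the default provider.
std::string ResolveTable(const std::vector<std::string>& names) {
  assert(kDefaultNamedTableProvider && "default provider must be callable");
  return kDefaultNamedTableProvider(names);
}

// --- user code ---
// Because there is only one kDefaultNamedTableProvider, this assignment is
// visible to ResolveTable as well.
void Initialize() {
  kDefaultNamedTableProvider = [](const std::vector<std::string>& names) {
    return "custom:" + (names.empty() ? std::string() : names.front());
  };
}

}  // namespace demo
```

With the static-in-header version, the assignment in `Initialize` would only modify the copy belonging to the user's translation unit, so the library-side lookup would keep using its own unchanged default.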