> I may be assuming here but I think your problem is more that there is no way to more flexibly describe a source in python and less that you need to change the default.
Correct. > For example, if you could do something like this (in pyarrow) would it work? I could try to see if that works. I feel registering the extension in C++ via one initialization seems cleaner to me because there are many other extension points that we initialize (add registering in the default_exec_factory_registry similar to https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/plan.cc#L30). Having some of the initialization in Pyarrow and some in Acero is a bit complicated IMO. Maybe the proper long term fix is to add something similar to compute::default_exec_factory_registry for all extension points (named table, substrait extension message) and the user can register custom stuff similar to how they can register via default_exec_factory_registry?) On Thu, Oct 13, 2022 at 12:41 PM Weston Pace <weston.p...@gmail.com> wrote: > > Does that sound like a reasonable way to do this? > > It's not ideal. > > I may be assuming here but I think your problem is more that there is > no way to more flexibly describe a source in python and less that you > need to change the default. > > For example, if you could do something like this (in pyarrow) would it > work? > > ``` > def custom_source(endpoint): > return pc.Declaration("my_custom_source", create_my_custom_options()) > > def table_provider(names): > return custom_sources[names[0]] > > pa.substrait.run_query(my_plan, table_provider=table_provider) > ``` > > On Thu, Oct 13, 2022 at 8:24 AM Li Jin <ice.xell...@gmail.com> wrote: > > > > We did some work around this recently and think there needs to be some > > small change to allow users to override this default provider. I will > > explain in more details: > > > > (1) Since the variable is defined as static in the substrait/options.h > > file, each translation unit will have a separate copy of the > > kDefaultNamedTableProvider > > variable. And therefore, the user cannot really change the default that > is > > used here: > > > https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125 > > > > In order to allow user to override the kDefaultNamedTableProvider (and > > change the behavior of > > > https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125 > > to use a custom NamedTableProvider), we need to > > (1) in substrait/options.hh, change the definition of > > kDefaultNamedTableProvider to be an extern declaration > > (2) move the definition of kDefaultNamedTableProvider to an > > substrait/options.cc file > > > > We are still testing this but based on my limited C++ knowledge, I > > think this would allow users to do > > """ > > #include "arrow/engine/substrait/options.h" > > > > void initialize() { > > arrow::engine::kDefaultNamedTableProvider = > > some_custom_name_table_provider; > > } > > """ > > And then calling `pa.substrat.run_query" should pick up the custom name > > table provider. > > > > Does that sound like a reasonable way to do this? > > > > > > > > > > On Tue, Sep 27, 2022 at 1:59 PM Li Jin <ice.xell...@gmail.com> wrote: > > > > > Thanks both. I think NamedTableProvider is close to what I want, and > like > > > Weston said, the tricky bit is how to use a custom NamedTableProvider > when > > > calling the pyarrow substrait API. > > > > > > It's a little hacky but I *think* I can override the value > "kDefaultNamedTableProvider" > > > here and pass "table_provider=None" then it "should" work: > > > > > > > https://github.com/apache/arrow/blob/529f653dfa58887522af06028e5c32e8dd1a14ea/cpp/src/arrow/engine/substrait/options.h#L66 > > > > > > I am going to give that a shot once I pull/build Arrow default into our > > > internal build system. > > > > > > > > > > > > > > > On Tue, Sep 27, 2022 at 10:50 AM Benjamin Kietzman < > bengil...@gmail.com> > > > wrote: > > > > > >> It seems to me that your use case could be handled by defining a > custom > > >> NamedTableProvider and > > >> assigning this to ConversionOptions::named_table_provider. This was > added > > >> in > > >> https://github.com/apache/arrow/pull/13613 to provide user > configurable > > >> dispatching for named tables; > > >> if it doesn't address your use case then we might want to create a > JIRA to > > >> extend it. > > >> > > >> On Tue, Sep 27, 2022 at 10:41 AM Li Jin <ice.xell...@gmail.com> > wrote: > > >> > > >> > I did some more digging into this and have some ideas - > > >> > > > >> > Currently, the logic for deserialization named table is: > > >> > > > >> > > > >> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/engine/substrait/relation_internal.cc#L129 > > >> > and it will look up named tables from a user provided dictionary > from > > >> > string -> arrow Table. > > >> > > > >> > My idea is to make some short term changes to allow named tables to > be > > >> > dispatched differently (This logic can be reverted/removed once we > > >> figure > > >> > out the proper way to support custom data sources, perhaps via > substrait > > >> > Extensions.), specifically: > > >> > > > >> > (1) The user creates named table with uris for custom data source, > i.e., > > >> > "my_datasource://tablename?begin=20200101&end=20210101" > > >> > (2) In the substrait consumer, allowing user to register custom > dispatch > > >> > rules based on uri scheme (similar to how exec node registry works), > > >> i.e., > > >> > sth like: > > >> > > > >> > substrait_named_table_registry.add("my_datasource", > deser_my_datasource) > > >> > and deser_my_datasource is a function that takes the NamedTable > > >> substrait > > >> > message and returns a declaration. > > >> > > > >> > I know doing this just for named tables might not be a very general > > >> > solution but seems the easiest path forward, and we can always > remove > > >> this > > >> > later in favor of a more generic solution. > > >> > > > >> > Thoughts? > > >> > > > >> > Li > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > On Mon, Sep 26, 2022 at 10:58 AM Li Jin <ice.xell...@gmail.com> > wrote: > > >> > > > >> > > Hello! > > >> > > > > >> > > I am working on adding a custom data source node in Acero. I have > a > > >> few > > >> > > previous threads related to this topic. > > >> > > > > >> > > Currently, I am able to register my custom factory method with > Acero > > >> and > > >> > > create a Custom source node, i.e., I can register and execute this > > >> with > > >> > > Acero: > > >> > > > > >> > > MySourceNodeOptions source_options = ... > > >> > > Declaration source{"my_source", source_option} > > >> > > > > >> > > The next step I want to do is to pass this through to the Acero > > >> substrait > > >> > > consumer. From previous discussions, I am going to use > "NamedTable '' > > >> as > > >> > a > > >> > > temporary way to define my custom data source in substrait. My > > >> question > > >> > is > > >> > > this: > > >> > > > > >> > > What I need to do in substrait in order to register my own > substrait > > >> > > consumer rule/function for deserializing my custom named table > > >> protobuf > > >> > > message into the declaration above. If this is not supported right > > >> now, > > >> > > what is a reasonable/minimal change to make this work? > > >> > > > > >> > > Thanks, > > >> > > Li > > >> > > > > >> > > > >> > > > >