Weston - was trying the pyarrow approach you suggested: >def custom_source(endpoint): return pc.Declaration("my_custom_source", create_my_custom_options())
(1) I didn't see "Declaration" under pyarrow.compute - which package is this under? (2) What Python object should I return with create_my_custom_options()? Currently I only have a C++ class for my custom option. On Thu, Oct 13, 2022 at 12:58 PM Li Jin <ice.xell...@gmail.com> wrote: > > I may be assuming here but I think your problem is more that there is > no way to more flexibly describe a source in python and less that you > need to change the default. > > Correct. > > > For example, if you could do something like this (in pyarrow) would it > work? > I could try to see if that works. I feel registering the extension in C++ > via one initialization seems cleaner to me because there are many other > extension points that we initialize (add registering in the > default_exec_factory_registry > similar to > https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/plan.cc#L30). > Having some of the initialization in Pyarrow and some in Acero is a bit > complicated IMO. > > Maybe the proper long term fix is to add something similar to > compute::default_exec_factory_registry for all extension points (named > table, substrait extension message) and the user can register custom stuff > similar to how they can register via default_exec_factory_registry?) > > On Thu, Oct 13, 2022 at 12:41 PM Weston Pace <weston.p...@gmail.com> > wrote: > >> > Does that sound like a reasonable way to do this? >> >> It's not ideal. >> >> I may be assuming here but I think your problem is more that there is >> no way to more flexibly describe a source in python and less that you >> need to change the default. >> >> For example, if you could do something like this (in pyarrow) would it >> work? >> >> ``` >> def custom_source(endpoint): >> return pc.Declaration("my_custom_source", create_my_custom_options()) >> >> def table_provider(names): >> return custom_sources[names[0]] >> >> pa.substrait.run_query(my_plan, table_provider=table_provider) >> ``` >> >> On Thu, Oct 13, 2022 at 8:24 AM Li Jin <ice.xell...@gmail.com> wrote: >> > >> > We did some work around this recently and think there needs to be some >> > small change to allow users to override this default provider. I will >> > explain in more details: >> > >> > (1) Since the variable is defined as static in the substrait/options.h >> > file, each translation unit will have a separate copy of the >> > kDefaultNamedTableProvider >> > variable. And therefore, the user cannot really change the default that >> is >> > used here: >> > >> https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125 >> > >> > In order to allow user to override the kDefaultNamedTableProvider (and >> > change the behavior of >> > >> https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125 >> > to use a custom NamedTableProvider), we need to >> > (1) in substrait/options.hh, change the definition of >> > kDefaultNamedTableProvider to be an extern declaration >> > (2) move the definition of kDefaultNamedTableProvider to an >> > substrait/options.cc file >> > >> > We are still testing this but based on my limited C++ knowledge, I >> > think this would allow users to do >> > """ >> > #include "arrow/engine/substrait/options.h" >> > >> > void initialize() { >> > arrow::engine::kDefaultNamedTableProvider = >> > some_custom_name_table_provider; >> > } >> > """ >> > And then calling `pa.substrat.run_query" should pick up the custom name >> > table provider. >> > >> > Does that sound like a reasonable way to do this? >> > >> > >> > >> > >> > On Tue, Sep 27, 2022 at 1:59 PM Li Jin <ice.xell...@gmail.com> wrote: >> > >> > > Thanks both. I think NamedTableProvider is close to what I want, and >> like >> > > Weston said, the tricky bit is how to use a custom NamedTableProvider >> when >> > > calling the pyarrow substrait API. >> > > >> > > It's a little hacky but I *think* I can override the value >> "kDefaultNamedTableProvider" >> > > here and pass "table_provider=None" then it "should" work: >> > > >> > > >> https://github.com/apache/arrow/blob/529f653dfa58887522af06028e5c32e8dd1a14ea/cpp/src/arrow/engine/substrait/options.h#L66 >> > > >> > > I am going to give that a shot once I pull/build Arrow default into >> our >> > > internal build system. >> > > >> > > >> > > >> > > >> > > On Tue, Sep 27, 2022 at 10:50 AM Benjamin Kietzman < >> bengil...@gmail.com> >> > > wrote: >> > > >> > >> It seems to me that your use case could be handled by defining a >> custom >> > >> NamedTableProvider and >> > >> assigning this to ConversionOptions::named_table_provider. This was >> added >> > >> in >> > >> https://github.com/apache/arrow/pull/13613 to provide user >> configurable >> > >> dispatching for named tables; >> > >> if it doesn't address your use case then we might want to create a >> JIRA to >> > >> extend it. >> > >> >> > >> On Tue, Sep 27, 2022 at 10:41 AM Li Jin <ice.xell...@gmail.com> >> wrote: >> > >> >> > >> > I did some more digging into this and have some ideas - >> > >> > >> > >> > Currently, the logic for deserialization named table is: >> > >> > >> > >> > >> > >> >> https://github.com/apache/arrow/blob/master/cpp/src/arrow/engine/substrait/relation_internal.cc#L129 >> > >> > and it will look up named tables from a user provided dictionary >> from >> > >> > string -> arrow Table. >> > >> > >> > >> > My idea is to make some short term changes to allow named tables >> to be >> > >> > dispatched differently (This logic can be reverted/removed once we >> > >> figure >> > >> > out the proper way to support custom data sources, perhaps via >> substrait >> > >> > Extensions.), specifically: >> > >> > >> > >> > (1) The user creates named table with uris for custom data source, >> i.e., >> > >> > "my_datasource://tablename?begin=20200101&end=20210101" >> > >> > (2) In the substrait consumer, allowing user to register custom >> dispatch >> > >> > rules based on uri scheme (similar to how exec node registry >> works), >> > >> i.e., >> > >> > sth like: >> > >> > >> > >> > substrait_named_table_registry.add("my_datasource", >> deser_my_datasource) >> > >> > and deser_my_datasource is a function that takes the NamedTable >> > >> substrait >> > >> > message and returns a declaration. >> > >> > >> > >> > I know doing this just for named tables might not be a very general >> > >> > solution but seems the easiest path forward, and we can always >> remove >> > >> this >> > >> > later in favor of a more generic solution. >> > >> > >> > >> > Thoughts? >> > >> > >> > >> > Li >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > On Mon, Sep 26, 2022 at 10:58 AM Li Jin <ice.xell...@gmail.com> >> wrote: >> > >> > >> > >> > > Hello! >> > >> > > >> > >> > > I am working on adding a custom data source node in Acero. I >> have a >> > >> few >> > >> > > previous threads related to this topic. >> > >> > > >> > >> > > Currently, I am able to register my custom factory method with >> Acero >> > >> and >> > >> > > create a Custom source node, i.e., I can register and execute >> this >> > >> with >> > >> > > Acero: >> > >> > > >> > >> > > MySourceNodeOptions source_options = ... >> > >> > > Declaration source{"my_source", source_option} >> > >> > > >> > >> > > The next step I want to do is to pass this through to the Acero >> > >> substrait >> > >> > > consumer. From previous discussions, I am going to use >> "NamedTable '' >> > >> as >> > >> > a >> > >> > > temporary way to define my custom data source in substrait. My >> > >> question >> > >> > is >> > >> > > this: >> > >> > > >> > >> > > What I need to do in substrait in order to register my own >> substrait >> > >> > > consumer rule/function for deserializing my custom named table >> > >> protobuf >> > >> > > message into the declaration above. If this is not supported >> right >> > >> now, >> > >> > > what is a reasonable/minimal change to make this work? >> > >> > > >> > >> > > Thanks, >> > >> > > Li >> > >> > > >> > >> > >> > >> >> > > >> >