After some struggling we finally managed to connect our internal data source to Acero and executed a data load via pyarrow.substrait.run_query() !
We did end up temporarily modifying substrait/options.h source code locally and made kDefaultNamedTableProvider extern/global. But since this doesn't sound like the correct way, I am happy to do this correctly but someone let me know the correct way :) Li On Thu, Oct 13, 2022 at 2:01 PM Li Jin <ice.xell...@gmail.com> wrote: > Going back to the default_exec_factory_registry idea, I think ultimately > maybe we want registration API that looks like: > > """ > MetaRegistry* registry = compute::default_meta_registry(); > registry->RegisterNamedTableProvider(...); > registry->exec_factory_registry()->AddFactory("my_custom_node", > MakeMyCustomNode) > ... > """ > > On Thu, Oct 13, 2022 at 1:32 PM Li Jin <ice.xell...@gmail.com> wrote: > >> Weston - was trying the pyarrow approach you suggested: >> >> >def custom_source(endpoint): >> return pc.Declaration("my_custom_source", create_my_custom_options()) >> >> (1) I didn't see "Declaration" under pyarrow.compute - which package is >> this under? >> (2) What Python object should I return with create_my_custom_options()? >> Currently I only have a C++ class for my custom option. >> >> On Thu, Oct 13, 2022 at 12:58 PM Li Jin <ice.xell...@gmail.com> wrote: >> >>> > I may be assuming here but I think your problem is more that there is >>> no way to more flexibly describe a source in python and less that you >>> need to change the default. >>> >>> Correct. >>> >>> > For example, if you could do something like this (in pyarrow) would it >>> work? >>> I could try to see if that works. I feel registering the extension in >>> C++ via one initialization seems cleaner to me because there are many other >>> extension points that we initialize (add registering in the >>> default_exec_factory_registry >>> similar to >>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/plan.cc#L30). >>> Having some of the initialization in Pyarrow and some in Acero is a bit >>> complicated IMO. >>> >>> Maybe the proper long term fix is to add something similar to >>> compute::default_exec_factory_registry for all extension points (named >>> table, substrait extension message) and the user can register custom stuff >>> similar to how they can register via default_exec_factory_registry?) >>> >>> On Thu, Oct 13, 2022 at 12:41 PM Weston Pace <weston.p...@gmail.com> >>> wrote: >>> >>>> > Does that sound like a reasonable way to do this? >>>> >>>> It's not ideal. >>>> >>>> I may be assuming here but I think your problem is more that there is >>>> no way to more flexibly describe a source in python and less that you >>>> need to change the default. >>>> >>>> For example, if you could do something like this (in pyarrow) would it >>>> work? >>>> >>>> ``` >>>> def custom_source(endpoint): >>>> return pc.Declaration("my_custom_source", create_my_custom_options()) >>>> >>>> def table_provider(names): >>>> return custom_sources[names[0]] >>>> >>>> pa.substrait.run_query(my_plan, table_provider=table_provider) >>>> ``` >>>> >>>> On Thu, Oct 13, 2022 at 8:24 AM Li Jin <ice.xell...@gmail.com> wrote: >>>> > >>>> > We did some work around this recently and think there needs to be some >>>> > small change to allow users to override this default provider. I will >>>> > explain in more details: >>>> > >>>> > (1) Since the variable is defined as static in the substrait/options.h >>>> > file, each translation unit will have a separate copy of the >>>> > kDefaultNamedTableProvider >>>> > variable. And therefore, the user cannot really change the default >>>> that is >>>> > used here: >>>> > >>>> https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125 >>>> > >>>> > In order to allow user to override the kDefaultNamedTableProvider (and >>>> > change the behavior of >>>> > >>>> https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125 >>>> > to use a custom NamedTableProvider), we need to >>>> > (1) in substrait/options.hh, change the definition of >>>> > kDefaultNamedTableProvider to be an extern declaration >>>> > (2) move the definition of kDefaultNamedTableProvider to an >>>> > substrait/options.cc file >>>> > >>>> > We are still testing this but based on my limited C++ knowledge, I >>>> > think this would allow users to do >>>> > """ >>>> > #include "arrow/engine/substrait/options.h" >>>> > >>>> > void initialize() { >>>> > arrow::engine::kDefaultNamedTableProvider = >>>> > some_custom_name_table_provider; >>>> > } >>>> > """ >>>> > And then calling `pa.substrat.run_query" should pick up the custom >>>> name >>>> > table provider. >>>> > >>>> > Does that sound like a reasonable way to do this? >>>> > >>>> > >>>> > >>>> > >>>> > On Tue, Sep 27, 2022 at 1:59 PM Li Jin <ice.xell...@gmail.com> wrote: >>>> > >>>> > > Thanks both. I think NamedTableProvider is close to what I want, >>>> and like >>>> > > Weston said, the tricky bit is how to use a custom >>>> NamedTableProvider when >>>> > > calling the pyarrow substrait API. >>>> > > >>>> > > It's a little hacky but I *think* I can override the value >>>> "kDefaultNamedTableProvider" >>>> > > here and pass "table_provider=None" then it "should" work: >>>> > > >>>> > > >>>> https://github.com/apache/arrow/blob/529f653dfa58887522af06028e5c32e8dd1a14ea/cpp/src/arrow/engine/substrait/options.h#L66 >>>> > > >>>> > > I am going to give that a shot once I pull/build Arrow default into >>>> our >>>> > > internal build system. >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > On Tue, Sep 27, 2022 at 10:50 AM Benjamin Kietzman < >>>> bengil...@gmail.com> >>>> > > wrote: >>>> > > >>>> > >> It seems to me that your use case could be handled by defining a >>>> custom >>>> > >> NamedTableProvider and >>>> > >> assigning this to ConversionOptions::named_table_provider. This >>>> was added >>>> > >> in >>>> > >> https://github.com/apache/arrow/pull/13613 to provide user >>>> configurable >>>> > >> dispatching for named tables; >>>> > >> if it doesn't address your use case then we might want to create a >>>> JIRA to >>>> > >> extend it. >>>> > >> >>>> > >> On Tue, Sep 27, 2022 at 10:41 AM Li Jin <ice.xell...@gmail.com> >>>> wrote: >>>> > >> >>>> > >> > I did some more digging into this and have some ideas - >>>> > >> > >>>> > >> > Currently, the logic for deserialization named table is: >>>> > >> > >>>> > >> > >>>> > >> >>>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/engine/substrait/relation_internal.cc#L129 >>>> > >> > and it will look up named tables from a user provided dictionary >>>> from >>>> > >> > string -> arrow Table. >>>> > >> > >>>> > >> > My idea is to make some short term changes to allow named tables >>>> to be >>>> > >> > dispatched differently (This logic can be reverted/removed once >>>> we >>>> > >> figure >>>> > >> > out the proper way to support custom data sources, perhaps via >>>> substrait >>>> > >> > Extensions.), specifically: >>>> > >> > >>>> > >> > (1) The user creates named table with uris for custom data >>>> source, i.e., >>>> > >> > "my_datasource://tablename?begin=20200101&end=20210101" >>>> > >> > (2) In the substrait consumer, allowing user to register custom >>>> dispatch >>>> > >> > rules based on uri scheme (similar to how exec node registry >>>> works), >>>> > >> i.e., >>>> > >> > sth like: >>>> > >> > >>>> > >> > substrait_named_table_registry.add("my_datasource", >>>> deser_my_datasource) >>>> > >> > and deser_my_datasource is a function that takes the NamedTable >>>> > >> substrait >>>> > >> > message and returns a declaration. >>>> > >> > >>>> > >> > I know doing this just for named tables might not be a very >>>> general >>>> > >> > solution but seems the easiest path forward, and we can always >>>> remove >>>> > >> this >>>> > >> > later in favor of a more generic solution. >>>> > >> > >>>> > >> > Thoughts? >>>> > >> > >>>> > >> > Li >>>> > >> > >>>> > >> > >>>> > >> > >>>> > >> > >>>> > >> > >>>> > >> > On Mon, Sep 26, 2022 at 10:58 AM Li Jin <ice.xell...@gmail.com> >>>> wrote: >>>> > >> > >>>> > >> > > Hello! >>>> > >> > > >>>> > >> > > I am working on adding a custom data source node in Acero. I >>>> have a >>>> > >> few >>>> > >> > > previous threads related to this topic. >>>> > >> > > >>>> > >> > > Currently, I am able to register my custom factory method with >>>> Acero >>>> > >> and >>>> > >> > > create a Custom source node, i.e., I can register and execute >>>> this >>>> > >> with >>>> > >> > > Acero: >>>> > >> > > >>>> > >> > > MySourceNodeOptions source_options = ... >>>> > >> > > Declaration source{"my_source", source_option} >>>> > >> > > >>>> > >> > > The next step I want to do is to pass this through to the Acero >>>> > >> substrait >>>> > >> > > consumer. From previous discussions, I am going to use >>>> "NamedTable '' >>>> > >> as >>>> > >> > a >>>> > >> > > temporary way to define my custom data source in substrait. My >>>> > >> question >>>> > >> > is >>>> > >> > > this: >>>> > >> > > >>>> > >> > > What I need to do in substrait in order to register my own >>>> substrait >>>> > >> > > consumer rule/function for deserializing my custom named table >>>> > >> protobuf >>>> > >> > > message into the declaration above. If this is not supported >>>> right >>>> > >> now, >>>> > >> > > what is a reasonable/minimal change to make this work? >>>> > >> > > >>>> > >> > > Thanks, >>>> > >> > > Li >>>> > >> > > >>>> > >> > >>>> > >> >>>> > > >>>> >>>