Thanks both. I think NamedTableProvider is close to what I want and, as
Weston said, the tricky bit is how to use a custom NamedTableProvider when
calling the pyarrow substrait API.

It's a little hacky, but I *think* I can override the value of
"kDefaultNamedTableProvider" here and pass "table_provider=None", and then
it "should" work:
https://github.com/apache/arrow/blob/529f653dfa58887522af06028e5c32e8dd1a14ea/cpp/src/arrow/engine/substrait/options.h#L66
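
For comparison, here is a rough sketch of what the Python-side dispatch could
look like if a custom callable were passed as table_provider instead of None.
It only illustrates the URI-scheme dispatch idea from earlier in the thread.
Assumptions: the callable receives the NamedTable name list (newer pyarrow
versions may also pass a schema, hence the optional second argument), and the
URI is the last element of that list. The registry, register_loader, and
load_my_datasource are hypothetical names, not part of pyarrow, and the
loader returns a plain dict where a real one would return a pyarrow.Table.

```python
from urllib.parse import parse_qs

# Hypothetical registry (not a pyarrow API): URI scheme -> loader callable.
_LOADERS = {}


def register_loader(scheme, loader):
    """Register a loader(target, params) for one URI scheme."""
    _LOADERS[scheme] = loader


def table_provider(names, schema=None):
    """Resolve a NamedTable name list by dispatching on the URI scheme.

    Assumption: the URI is the last element of `names`. `schema` is accepted
    but unused, so the callable tolerates one or two arguments.
    """
    uri = names[-1]
    # Split by hand: urllib.parse does not treat "my_datasource" as a scheme
    # because of the underscore.
    scheme, sep, rest = uri.partition("://")
    if not sep or scheme not in _LOADERS:
        raise KeyError(f"no loader registered for {uri!r}")
    target, _, query = rest.partition("?")
    # Keep the first value of each query parameter.
    params = {key: values[0] for key, values in parse_qs(query).items()}
    return _LOADERS[scheme](target, params)


def load_my_datasource(target, params):
    # Placeholder: a real loader would build and return a pyarrow.Table.
    return {"table": target, "begin": params["begin"], "end": params["end"]}


register_loader("my_datasource", load_my_datasource)
result = table_provider(["my_datasource://tablename?begin=20200101&end=20210101"])
```

If this approach were used, the callable would be handed to the pyarrow
substrait entry point as table_provider=table_provider. The exact argument
list that entry point passes to the callable depends on the pyarrow version,
so that part needs checking against the installed release.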

I am going to give that a shot once I pull and build the Arrow default
branch in our internal build system.




On Tue, Sep 27, 2022 at 10:50 AM Benjamin Kietzman <bengil...@gmail.com>
wrote:

> It seems to me that your use case could be handled by defining a custom
> NamedTableProvider and assigning this to
> ConversionOptions::named_table_provider. This was added in
> https://github.com/apache/arrow/pull/13613 to provide user-configurable
> dispatching for named tables; if it doesn't address your use case then we
> might want to create a JIRA to extend it.
>
> On Tue, Sep 27, 2022 at 10:41 AM Li Jin <ice.xell...@gmail.com> wrote:
>
> > I did some more digging into this and have some ideas -
> >
> > Currently, the logic for deserializing a named table is:
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/engine/substrait/relation_internal.cc#L129
> >
> > It looks up named tables in a user-provided dictionary mapping
> > string -> Arrow Table.
> >
> > My idea is to make some short-term changes to allow named tables to be
> > dispatched differently (this logic can be reverted/removed once we figure
> > out the proper way to support custom data sources, perhaps via substrait
> > Extensions). Specifically:
> >
> > (1) The user creates named tables with URIs for the custom data source,
> > e.g., "my_datasource://tablename?begin=20200101&end=20210101"
> > (2) The substrait consumer allows the user to register custom dispatch
> > rules based on URI scheme (similar to how the exec node registry works),
> > e.g., something like:
> >
> > substrait_named_table_registry.add("my_datasource", deser_my_datasource)
> >
> > where deser_my_datasource is a function that takes the NamedTable
> > substrait message and returns a declaration.
> >
> > I know doing this just for named tables might not be a very general
> > solution, but it seems the easiest path forward, and we can always remove
> > this later in favor of a more generic solution.
> >
> > Thoughts?
> >
> > Li
> >
> >
> >
> >
> >
> > On Mon, Sep 26, 2022 at 10:58 AM Li Jin <ice.xell...@gmail.com> wrote:
> >
> > > Hello!
> > >
> > > I am working on adding a custom data source node in Acero. I have a few
> > > previous threads related to this topic.
> > >
> > > Currently, I am able to register my custom factory method with Acero
> > > and create a custom source node, i.e., I can register and execute this
> > > with Acero:
> > >
> > > MySourceNodeOptions source_options = ...
> > > Declaration source{"my_source", source_options};
> > >
> > > The next step is to pass this through to the Acero substrait consumer.
> > > From previous discussions, I am going to use "NamedTable" as a temporary
> > > way to define my custom data source in substrait. My question is this:
> > >
> > > What do I need to do in substrait in order to register my own substrait
> > > consumer rule/function for deserializing my custom named table protobuf
> > > message into the declaration above? If this is not supported right now,
> > > what is a reasonable/minimal change to make this work?
> > >
> > > Thanks,
> > > Li
> > >
> >
>
