Weston - was trying the pyarrow approach you suggested:

>def custom_source(endpoint):
  return pc.Declaration("my_custom_source", create_my_custom_options())

(1) I didn't see "Declaration" under pyarrow.compute - which package is
this under?
(2) What Python object should I return with  create_my_custom_options()?
Currently I only have a C++ class for my custom option.

On Thu, Oct 13, 2022 at 12:58 PM Li Jin <ice.xell...@gmail.com> wrote:

> > I may be assuming here but I think your problem is more that there is
> no way to more flexibly describe a source in python and less that you
> need to change the default.
>
> Correct.
>
> > For example, if you could do something like this (in pyarrow) would it
> work?
> I could try to see if that works. I feel registering the extension in C++
> via one initialization seems cleaner to me because there are many other
> extension points that we initialize (add registering in the 
> default_exec_factory_registry
> similar to
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/plan.cc#L30).
> Having some of the initialization in Pyarrow and some in Acero is a bit
> complicated IMO.
>
> Maybe the proper long term fix is to add something similar to
> compute::default_exec_factory_registry for all extension points (named
> table, substrait extension message) and the user can register custom stuff
> similar to how they can register via default_exec_factory_registry?)
>
> On Thu, Oct 13, 2022 at 12:41 PM Weston Pace <weston.p...@gmail.com>
> wrote:
>
>> > Does that sound like a reasonable way to do this?
>>
>> It's not ideal.
>>
>> I may be assuming here but I think your problem is more that there is
>> no way to more flexibly describe a source in python and less that you
>> need to change the default.
>>
>> For example, if you could do something like this (in pyarrow) would it
>> work?
>>
>> ```
>> def custom_source(endpoint):
>>   return pc.Declaration("my_custom_source", create_my_custom_options())
>>
>> def table_provider(names):
>>   return custom_sources[names[0]]
>>
>> pa.substrait.run_query(my_plan, table_provider=table_provider)
>> ```
>>
>> On Thu, Oct 13, 2022 at 8:24 AM Li Jin <ice.xell...@gmail.com> wrote:
>> >
>> > We did some work around this recently and think there needs to be some
>> > small change to allow users to override this default provider. I will
>> > explain in more details:
>> >
>> > (1) Since the variable is defined as static in the substrait/options.h
>> > file, each translation unit will have a separate copy of the
>> > kDefaultNamedTableProvider
>> > variable. And therefore, the user cannot really change the default that
>> is
>> > used here:
>> >
>> https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125
>> >
>> > In order to allow user to override the kDefaultNamedTableProvider (and
>> > change the behavior of
>> >
>> https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125
>> > to use a custom NamedTableProvider), we need to
>> > (1) in substrait/options.hh, change the definition of
>> > kDefaultNamedTableProvider to be an extern declaration
>> > (2) move the definition of kDefaultNamedTableProvider to an
>> > substrait/options.cc file
>> >
>> > We are still testing this but based on my limited C++ knowledge, I
>> > think this would allow users to do
>> > """
>> > #include "arrow/engine/substrait/options.h"
>> >
>> > void initialize() {
>> >     arrow::engine::kDefaultNamedTableProvider =
>> > some_custom_name_table_provider;
>> > }
>> > """
>> > And then calling `pa.substrat.run_query" should pick up the custom name
>> > table provider.
>> >
>> > Does that sound like a reasonable way to do this?
>> >
>> >
>> >
>> >
>> > On Tue, Sep 27, 2022 at 1:59 PM Li Jin <ice.xell...@gmail.com> wrote:
>> >
>> > > Thanks both. I think NamedTableProvider is close to what I want, and
>> like
>> > > Weston said, the tricky bit is how to use a custom NamedTableProvider
>> when
>> > > calling the pyarrow substrait API.
>> > >
>> > > It's a little hacky but I *think* I can override the value
>> "kDefaultNamedTableProvider"
>> > > here and pass "table_provider=None" then it "should" work:
>> > >
>> > >
>> https://github.com/apache/arrow/blob/529f653dfa58887522af06028e5c32e8dd1a14ea/cpp/src/arrow/engine/substrait/options.h#L66
>> > >
>> > > I am going to give that a shot once I pull/build Arrow default into
>> our
>> > > internal build system.
>> > >
>> > >
>> > >
>> > >
>> > > On Tue, Sep 27, 2022 at 10:50 AM Benjamin Kietzman <
>> bengil...@gmail.com>
>> > > wrote:
>> > >
>> > >> It seems to me that your use case could be handled by defining a
>> custom
>> > >> NamedTableProvider and
>> > >> assigning this to ConversionOptions::named_table_provider. This was
>> added
>> > >> in
>> > >> https://github.com/apache/arrow/pull/13613 to provide user
>> configurable
>> > >> dispatching for named tables;
>> > >> if it doesn't address your use case then we might want to create a
>> JIRA to
>> > >> extend it.
>> > >>
>> > >> On Tue, Sep 27, 2022 at 10:41 AM Li Jin <ice.xell...@gmail.com>
>> wrote:
>> > >>
>> > >> > I did some more digging into this and have some ideas -
>> > >> >
>> > >> > Currently, the logic for deserialization named table is:
>> > >> >
>> > >> >
>> > >>
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/engine/substrait/relation_internal.cc#L129
>> > >> > and it will look up named tables from a user provided dictionary
>> from
>> > >> > string -> arrow Table.
>> > >> >
>> > >> > My idea is to make some short term changes to allow named tables
>> to be
>> > >> > dispatched differently (This logic can be reverted/removed once we
>> > >> figure
>> > >> > out the proper way to support custom data sources, perhaps via
>> substrait
>> > >> > Extensions.), specifically:
>> > >> >
>> > >> > (1) The user creates named table with uris for custom data source,
>> i.e.,
>> > >> > "my_datasource://tablename?begin=20200101&end=20210101"
>> > >> > (2) In the substrait consumer, allowing user to register custom
>> dispatch
>> > >> > rules based on uri scheme (similar to how exec node registry
>> works),
>> > >> i.e.,
>> > >> > sth like:
>> > >> >
>> > >> > substrait_named_table_registry.add("my_datasource",
>> deser_my_datasource)
>> > >> > and deser_my_datasource is a function that takes the NamedTable
>> > >> substrait
>> > >> > message and returns a declaration.
>> > >> >
>> > >> > I know doing this just for named tables might not be a very general
>> > >> > solution but seems the easiest path forward, and we can always
>> remove
>> > >> this
>> > >> > later in favor of a more generic solution.
>> > >> >
>> > >> > Thoughts?
>> > >> >
>> > >> > Li
>> > >> >
>> > >> >
>> > >> >
>> > >> >
>> > >> >
>> > >> > On Mon, Sep 26, 2022 at 10:58 AM Li Jin <ice.xell...@gmail.com>
>> wrote:
>> > >> >
>> > >> > > Hello!
>> > >> > >
>> > >> > > I am working on adding a custom data source node in Acero. I
>> have a
>> > >> few
>> > >> > > previous threads related to this topic.
>> > >> > >
>> > >> > > Currently, I am able to register my custom factory method with
>> Acero
>> > >> and
>> > >> > > create a Custom source node, i.e., I can register and execute
>> this
>> > >> with
>> > >> > > Acero:
>> > >> > >
>> > >> > > MySourceNodeOptions source_options = ...
>> > >> > > Declaration source{"my_source", source_option}
>> > >> > >
>> > >> > > The next step I want to do is to pass this through to the Acero
>> > >> substrait
>> > >> > > consumer. From previous discussions, I am going to use
>> "NamedTable ''
>> > >> as
>> > >> > a
>> > >> > > temporary way to define my custom data source in substrait. My
>> > >> question
>> > >> > is
>> > >> > > this:
>> > >> > >
>> > >> > > What I need to do in substrait in order to register my own
>> substrait
>> > >> > > consumer rule/function for deserializing my custom named table
>> > >> protobuf
>> > >> > > message into the declaration above. If this is not supported
>> right
>> > >> now,
>> > >> > > what is a reasonable/minimal change to make this work?
>> > >> > >
>> > >> > > Thanks,
>> > >> > > Li
>> > >> > >
>> > >> >
>> > >>
>> > >
>>
>

Reply via email to