[ 
https://issues.apache.org/jira/browse/ARROW-18063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617896#comment-17617896
 ] 

Weston Pace commented on ARROW-18063:
-------------------------------------

{quote}
Refactor NamedTableProvider from a lambda mapping names -> data source into a 
registry so that data source factories can be added from c++ then referenced by 
name from python
{quote}

I'm not sure this is exactly what has been proposed.  Instead, I think the idea 
is that the default named table provider is either a property of the 
ExecFactoryRegistry or part of some larger "AceroContext".  A user can then 
configure which named table provider to use by grabbing the default context and 
setting the named table provider, in the same way they grab the default 
registry today to add exec factories.

There are then no Python references or bindings needed at all.

I think this is a reasonable solution (I prefer AceroContext over MetaRegistry 
which was mentioned in the ML threads).

One then has to consider what happens if multiple calls are made to configure 
the default named table provider.  I think the simplest option would be to 
just overwrite it.  It might be slightly nicer to throw an error when setting 
the default named table provider if it has already been set.  There are more 
complex alternatives, such as a named table provider registry or a chain of 
named table providers, but I'm not sure they are needed in this case.
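The "throw if already set" alternative is a small change to the setter.  A self-contained sketch, again with hypothetical names (not Arrow API):

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for what a named table provider returns.
struct Declaration {};

using NamedTableProvider =
    std::function<Declaration(const std::vector<std::string>& names)>;

// A write-once slot for the default named table provider: the first
// Set() wins and any later Set() raises an error, instead of the
// "simplest option" of silently overwriting.
class DefaultProviderSlot {
 public:
  void Set(NamedTableProvider provider) {
    if (set_) {
      throw std::logic_error("default named table provider already set");
    }
    provider_ = std::move(provider);
    set_ = true;
  }

  bool is_set() const { return set_; }

 private:
  NamedTableProvider provider_;
  bool set_ = false;
};
```

The trade-off is that overwrite semantics are forgiving for interactive use, while write-once semantics surface configuration conflicts between two extensions loudly instead of letting the last loader win silently.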

CC [~icexelloss] to confirm.

> [C++][Python] Custom streaming data providers in {{run_query}}
> --------------------------------------------------------------
>
>                 Key: ARROW-18063
>                 URL: https://issues.apache.org/jira/browse/ARROW-18063
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Ben Kietzman
>            Priority: Major
>
> [Mailing list 
> thread|https://lists.apache.org/thread/r484sqrd6xjdd058prbrcwh3t5vg91so]
> The goal is to:
> - generate a substrait plan in Python using Ibis
> - ... wherein tables are specified using custom URLs
> - use the python API {{run_query}} to execute the plan
> - ... against source data which is *streamed* from those URLs rather than 
> pulled fully into local memory
> The obstacles include:
> - The API for constructing a data stream from the custom URLs is only 
> available in C++
> - The Python {{run_query}} function requires tables as input and cannot 
> accept a RecordBatchReader even if one could be constructed from a custom URL
> - Writing custom Cython is not preferred
> Some potential solutions:
> - Make {{ExecuteSerializedPlan()}} directly usable from C++ so that construction 
> of data sources need not be handled in Python. Passing a buffer from 
> Python/Ibis down to C++ is much simpler and can be navigated without writing 
> Cython
> - Refactor NamedTableProvider from a lambda mapping {{names -> data source}} 
> into a registry so that data source factories can be added from C++ then 
> referenced by name from Python
> - Extend {{run_query}} to support non-Table sources and require the user to 
> write a Python mapping from URLs to {{pa.RecordBatchReader}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
