Hey Adam, 

Good question. There are outstanding JIRAs to integrate Flight [1] and HTTP/FTP 
[2] into Datasets/Filesystems. There are also some JIRAs about various RDBMSes 
[3] that could perhaps be viewed through a Datasets lens as well.

Note that this work all proceeds in layers; for example, groupby/join are 
implemented in the C++ query engine. The work here would be to integrate things into 
either the C++ Datasets or Filesystems frameworks as appropriate (e.g., create 
client libraries for RDBMSes and integrate those into Datasets, or implement 
the appropriate Datasets interfaces to wrap Flight types) and those would then 
be picked up by the query engine. Anything in Substrait can proceed in 
parallel. 
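
To make the layering concrete, here is a rough sketch in plain Python. It is 
not Arrow code: Dataset, InMemoryDataset, FakeRemoteClient, RemoteDataset and 
group_sum are all made-up names, and the "batches" are just lists of dicts. 
The only point is that once a remote source is wrapped behind the same 
interface as a local one, the engine layer needs no changes.

```python
# Illustrative only: none of these names are Arrow APIs.
from collections import defaultdict
from typing import Dict, Iterator, List, Protocol

Batch = List[Dict[str, int]]  # stand-in for a record batch


class Dataset(Protocol):
    """Hypothetical minimal dataset interface: a stream of batches."""

    def scan(self) -> Iterator[Batch]: ...


class InMemoryDataset:
    """Local source, e.g. data that has already been read from files."""

    def __init__(self, batches: List[Batch]) -> None:
        self._batches = batches

    def scan(self) -> Iterator[Batch]:
        yield from self._batches


class FakeRemoteClient:
    """Stand-in for a remote client (a Flight client, an RDBMS driver, ...)."""

    def __init__(self, batches: List[Batch]) -> None:
        self._batches = batches

    def fetch(self) -> Iterator[Batch]:
        yield from self._batches


class RemoteDataset:
    """Adapter that exposes the client through the same scan() interface."""

    def __init__(self, client: FakeRemoteClient) -> None:
        self._client = client

    def scan(self) -> Iterator[Batch]:
        return self._client.fetch()


def group_sum(dataset: Dataset, key: str, value: str) -> Dict[int, int]:
    """Toy 'query engine' step: streaming group-by sum over any Dataset."""
    totals: Dict[int, int] = defaultdict(int)
    for batch in dataset.scan():
        for row in batch:
            totals[row[key]] += row[value]
    return dict(totals)
```

Here group_sum only ever calls scan(), so it works the same whether it is 
handed an InMemoryDataset or a RemoteDataset wrapping a client.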

[1]: https://issues.apache.org/jira/browse/ARROW-10524
[2]: https://issues.apache.org/jira/browse/ARROW-7594
[3]: https://issues.apache.org/jira/browse/ARROW-11670

-David

On Tue, Apr 12, 2022, at 15:51, Adam Lippai wrote:
> Hi,
>
> I saw really nice features like groupby and join developed recently.
> I like how Dataset is supported for joins and how streamed processing is
> gaining momentum in Arrow.
>
> Does Apache Arrow have the concept of remote datasets eg using Arrow
> Flight? Or will this happen directly using S3 and other protocols only? I
> know some work has started in Substrait, but that might be a whole new
> level of integration, hence my question focusing on data first.
>
> I was trying to browse the JIRA issues, but the future picture wasn't clear
> based on that
>
> Best regards,
> Adam Lippai
