To do this without using network or storage, your processes have to be able to 
access the same memory.

Then, I think you want to wrap your in-memory batches in a TableProvider and
either register it with SessionContext::register_table [1] or pass it directly
to SessionContext::read_table [2]. Ultimately, instead of read_csv [3], which
extracts data from a CSV file into a Result<DataFrame>, you want something that
takes a table (Vec<RecordBatch>) and returns a Result<DataFrame>. From there,
all the other commands can proceed as usual.

At least, that's my impression, but I couldn't find a good description of how
to define or use a TableProvider. Maybe you can parse the Custom Table Provider
docs [4] better than I can.


[1]: https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.register_table
[2]: https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.read_table
[3]: https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.read_csv
[4]: https://arrow.apache.org/datafusion/library-user-guide/custom-table-providers.html



# ------------------------------

# Aldrin


https://github.com/drin/
https://gitlab.com/octalene
https://keybase.io/octalene


On Thursday, February 8th, 2024 at 13:43, Chak-Pong Chung 
<chakpongch...@gmail.com> wrote:

> Hi the arrow community,
>
>
> https://stackoverflow.com/q/77964825/1611102
>
> Trying to get some attention to this performance question.
