To do this without using network or storage, your processes have to be able to access the same memory.
Then, I think you want something like SessionContext::register_table [1], followed by a call to SessionContext::read_table [2]. Ultimately, instead of read_csv [3], which extracts data from a CSV file into a Result<DataFrame>, you want something that takes a table (Vec<RecordBatch>) and returns a Result<DataFrame>. After that, all the other commands can proceed as usual.

At least, that's my impression. However, I couldn't find a good description of how to define or use a TableProvider. Maybe you can parse the Custom Table Provider docs [4] better than I can.

[1]: https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.register_table
[2]: https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.read_table
[3]: https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.read_csv
[4]: https://arrow.apache.org/datafusion/library-user-guide/custom-table-providers.html

# ------------------------------
# Aldrin

https://github.com/drin/
https://gitlab.com/octalene
https://keybase.io/octalene

On Thursday, February 8th, 2024 at 13:43, Chak-Pong Chung <chakpongch...@gmail.com> wrote:

> Hi the arrow community,
>
> https://stackoverflow.com/q/77964825/1611102
>
> Trying to get some attention to this performance question.