tespent commented on issue #1103: URL: https://github.com/apache/datafusion-python/issues/1103#issuecomment-2800371392
> I am concerned about the table providers, though. I think any implementation will need to get the table provider to provide record batches efficiently.

A small correction: the trait that produces record batches is `ExecutionPlan`. So table providers can easily be written in Python without runtime cost.

> I guess that means your python implementation will return pyarrow record batch reader.

Yes. This is of course not as efficient as doing it in Rust, but it is not a performance issue for me. I'm afraid I cannot share our internal code, but let me explain:

I am running the execution plan on top of Ray Data [`Dataset.map_batches`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map_batches.html#ray.data.Dataset.map_batches), where the UDF callable can be defined as `def __call__(batches: Iterable[pyarrow.Table]) -> Iterable[pyarrow.Table] | pyarrow.Table`. I hand-wrote an execution plan in Python that gathers data from the input `batches` (calling `yield from table.to_batches()` for each input item) and wraps the output of the execution plan, converting it from `Iterable[pyarrow.RecordBatch]` to `Iterable[pyarrow.Table]`. The input iterable is generated by upstream Ray Data operators (and map transformers) using Python generator functions. This is why the non-native performance of the data source `ExecutionPlan` is not an issue for me.

Additionally, in our specific scenario, CPU performance is not a common bottleneck, as the pipeline typically involves heavy I/O, decoding, or GPU inference.

In conclusion, creating a Rust wrapper of `ExecutionPlan` specifically for my project is feasible, but I think a shared one could be reused by anyone else facing a similar issue.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org