Re: [I] Add CatalogProvider API [datafusion-python]

via GitHub Sun, 13 Apr 2025 20:17:51 -0700


tespent commented on issue #1103:
URL: 
https://github.com/apache/datafusion-python/issues/1103#issuecomment-2800371392


   > I am concerned about the table providers, though. I think any 
implementation will need to get the table provider to provide record batches 
efficiently.
   
   A small correction: the trait that produces record batches is 
`ExecutionPlan`. So table providers can easily be written in Python without 
runtime cost.
   
   > I guess that means your python implementation will return pyarrow record 
batch reader.
   
   Yes. This is of course not as efficient as doing it in Rust, but 
performance is not an issue for me. I'm afraid I cannot share our internal 
code, but let me explain:
   
   I am running the execution plan on top of Ray Data 
[`Dataset.map_batches`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map_batches.html#ray.data.Dataset.map_batches),
 where the UDF callable can be defined as `def __call__(batches: 
Iterable[pyarrow.Table]) -> Iterable[pyarrow.Table] | pyarrow.Table`. I 
hand-wrote an execution plan in Python that gathers data from the input 
`batches` (calling `yield from table.to_batches()` for each input item) and 
wraps the output of the execution plan, converting it from 
`Iterable[pyarrow.RecordBatch]` to `Iterable[pyarrow.Table]`.
   
   The input iterable is generated by upstream Ray Data operators (and map 
transformers) using Python generator functions. This is why the non-native 
performance of the data source ExecutionPlan is not an issue for me. 
Additionally, in our specific scenario, CPU performance is rarely the 
bottleneck, as the pipeline typically involves heavy I/O, decoding, or GPU 
inference. In conclusion, creating a Rust wrapper of ExecutionPlan 
specifically for my project is feasible, but I think a shared implementation 
could be reused by anyone else facing a similar issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


