timsaucer commented on issue #1032: URL: https://github.com/apache/datafusion-python/issues/1032#issuecomment-2675237730
For the high-level abstractions, I believe these are already met. The DataFrame API is available and widely used (in fact, it's the only way I personally use it). The [common operations online documentation](https://datafusion.apache.org/python/user-guide/common-operations/index.html) has a handful of sub-pages that describe usage of the API, and it is also covered in the [API reference](https://datafusion.apache.org/python/autoapi/datafusion/dataframe/index.html#datafusion.dataframe.DataFrame). DataFusion already uses lazy evaluation.

Integration with Pandas and Polars is already supported and is described on the [data sources](https://datafusion.apache.org/python/user-guide/data-sources.html) page. As for efficient batch processing that leverages Arrow's memory format, that is how DataFusion operates currently.

For the PyO3 interface, I'm not familiar with what optimizations you have in mind to reduce overhead. I'd be curious where you think we have issues currently. I'd also love to hear if you have ideas about optimizing the data movement between Python and Rust. This is a difficult problem, but we do already leverage the pyarrow FFI interface to avoid many of the data translation inefficiencies.

Parallel execution is also already supported, but there are additional efforts like `datafusion-ray` and `ballista` where we push the envelope much further by going into distributed processing. Those are under heavy, active development right now and are also a very good place to make contributions.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org