frobnitzem commented on issue #1273: URL: https://github.com/apache/arrow-datafusion/issues/1273#issuecomment-965654603
I am coming to this discussion from a different perspective. I recently wrote a [py-sparkling](https://github.com/svenkreiss/pysparkling)-inspired RDD that distributes list slices over MPI (https://github.com/frobnitzem/mpi_list). However, loading pandas dataframes quickly wastes memory, so I'm investigating Arrow as a memory-friendly replacement. The trouble I am running into is that DataFusion might have more functionality than I need. My CSV files are already split up (many per process), and I already have processes running on an existing cluster via MPI. So I want to execute a SQL query once for each CSV file and create a new result dataset distributed the same way as the original.
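
To make the per-file pattern concrete, here is a minimal sketch that uses only the Python standard library. It is not DataFusion or Arrow code: `sqlite3` stands in for the SQL engine, and `rank` and `size` are plain integers standing in for the values an MPI communicator would supply. The helper names (`files_for_rank`, `query_csv`, `run_local`) and the table name `t` are hypothetical, chosen for illustration.

```python
import csv
import sqlite3


def files_for_rank(paths, rank, size):
    """Round-robin assignment: rank r takes paths r, r+size, r+2*size, ..."""
    return paths[rank::size]


def query_csv(path, sql, table="t"):
    """Load one CSV file into an in-memory table named `table`, then run `sql`.

    sqlite3 is only a stand-in for the real engine. All columns are loaded
    as text, so numeric comparisons need an explicit CAST in the query.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    con = sqlite3.connect(":memory:")
    try:
        # Quote column names so arbitrary CSV headers are accepted.
        cols = ", ".join('"{}"'.format(c.replace('"', '""')) for c in header)
        marks = ", ".join("?" for _ in header)
        con.execute(f'CREATE TABLE "{table}" ({cols})')
        con.executemany(f'INSERT INTO "{table}" VALUES ({marks})', rows)
        cur = con.execute(sql)
        names = [d[0] for d in cur.description]
        return names, cur.fetchall()
    finally:
        con.close()


def run_local(paths, sql, rank, size):
    """Run `sql` once per local file; results stay keyed by their input file,
    so the output is distributed the same way as the input."""
    return {p: query_csv(p, sql) for p in files_for_rank(paths, rank, size)}
```

Each process would call `run_local(all_paths, query, rank, size)` with its own rank, and no data moves between processes. With Arrow in place of the row lists, each per-file result would be a record batch rather than a list of tuples, but the distribution logic would be unchanged.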