frobnitzem commented on issue #1273:
URL: 
https://github.com/apache/arrow-datafusion/issues/1273#issuecomment-965654603


   I am coming to this discussion from a different perspective.  I recently 
wrote a [py-sparkling](https://github.com/svenkreiss/pysparkling)-inspired RDD 
that distributes list slices over MPI (https://github.com/frobnitzem/mpi_list).
   
   However, loading pandas DataFrames quickly wastes memory, so I'm 
investigating Arrow as a memory-friendly replacement.  The trouble I am running 
into is that DataFusion might have too much functionality for this.  My CSV 
files are already split up (many per process), and I already have processes 
running on an existing cluster via MPI.  So I want to execute SQL queries once 
for each CSV file and create a new result dataset distributed the same way as 
the original.
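
   To make the per-file pattern concrete, here is a minimal sketch. It is not DataFusion code: Python's built-in `sqlite3` module stands in for the SQL engine, and the helper names `query_csv` and `run_local` and the table name `t` are hypothetical, chosen only for illustration. Each process is assumed to already know its own list of CSV files.

```python
import csv
import sqlite3


def query_csv(path, sql, table="t"):
    """Load one CSV file into an in-memory table and run `sql` against it.

    Assumption: sqlite3 is only a stand-in for the real SQL engine.
    Columns are created untyped, so values are stored as text.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    con = sqlite3.connect(":memory:")
    cols = ", ".join(f'"{c}"' for c in header)
    marks = ", ".join("?" for _ in header)
    con.execute(f'CREATE TABLE "{table}" ({cols})')
    con.executemany(f'INSERT INTO "{table}" VALUES ({marks})', rows)
    result = con.execute(sql).fetchall()
    con.close()
    return result


def run_local(local_files, sql):
    """Run the same query once per local file.

    The result keeps one entry per input file, so the output is
    partitioned across processes the same way as the input.
    """
    return {path: query_csv(path, sql) for path in local_files}
```

   Each MPI process would call `run_local` with its own file list and the shared query text. No data moves between processes, which is the only part of the workflow this sketch is meant to show.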


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

