frobnitzem edited a comment on issue #1273:
URL: https://github.com/apache/arrow-datafusion/issues/1273#issuecomment-965654603


   I am coming to this discussion from a different perspective.  I recently 
wrote [mpi-list](https://github.com/frobnitzem/mpi_list), a 
[pysparkling](https://github.com/svenkreiss/pysparkling)-inspired RDD that 
distributes list slices over MPI.  Each process works on its local sub-list 
of elements.  Usually, each element is a dataframe.
   
   However, loading pandas dataframes quickly uses up memory, so I'm 
investigating Arrow as a memory-friendly replacement.  The trouble I am running 
into is that DataFusion might have more functionality than I need.  My CSV 
files are already split up (many per process), and I already have processes 
running on an existing cluster via MPI.  So I want to execute a SQL query once 
for each CSV file and create a new result dataset distributed the same way as 
the original.
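
A minimal sketch of the per-process pattern described above, in plain Python. The names `local_slice`, `map_local`, and `query_one` are hypothetical and not taken from mpi-list or DataFusion. The contiguous block split is also an assumption about how the list slices are assigned. In practice `rank` and `size` would come from the MPI communicator, and `query_one` would wrap whatever engine runs the SQL against a single CSV file.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def local_slice(paths: Sequence[str], rank: int, size: int) -> List[str]:
    """Return the contiguous block of `paths` owned by process `rank` of `size`.

    Assumption: files are assigned as contiguous slices, so concatenating
    the slices of ranks 0..size-1 reproduces the original list order.
    """
    n = len(paths)
    start = (n * rank) // size
    stop = (n * (rank + 1)) // size
    return list(paths[start:stop])


def map_local(
    paths: Sequence[str],
    rank: int,
    size: int,
    query_one: Callable[[str], T],
) -> List[T]:
    """Apply `query_one` once per local file.

    The result list has one entry per local input file, so the output
    dataset is distributed across processes exactly like the input.
    `query_one` is a hypothetical callback: it receives one CSV path and
    returns that file's query result (for example, an Arrow table).
    """
    return [query_one(path) for path in local_slice(paths, rank, size)]
```

No communication between processes is needed here, because every process can compute its own slice from the full file list, its rank, and the process count.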



