Hoeze commented on issue #1273:
URL: 
https://github.com/apache/arrow-datafusion/issues/1273#issuecomment-963565892


   As an interested user, I would highly appreciate it if the DataFusion project 
kept distributed query execution as a first-class citizen, in the hope 
that at some point Ballista will replace my PySpark setup.
   
   - Keeping distributed computation in mind forces the project to use scalable 
solutions that make good use of available resources
   - I can just add more machines to my cluster when I need to run my query on 
a larger scale
   
   Those two points are the main reasons why I turned away from pandas, Polars, 
etc. to PySpark.
   Even if PySpark is not as fast as e.g. Polars, it does an awesome job at 
resource management. If I run my notebooks on my laptop, they take ~20x longer 
but they still complete the job.
   pandas, by comparison, simply fails with an out-of-memory (OOM) error when 
the final dataframe is larger than memory.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


