[GitHub] [arrow-datafusion] Hoeze commented on issue #1273: Question: Is the Ballista project providing value to the overall DataFusion project?

GitBox Mon, 08 Nov 2021 12:57:15 -0800


Hoeze commented on issue #1273:
URL: 
https://github.com/apache/arrow-datafusion/issues/1273#issuecomment-963565892



   As an interested user, I would highly appreciate it if the DataFusion project 
would keep distributed query execution as a first-class citizen, in the hope 
that at some point Ballista will replace my PySpark setup.
   
   - Keeping distributed computation in mind forces the project to use scalable 
solutions that make good use of available resources
   - I can just add more machines to my cluster when I need to run my query on 
a larger scale
   
   Those two points are the main reasons why I turned away from pandas, Polars, 
etc. to PySpark.
   Even if PySpark is not as fast as e.g. Polars, it does an awesome job of 
resource management. If I run my notebooks on my laptop, it takes ~20x longer, 
but it will still complete the job.
   pandas, by comparison, just breaks with an OOM error when your final 
dataframe is larger than memory.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
