Hoeze edited a comment on issue #1273:
URL: https://github.com/apache/arrow-datafusion/issues/1273#issuecomment-963565892

As an interested user, I would highly appreciate it if the DataFusion project kept distributed query execution as a first-class citizen, in the hope that Ballista will at some point replace my PySpark setup.

- Keeping distributed computation in mind forces the project to use scalable solutions that make good use of available resources.
- I can just add more machines to my cluster when I need to run my query on a larger scale.

Those two points are the main reasons why I turned away from pandas, Polars, etc. and towards PySpark. Even if PySpark is not as fast as, e.g., Polars, it does an awesome job at resource management. If I run my scripts on my laptop, they take ~20x longer, but the job still completes. pandas, by comparison, just fails with an out-of-memory (OOM) error when an intermediate dataframe is larger than memory.
