[GitHub] [arrow-ballista] dariusgm commented on issue #30: [Discuss] Ballista Future Direction

GitBox Tue, 14 Jun 2022 13:44:20 -0700


dariusgm commented on issue #30:
URL: https://github.com/apache/arrow-ballista/issues/30#issuecomment-1155694167


   What are the downsides of Apache Spark, and why should somebody use Ballista?
   
   IMO, from the posts I have read and a talk I watched on YouTube, Spark's 
memory consumption is huge. Even a hello world in Spark consumes a lot of 
memory. Reduced memory consumption could be a big advantage for Ballista. I 
regularly have to increase the memory of our production Spark jobs, sometimes 
to as much as 32GB per executor.
   
   What does Apache Spark do well?
   
   It integrates very well with the Hadoop Distributed File System. From my 
perspective this is the big advantage: the computation takes place as close 
to the data as possible, and data is moved only when really required. Again, 
as far as I understand, this is currently not possible with Ballista?
   
   And, as I am a data engineer and do a lot of analytics, I really like the 
spark-shell for playing around with the data. 
   
   I just started using Rust, so I may not be much help with implementing 
features, but I know Spark quite well from a developer perspective, as I use 
it daily.
   
   What could be a direction: keep the data mostly or only in RAM in the 
cluster, reducing the (slow) HDD/SSD reads, and then run the computation on 
the Arrow in-memory data frames. That may be a niche to fit into. 
   
   


