[GitHub] [arrow-ballista] yahoNanJing commented on issue #30: [Discuss] Ballista Future Direction

GitBox Sun, 22 May 2022 20:02:42 -0700


yahoNanJing commented on issue #30:
URL: https://github.com/apache/arrow-ballista/issues/30#issuecomment-1134117646


   > 1. How the ballista cluster to be deployed
   
   Since k8s is so common and popular today, I think we should support it 
natively. As Spark does for batch processing, we can provide a way to deploy 
Ballista on k8s **on demand**.
   
   > 2. Should the ballista cluster work in a long running way
   
   I prefer the cluster to be a standalone, long-running system. 
For a long-running system, we have to pay close attention to several aspects, 
such as avoiding memory leaks and managing historical state. Spark, by 
contrast, originally focused on batch processing rather than long-running 
systems. Its cluster is usually destroyed after the batch job finishes, so it 
doesn't need to pay much attention to these long-running concerns.
   
   > 3. Should the tasks work in a long running way
   
   For streaming engines like Flink, the tasks are always long running. 
That brings other challenges, and it is not where my team's interests lie. 
We will focus on latency-sensitive interactive queries.
   
   >  4. Which way to do the data exchange, push-based or pull-based
   
   The pull-based way may not be as efficient as the push-based way. However, 
the push-based way needs the whole task pipeline topology to be determined 
before task execution. With the pull-based way, which Spark employs, 
AQE (adaptive query execution) can be introduced, making it possible to change 
the whole query plan adaptively during query execution. Each approach has its 
trade-offs. Therefore, I propose to implement both.
   - Pull-based: good for ad-hoc queries
   - Push-based: good for latency-sensitive interactive queries or tasks
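
To make the contrast concrete, here is a minimal sketch in Python. It is not Ballista code: `pull_stage`, `PushStage`, `run_pull`, and `run_push` are hypothetical names, and the "operator" just doubles each row. The point is only who drives execution. In the pull version the consumer asks for rows, so the next stage could still be chosen after the upstream finishes. In the push version the downstream sink has to be wired up before the first row flows.

```python
from typing import Callable, Iterable, Iterator, List


def pull_stage(upstream: Iterable[int]) -> Iterator[int]:
    """Pull-based: the consumer drives; no work happens until it asks for a row."""
    for row in upstream:
        yield row * 2


class PushStage:
    """Push-based: the producer drives; the downstream sink must exist up front."""

    def __init__(self, downstream: Callable[[int], None]) -> None:
        self.downstream = downstream

    def on_row(self, row: int) -> None:
        # Forward each result immediately instead of waiting to be asked.
        self.downstream(row * 2)


def run_pull(rows: List[int]) -> List[int]:
    # The consumer (list) pulls rows through the generator one at a time.
    return list(pull_stage(rows))


def run_push(rows: List[int]) -> List[int]:
    out: List[int] = []
    stage = PushStage(out.append)  # topology is fixed before any row flows
    for row in rows:
        stage.on_row(row)
    return out
```

Both `run_pull([1, 2, 3])` and `run_push([1, 2, 3])` produce the same rows; what differs is when the pipeline shape has to be known.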
   
   > 5. Should the exchanged data be flushed to disk
   
   It also depends. Flushing to disk is good for error recovery and makes 
memory management easier. However, it's not efficient. Therefore, I also 
propose to implement both.
   - Flushing to disk: good for batch processing with pull-based data exchange
   - Not flushing to disk: good for latency-sensitive queries with push-based 
data exchange
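
The recovery trade-off can be sketched the same way. This is an illustration under assumptions, not Ballista's shuffle implementation: `DiskExchange` and `MemoryExchange` are hypothetical classes, and JSON lines in a temporary file stand in for a real on-disk format. The disk-backed exchange can be re-read if a consumer task fails and is retried, while the in-memory one hands each batch out exactly once.

```python
import json
import os
import tempfile
from collections import deque
from typing import Deque, Iterator, List


class DiskExchange:
    """Flushes every batch to a file, so a retried consumer can read it again."""

    def __init__(self) -> None:
        fd, self.path = tempfile.mkstemp(suffix=".jsonl")
        os.close(fd)

    def write(self, batch: List[int]) -> None:
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(batch) + "\n")

    def read(self) -> Iterator[List[int]]:
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)

    def close(self) -> None:
        os.remove(self.path)


class MemoryExchange:
    """Keeps batches in memory only; each batch is gone once it has been read."""

    def __init__(self) -> None:
        self.batches: Deque[List[int]] = deque()

    def write(self, batch: List[int]) -> None:
        self.batches.append(batch)

    def read(self) -> Iterator[List[int]]:
        while self.batches:
            yield self.batches.popleft()
```

Reading a `DiskExchange` twice returns the same batches both times, at the cost of file I/O on every write. Reading a `MemoryExchange` a second time returns nothing, so a failed consumer would force the producer to recompute.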


