yahoNanJing commented on issue #30: URL: https://github.com/apache/arrow-ballista/issues/30#issuecomment-1134117646
> 1. How the ballista cluster to be deployed

Since k8s is so common and popular today, I think we should support it natively. As Spark does for batch processing, we can provide a way to deploy Ballista on k8s **on demand**.

> 2. Should the ballista cluster work in a long running way

I prefer the cluster to be a standalone, long-running system. For a long-running system, we have to pay close attention to several aspects, such as avoiding memory leaks and managing historical state. Spark, by contrast, originally focused on batch processing rather than on long-running systems. Its cluster is usually destroyed after the batch processing finishes, so it doesn't need to pay much attention to these long-running concerns.

> 3. Should the tasks work in a long running way

For streaming engines like Flink, tasks always run in a long-running way. That brings other challenges. This is not where my team's interests lie. We will focus on latency-sensitive interactive queries.

> 4. Which way to do the data exchange, push-based or pull-based

The pull-based way may not be as efficient as the push-based way. However, the push-based way needs the whole task pipeline topology to be determined before task execution. With the pull-based way, which Spark employs, AQE can be introduced to make it possible to change the whole query plan adaptively during query execution. Every coin has two sides. Therefore, I propose to implement both.
- Pull-based: good for ad-hoc queries
- Push-based: good for latency-sensitive interactive queries or tasks

> 5. Should the exchanged data be flushed to disk

It also depends. Flushing to disk is good for error recovery and makes memory management easier. However, it's not efficient. Therefore, I also propose to implement both.
- Flushing to disk: good for batch processing with pull-based data exchange
- Not flushing to disk: good for latency-sensitive queries with push-based data exchange
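To make the trade-off in points 4 and 5 a bit more concrete, here is a rough, self-contained Python sketch. It is only an illustration under my own assumptions: none of the names (`produce_batches`, `pull_based_exchange`, `push_based_exchange`) are Ballista APIs, and a temp file plus an in-memory queue merely stand in for real shuffle files and network channels.

```python
import json
import os
import queue
import tempfile
import threading


def produce_batches():
    """Hypothetical upstream task: yields three small batches of rows."""
    for start in range(0, 6, 2):
        yield [start, start + 1]


def pull_based_exchange(shuffle_dir):
    """Upstream flushes its output to disk; downstream pulls it afterwards."""
    path = os.path.join(shuffle_dir, "partition-0.json")
    with open(path, "w") as f:
        json.dump(list(produce_batches()), f)
    # Stage boundary: upstream has finished and its output is durable, so a
    # scheduler could inspect it and re-plan the next stage (AQE-style), or
    # re-run only the downstream task after a failure.
    with open(path) as f:
        batches = json.load(f)
    return sum(sum(batch) for batch in batches)


def push_based_exchange():
    """Upstream pushes batches into an in-memory channel; nothing hits disk."""
    channel = queue.Queue(maxsize=2)  # bounded: a slow consumer blocks the producer
    result = []

    def consumer():
        total = 0
        while True:
            batch = channel.get()
            if batch is None:  # end-of-stream marker
                break
            total += sum(batch)
        result.append(total)

    # The consumer has to exist before the producer starts, i.e. the
    # pipeline topology is fixed before execution.
    worker = threading.Thread(target=consumer)
    worker.start()
    for batch in produce_batches():
        channel.put(batch)
    channel.put(None)
    worker.join()
    return result[0]


with tempfile.TemporaryDirectory() as shuffle_dir:
    pull_total = pull_based_exchange(shuffle_dir)
push_total = push_based_exchange()
print(pull_total, push_total)
```

Both paths compute the same result. The difference is where the data lives between stages: the pull-based path pays for a disk write but leaves a point at which the plan can change or a task can be retried, while the push-based path streams batches immediately but requires the downstream consumer to be wired up in advance.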
