[GitHub] [arrow-datafusion] jon-chuang edited a comment on issue #1221: Task assignment between Scheduler and Executors

GitBox Fri, 14 Jan 2022 12:20:50 -0800


jon-chuang edited a comment on issue #1221:
URL: 
https://github.com/apache/arrow-datafusion/issues/1221#issuecomment-1013068782



   @yjshen thanks for your questions
   
   > task scheduling, keepalive monitoring, struggler detection, and 
speculative task execution\
   
   - yes. 
   - yes and failure recovery at task level. We also have a worker monitoring 
dashboard with basic resource utilization info.
   - we do not have robust tracing tools yet, but it is planned. As for 
scheduling, it does not currently take into account global information like 
straggling in an execution DAG and try to prioritize bottlenecked tasks. 
However, we are looking into priority mechanism for tasks, through which a user 
(or external monitoring tool) could prioritize bottlenecked tasks.
   - Note that Ray will always try to schedule tasks if there are resources 
available. So if the dataframe/SQL operation does not have an all-to-all 
dependency, it will automatically proceed to the next stage. We also have plans 
to preempt workers in anticipation of OOM. 
   
   > Therefore I could easily build a distributed SQL engine on top of 
DataFusion with little effort?
   
   This is unclear to me, and requires more investigation. However, note that 
the distributed dataframe project Modin was built on top of Ray.
   
   > the code to distribute and run is quite limited, it's all about 
DataFusion's limited number of physical operators.
   
   Yes. I think the use-case is perhaps for incremental and interactive SQL 
queries that can take advantage of low-latency scheduling. For instance, 
backend serving for many (> 10-100Kps) queries over a distributed dataset.
   
   I think these workloads might currently be out of scope for Ballista, which 
is aimed at analytics just like Spark is, but it is interesting to consider. 
   
   For instance, time series DBs and [Materialize 
DB](https://github.com/MaterializeInc/materialize) offer this sort of 
incremental SQL computation. Also consider something like NoriaDB which is 
optimized for read-heavy serving workloads and offers incremental SQL 
computation.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow-datafusion] jon-chuang edited a comment on issue #1221: Task assignment between Scheduler and Executors

Reply via email to