alamb commented on issue #1221: URL: https://github.com/apache/arrow-datafusion/issues/1221#issuecomment-968850180
> Can you please explain the "data locality" requirements a little more? I think for normal source tasks, which read data from remote storage (cloud storage or HDFS), there is no data locality. And for shuffle readers, which have to read data from all map tasks, there is no data locality either.

I was thinking of a plan such as the following. There may be cases when reshuffling between scan/filter and aggregate is worthwhile (e.g. to distribute the load better), but I think the cost of reshuffling will mostly end up dominating any savings:

```
        rest of plan
              │
              │
              │
┌ ─ ─ ─ ─ ─ ─ ┼ ─ ─ ─ ─ ─ ─ ─ ┐
              ▼
│   ┌───────────────────┐     │
    │   HashAggregate   │
│   └───────────────────┘     │
              │               │  Data is not reshuffled
│             │               │  between scan, filter and
              ▼              ◀ ─ ─ ─ ─ ─  aggregate
│   ┌───────────────────┐     │
    │      Filter       │
│   └───────────────────┘     │
              │
│             │               │
              ▼
│   ┌───────────────────┐     │
    │     TableScan     │
│   └───────────────────┘     │
└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
```
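
To make the tradeoff concrete, here is a minimal sketch of the kind of query this applies to. It assumes a recent DataFusion release (at the time of this issue the context type was `ExecutionContext` rather than `SessionContext`, and the exact API has shifted since); the file, table, and column names are made up for illustration:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Hypothetical parquet file and schema, for illustration only.
    ctx.register_parquet("t", "data.parquet", ParquetReadOptions::default())
        .await?;

    // The scan, the filter, and the *partial* aggregate can all run
    // pipelined within each partition (the dashed box in the diagram
    // above); the only data exchange needed is between the partial
    // aggregate and the final aggregate that merges per-partition results.
    let df = ctx
        .sql("SELECT region, SUM(amount) FROM t WHERE amount > 0 GROUP BY region")
        .await?;

    df.show().await?;
    Ok(())
}
```

Running `EXPLAIN` on a query like this should show the scan, filter, and partial aggregate in a single stage, with a repartition/exchange appearing only before the final aggregate.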