alamb commented on issue #1221: URL: https://github.com/apache/arrow-datafusion/issues/1221#issuecomment-968850180
> Can you please explain the "data locality" requirements a little more? I think for normal source tasks, which read data from remote storage (cloud storage or HDFS), there is no data locality. And for shuffle readers, which have to read data from all map tasks, there is no data locality either.

I was thinking of a plan such as the following. There may be cases when reshuffling between scan/filter and aggregate is worthwhile (e.g. to distribute the load better), but I think the cost of reshuffling will mostly end up dominating any savings:

```
        rest of plan
              │
              │
              │
┌ ─ ─ ─ ─ ─ ─ ┼ ─ ─ ─ ─ ─ ─ ─ ┐
              ▼
│   ┌───────────────────┐     │
    │   HashAggregate   │
│   └───────────────────┘     │
              │               │  Data is not reshuffled
│             │               │  between scan, filter and
              ▼              ◀ ─ ─ ─ ─ ─  aggregate
│   ┌───────────────────┐     │
    │      Filter       │
│   └───────────────────┘     │
              │
│             │               │
              ▼
│   ┌───────────────────┐     │
    │     TableScan     │
│   └───────────────────┘     │
└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
```
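
To make the tradeoff concrete, here is a minimal sketch of the kind of query this applies to. It assumes a recent DataFusion release (at the time of this issue the context type was `ExecutionContext` rather than `SessionContext`, and the exact API has shifted since); the file, table, and column names are made up for illustration:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Hypothetical parquet file and schema, for illustration only.
    ctx.register_parquet("t", "data.parquet", ParquetReadOptions::default())
        .await?;

    // The scan, the filter, and the *partial* aggregate can all run
    // pipelined within each partition (the dashed box in the diagram
    // above); the only data exchange needed is between the partial
    // aggregate and the final aggregate that merges per-partition results.
    let df = ctx
        .sql("SELECT region, SUM(amount) FROM t WHERE amount > 0 GROUP BY region")
        .await?;

    df.show().await?;
    Ok(())
}
```

Running `EXPLAIN` on a query like this should show the scan, filter, and partial aggregate in a single stage, with a repartition/exchange appearing only before the final aggregate.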