alamb commented on pull request #972: URL: https://github.com/apache/arrow-datafusion/pull/972#issuecomment-919321915
@rdettai -- this is a good discussion: I think one core challenge in our understanding is that the term `partition` is overloaded with at least two meanings: 1. A portion of the data (e.g. a file) that has individual statistics and might be read / accessed differently 2. the `partition` output of a DataFusion `ExecutionPlan` which is more like a portion of the operator's output that can be read in parallel I think the parameter in this PR is referring to the second, even though the first would likely be a better direction to take DataFusion in the longer term. Perhaps related to https://github.com/apache/arrow-datafusion/issues/64 > in the context of an engine like Buzz, where the number of CPUs is meant to be fully elastic, I would prefer to specify a partition size and no target count. I can see how specifying the size of each output stream may make sense > Spark currently accepts that no parallelism is hinted to the datasource, and in that case the datasource comes up with a partition count of its own. I find this behavior intuitive but it might be because I have been educated to do so 😄 I think it also may be related to Spark's scheduler which I think can control the number of concurrently "active" partitions. DataFusion just starts them all at once. > I would say that there isn't a 1 to 1 equivalence between parallelism and partition number. I agree -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org