alamb commented on pull request #972:
URL: https://github.com/apache/arrow-datafusion/pull/972#issuecomment-919321915


   @rdettai -- this is a good discussion.
   
   I think one core challenge in our understanding is that the term `partition` 
is overloaded with at least two meanings:
   1. A portion of the data (e.g. a file) that has its own statistics and 
might be read / accessed differently
   2. The `partition` output of a DataFusion `ExecutionPlan`, which is more like 
a portion of the operator's output that can be read in parallel
   
   I think the parameter in this PR refers to the second meaning, even though 
the first would likely be a better direction to take DataFusion in the longer 
term. This is perhaps related to https://github.com/apache/arrow-datafusion/issues/64
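
To make the two meanings concrete, here is a minimal sketch in plain Rust. 
The names `DataPartition`, `OutputPartition` and `assign_round_robin` are 
hypothetical and are not DataFusion APIs; the round-robin policy is only an 
assumption chosen for illustration.

```rust
/// Meaning 1 (hypothetical type): a portion of the stored data, e.g. one
/// file, that carries its own statistics.
#[derive(Debug, Clone, PartialEq)]
pub struct DataPartition {
    pub path: String,
    pub num_rows: usize,
}

/// Meaning 2 (hypothetical type): one output stream of an operator that can
/// be read in parallel with the other output streams.
#[derive(Debug, Clone, PartialEq)]
pub struct OutputPartition {
    pub files: Vec<DataPartition>,
}

/// Spread data partitions (meaning 1) over `target` output partitions
/// (meaning 2) in round-robin order. A `target` of 0 is treated as 1.
pub fn assign_round_robin(files: Vec<DataPartition>, target: usize) -> Vec<OutputPartition> {
    let target = target.max(1);
    let mut out: Vec<OutputPartition> = (0..target)
        .map(|_| OutputPartition { files: Vec::new() })
        .collect();
    for (i, file) in files.into_iter().enumerate() {
        out[i % target].files.push(file);
    }
    out
}
```

With five files and a target of two, the first output partition would hold 
three files and the second two, so the number of data partitions and the 
number of output partitions are independent of each other.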
   
   > in the context of an engine like Buzz, where the number of CPUs is meant 
to be fully elastic, I would prefer to specify a partition size and no target 
count. 
   
   I can see how specifying the size of each output stream may make sense.
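
As a rough sketch of that idea, the output partition count could be derived 
from a requested partition size rather than passed in as a target. The 
function name `partition_count_for_size` and the choice of rows as the unit 
are assumptions for illustration, not part of this PR.

```rust
/// Number of output partitions needed so that no partition holds more than
/// `rows_per_partition` rows (ceiling division). Hypothetical helper.
/// A `rows_per_partition` of 0 is treated as 1; empty input yields 0.
pub fn partition_count_for_size(total_rows: usize, rows_per_partition: usize) -> usize {
    if total_rows == 0 {
        return 0;
    }
    let size = rows_per_partition.max(1);
    // Ceiling division without the overflow risk of `total_rows + size - 1`.
    (total_rows - 1) / size + 1
}
```

Under this scheme the count follows from the data volume, which fits an 
engine where the number of CPUs is elastic.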
   
   >  Spark currently accepts that no parallelism is hinted to the datasource, 
and in that case the datasource comes up with a partition count of its own. I 
find this behavior intuitive but it might be because I have been educated to do 
so 😄
   
   I think it may also be related to Spark's scheduler, which I believe can 
control the number of concurrently "active" partitions. DataFusion just starts 
them all at once. 
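
The difference can be sketched with plain threads. This is not how Spark or 
DataFusion is implemented; `run_with_limit` is a hypothetical helper, and 
summing a `Vec<u64>` stands in for reading one partition. At most 
`max_active` partitions are processed at a time; passing the partition count 
as `max_active` models starting them all at once.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Process every partition, with at most `max_active` running concurrently.
/// Returns one result (here: the sum of the rows) per partition, in input order.
pub fn run_with_limit(partitions: Vec<Vec<u64>>, max_active: usize) -> Vec<u64> {
    let n = partitions.len();
    // Shared work queue of (partition index, rows).
    let queue = Arc::new(Mutex::new(
        partitions.into_iter().enumerate().collect::<Vec<_>>(),
    ));
    let results = Arc::new(Mutex::new(vec![0u64; n]));
    // Never spawn more workers than there are partitions, and at least one.
    let workers = max_active.max(1).min(n.max(1));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let queue = Arc::clone(&queue);
            let results = Arc::clone(&results);
            thread::spawn(move || loop {
                // The queue lock is released at the end of this statement.
                let next = queue.lock().unwrap().pop();
                match next {
                    Some((idx, rows)) => {
                        let sum: u64 = rows.iter().sum();
                        results.lock().unwrap()[idx] = sum;
                    }
                    None => break, // queue drained, worker exits
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    let out = results.lock().unwrap().clone();
    out
}
```

Here the partition count only determines how the work is split, while 
`max_active` determines the parallelism, so the two can be chosen separately.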
   
   > I would say that there isn't a 1 to 1 equivalence between parallelism and 
partition number. 
   
   I agree.

