rdettai commented on pull request #972: URL: https://github.com/apache/arrow-datafusion/pull/972#issuecomment-918986641
Thanks @alamb and @houqp for your insights and @xudong963 for quickly reacting to this feedback!

But I am still not 100% convinced 😃
- In the context of an engine like [Buzz](https://github.com/cloudfuse-io/buzz-rust), where the number of CPUs is meant to be fully elastic, I would prefer to specify a partition size and no target count. I understand that adding `target_partition_size` could be a later evolution, but it bothers me that `target_partitions` is not optional, because I wouldn't know what to specify for it.
- Spark currently accepts that no parallelism is hinted to the datasource, and in that case the datasource comes up with a partition count of its own. I find this behavior intuitive, but it might be because I have been educated to do so 😄

> I think of target_partitions as "target concurrency"

- I would say that there isn't a 1 to 1 equivalence between parallelism and partition number. Usually, the partition number can be much larger than the parallelism, and tasks for the extra partitions will be queued. So if we mean to hint a "target concurrency" to the table providers, I think we should name this configuration as such.
- This is a personal opinion, but I am usually skeptical of global parameters that are meant to be interpreted differently by different implementations.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
