alamb commented on pull request #972: URL: https://github.com/apache/arrow-datafusion/pull/972#issuecomment-918407791
@rdettai > More and more distributed engines (especially in the cloud) are able to provision resources on demand according to the request. In that case, the total number of cores is not predefined. We don't want the TableProvider to try to give a defined number of partitions, but instead we want it to have its own partitioning strategy according to the data it has... I think of `target_partitions` as "target concurrency" (aka how many cores could you possibly keep busy at any time) and I think like you believe that the decisions of how to actually split up data across the partitions exposed to DataFusion should be an implementation detail of the `TableProvider`. So a system that uses DataFusion would set `target_partitions` based on how many CPU resources it wanted DataFusion to try and consume. Various TableProvider implementations could take that target into advisement as they divided up their data. > My conclusion would be the following: > > keep the partitioning configurations at the TableProvider level (you define them when you instantiate the datasource), to allow different implementations to have different options. I think I would phrase it differently as "keep the choice of distribution of data across partitions at the `TableProvider` level" -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
