rdettai commented on pull request #972: URL: https://github.com/apache/arrow-datafusion/pull/972#issuecomment-918278359
Thanks @xudong963 for your work!

@Dandandan for me `target_partitions` and `max_partitions` are **two different things**:
- More and more distributed engines (especially in the cloud) are able to provision resources on demand according to the request. In that case, the total number of cores is not predefined. We don't want the `TableProvider` to *try to produce a predefined number of partitions*; instead we want it to apply its own partitioning strategy according to the data it has, and the scheduler will adapt the resources it spawns accordingly (up to a given limit, which would be the `max_partitions`).
- On the other hand, on a single-node engine or a preallocated cluster, we have a fixed amount of resources that we want to use as fully as possible, so we want our `TableProvider` to *try everything it can to reach that target partitioning*.

You can find an interesting reference on datasource partitioning in Spark according to configs [here](https://medium.com/swlh/building-partitions-for-processing-data-files-in-apache-spark-2ca40209c9b7). A summary of some interesting points:
- Different sources have different parameters: some have `min_partitions`, some pick up the number of cores, some use `spark.default.parallelism`...
- Personally, I find it somewhat counterintuitive that the Spark Parquet reader picks up `spark.default.parallelism`, which is a global boot-time config, and tries to break up the file into a number of partitions equal to that parallelism...

My conclusion would be the following: keep the partitioning configurations at the `TableProvider` level (you define them when you instantiate the datasource), to allow different implementations to have different options:
- Parquet can be split by row groups.
- CSV files do not have row groups, so we need to specify byte sizes to split the files.
- Some `TableProvider`s might have a partition abstraction that we could use as the parallelism.
- Other `TableProvider`s (like Iceberg) have well-defined row counts for each file, so we could ask them to split directly by row counts instead of byte sizes or numbers of row groups.

Whatever decision we make, I have a final remark, mostly about code hygiene: the `scan()` method is getting bloated as we add more and more parameters. If we decide to follow the direction this PR is currently taking (passing the parameter through the `scan()` method), I think we should build a `ScanConfigs` struct to gather options such as `batch_size` and `target_partitions` (a rough sketch follows below).
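To make that last point more concrete, here is a minimal Rust sketch of what such a grouped config could look like. This is illustrative only: the field set, the `FilePartitioning` enum, and the trimmed-down trait and its signature are assumptions made for the example, not the existing DataFusion API.

```rust
use std::sync::Arc;

/// Options that today travel as individual `scan()` parameters, grouped into
/// one struct so that adding a new knob does not break every `TableProvider`
/// implementation. Field names are illustrative.
#[derive(Debug, Clone, Default)]
pub struct ScanConfigs {
    /// Preferred batch size for the produced record batches.
    pub batch_size: Option<usize>,
    /// Partitioning target to aim for (fixed, preallocated resources).
    pub target_partitions: Option<usize>,
    /// Upper bound on partitions (elastic, on-demand resources).
    pub max_partitions: Option<usize>,
    /// Optional row-count limit pushed down from the query.
    pub limit: Option<usize>,
}

/// Per-provider partitioning strategies mentioned above: row groups for
/// Parquet, byte ranges for CSV, row counts for sources (like Iceberg)
/// that know them. Purely illustrative.
#[allow(dead_code)]
#[derive(Debug, Clone)]
pub enum FilePartitioning {
    ByRowGroups,
    ByByteSize(usize),
    ByRowCount(usize),
}

/// Simplified stand-ins for the real DataFusion types, just to keep the
/// sketch self-contained.
pub struct Expr;
pub struct ExecutionPlanStub;

/// A trimmed-down `TableProvider`-like trait showing the grouped config
/// being passed instead of a growing list of arguments.
#[allow(dead_code)]
pub trait TableProviderSketch {
    fn scan(
        &self,
        projection: &Option<Vec<usize>>,
        filters: &[Expr],
        configs: &ScanConfigs,
    ) -> Result<Arc<ExecutionPlanStub>, String>;
}

fn main() {
    // Building the config at the call site stays explicit and readable,
    // and unspecified options simply fall back to defaults.
    let configs = ScanConfigs {
        batch_size: Some(8192),
        target_partitions: Some(16),
        ..Default::default()
    };
    println!("{:?}", configs);
}
```

Grouping the options this way would let a new hint (e.g. a partitioning strategy) be added later without changing the `scan()` signature of every existing `TableProvider`, and each provider remains free to ignore hints it cannot honor.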
