rdettai commented on pull request #972:
URL: https://github.com/apache/arrow-datafusion/pull/972#issuecomment-918278359


   Thanks @xudong963 for your work!
   
   @Dandandan for me `target_partitions` and `max_partitions` are **two 
different things**:
   - More and more distributed engines (especially in the cloud) are able to 
provision resources on demand according to the request, so the total number of 
cores is not predefined. In that case we don't want the `TableProvider` to 
*try to produce a given number of partitions*; instead it should have its own 
partitioning strategy according to the data it has, and the scheduler will 
adapt the resources it spawns accordingly (up to a limit that would be 
`max_partitions`).
   - On the other hand, on a single-node engine or a preallocated cluster, we 
have a fixed amount of resources that we want to use as fully as possible, so 
we want our `TableProvider` to *do everything it can to reach that target 
partitioning* (see the sketch after this list).
   
   You can find an interesting reference on how Spark partitions its 
datasources according to its configs 
[here](https://medium.com/swlh/building-partitions-for-processing-data-files-in-apache-spark-2ca40209c9b7). 
Here is a summary of some interesting points:
   - different sources have different parameters: some have `min_partitions`, 
some pick up the number of cores, some use `spark.default.parallelism`...
   - personally, I find it rather counter-intuitive that the Spark Parquet 
reader picks up `spark.default.parallelism`, which is a global boot-time 
config, and tries to break up the file into a number of partitions equal to 
that parallelism...
   
   My conclusion would be the following:
   - keep the partitioning configuration at the `TableProvider` level (you 
define it when you instantiate the datasource), so that different 
implementations can expose different options (see the sketch after this list):
     - For example, Parquet can be split by row groups
     - CSV files do not have row groups, so we need to specify byte sizes to 
split the files
     - Some `TableProvider`s might have a partition abstraction that we could 
use for parallelism.
     - Other `TableProvider`s (like Iceberg) will have well-defined row counts 
for each file, so we could ask them to split directly by row count instead of 
byte size or number of row groups
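   To make that concrete, here is a rough sketch of what format-specific 
options could look like when instantiating the datasource. All names below are 
illustrative only, not an existing or proposed API:

```rust
// Illustrative only: partitioning options chosen at TableProvider
// construction time, varying per file format.
pub enum FilePartitioning {
    /// Parquet: split along row group boundaries.
    ByRowGroups { row_groups_per_partition: usize },
    /// CSV: no row groups, so split by byte ranges.
    ByByteRange { bytes_per_partition: u64 },
    /// Sources like Iceberg that know exact per-file row counts.
    ByRowCount { rows_per_partition: usize },
}

// Each provider would then carry its own options, e.g.:
pub struct ParquetTableOptions {
    pub partitioning: FilePartitioning,
    // ...other Parquet-specific options
}
```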
   
   Whatever decision we make, I have a final remark, mostly about code 
hygiene: the `scan()` method is getting bloated as we add more and more 
parameters. If we decide to follow the direction this PR is currently taking 
(passing the parameter through the `scan()` method), I think we should build a 
`ScanConfigs` struct to gather options such as `batch_size` and 
`target_partitions`.
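   Something along these lines, for instance. The extra field and the 
commented-out signature are only a sketch of the idea, not a concrete proposal 
for the exact contents:

```rust
// Sketch of a ScanConfigs struct gathering the scan-time options; the exact
// field set is open for discussion.
pub struct ScanConfigs {
    pub batch_size: usize,
    pub target_partitions: usize,
    pub limit: Option<usize>,
    // new options can be added here without touching every scan() signature
}

// scan() would then take the struct instead of a growing argument list,
// roughly (shown as a comment since it depends on the final API):
//
// async fn scan(
//     &self,
//     projection: &Option<Vec<usize>>,
//     filters: &[Expr],
//     configs: &ScanConfigs,
// ) -> Result<Arc<dyn ExecutionPlan>>;
```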

