houqp commented on pull request #972: URL: https://github.com/apache/arrow-datafusion/pull/972#issuecomment-923613899
Thanks @rdettai for the clarification. After thinking more on this, I agree with your conclusion that specifying these configs at the `TableProvider` constructor level would be a more flexible design, but for a different reason ;)

With regards to table-specific configs like `ExecutionConfig.parquet.target_partition_size`, I think we can still get them to work by passing `ExecutionConfig` to the scan method and letting each implementation pick which config value to read. But passing scan configuration in at the `TableProvider` constructor level effectively makes these configs accessible from a larger portion of the query execution process: if we only pass them to the scan method, they are only accessible starting from the physical planner. So constructor-level configs do give us more room for optimization in the future, although I can't think of any specific optimization we could do with this info during logical planning.

Just to make sure we are on the same page, the proposed short-term design looks like the following:

* Keep the `TableProvider::scan` API and the `ParquetTable::try_new` API the same as they are on the master branch right now.
* Store the `target_partitions` config in the `ParquetTable` struct so it can be accessed later during the `ParquetTable::scan` call (see the sketch below).
* Serialize the `target_partitions` config in the Ballista protobuf to remove the hard-coded value.

Then we can work on follow-up PRs to migrate the code base over to center the scanning logic around partition size instead of partition count.
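To make the second bullet concrete, here is a minimal sketch of the idea. The `ExecutionPlan` trait and `ParquetExec` struct below are simplified stand-ins, not DataFusion's actual signatures; the point is only that `target_partitions` is captured at construction time in `try_new` and read back later inside `scan`, so nothing downstream needs a hard-coded value:

```rust
use std::sync::Arc;

// Simplified stand-in for DataFusion's ExecutionPlan trait.
trait ExecutionPlan {
    fn output_partitions(&self) -> usize;
}

// Simplified stand-in for the physical parquet scan operator.
struct ParquetExec {
    partitions: usize,
}

impl ExecutionPlan for ParquetExec {
    fn output_partitions(&self) -> usize {
        self.partitions
    }
}

struct ParquetTable {
    path: String,
    // Stored at construction so scan() can use it without needing
    // access to ExecutionConfig at physical-planning time.
    target_partitions: usize,
}

impl ParquetTable {
    fn try_new(path: impl Into<String>, target_partitions: usize) -> Result<Self, String> {
        Ok(Self {
            path: path.into(),
            target_partitions,
        })
    }

    fn scan(&self) -> Result<Arc<dyn ExecutionPlan>, String> {
        // The partition count comes from the stored field rather than a
        // hard-coded constant; Ballista would only need to serialize
        // `target_partitions` alongside the table definition.
        Ok(Arc::new(ParquetExec {
            partitions: self.target_partitions,
        }))
    }
}

fn main() -> Result<(), String> {
    let table = ParquetTable::try_new("/data/example.parquet", 8)?;
    println!("scanning {}", table.path);
    let plan = table.scan()?;
    assert_eq!(plan.output_partitions(), 8);
    Ok(())
}
```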
