houqp edited a comment on pull request #972:
URL: https://github.com/apache/arrow-datafusion/pull/972#issuecomment-923613899


   Thanks @rdettai for the clarification. After thinking more on this, I agree 
with your conclusion that specifying these configs at the TableProvider 
constructor level would be a more flexible design. But for a different reason ;)
   
   With regards to table specific configs like 
`ExecutionConfig.parquet.target_partition_size` I think we can still get it 
work by passing `ExecutionConfig` to the scan method and let individual 
implementation pick which config value to read from.
   
   But passing in scan configuration at the `TableProvider` constructor level 
would effectively make these configs accessible from a larger portion of the 
query execution process. If we only pass them in to the scan method, then it 
will only be accessible starting from the physical planner. So I think it does 
give us more room for optimization in the future. Although I can't think of any 
specific optimization we could do with this info during logical planning.
   
   Just to make sure we are on the same page here, the proposed short term 
design looks like the following:
   * Keep the `TableProvider::scan` API and `ParquetTable::try_new` API the 
same as what they are in the master branch right now. 
   * Store `target_partitions` config in `ParquetTable` struct so it can be 
accessed later during `ParquetTable::scan` call. 
   * Serialize `target_partitions` config in ballista protobuf to remove the 
hard coded value. 
   
   Then we can work on follow up PRs to migrate the code base over to center 
the scanning logic around partition size instead of partition count.
   
   NOTE: with regards to file format specific scan config, I think we might 
need to come up with a more dynamic design instead of using static fields in 
`ExecutionConfig` struct, otherwise it would make it very hard for people to 
add custom formats as plugins.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to