alamb opened a new issue, #4349: URL: https://github.com/apache/arrow-datafusion/issues/4349
# Related

https://github.com/apache/arrow-ballista/issues/479
https://github.com/apache/arrow-datafusion/issues/4346

# Introduction

"Configuration" in DataFusion has a few use cases:

* set a value from a string provided by the user (`set XX = YY` in datafusion-cli)
* set a value from environment variables (see https://docs.rs/datafusion/14.0.0/datafusion/config/struct.ConfigOptions.html#method.from_env)
* display as a string (e.g. `SHOW` in datafusion-cli)
* documented in the DataFusion docs
* serialize / deserialize over the network (e.g. Ballista)
* settable programmatically (e.g. via the DataFrame API)

There are effectively two overlapping levels of configuration:

* Session level (e.g. configuration that can be reused from one query execution to the next)
* Statement level (e.g. configuration that is needed to plan a query and doesn't change for the duration of a statement, such as "the value of now()", target batch sizes, etc.)

# Current state of configuration in DataFusion

The current state is .... inconsistent, to put it mildly.

The core structure is `SessionContext`, which is the final glue that puts together a DataFusion session (e.g. tables provided, etc.) and serves as the high-level entry point into most DataFusion functionality.

There is then some combination of SessionState, SessionConfig, and ConfigOptions:

SessionContext -- Has a SessionState -- SessionStartTime -- SessionConfig -- ConfigOptions

`SessionConfig` is effectively the session-level configuration I describe above.

`TaskContext` is meant to be the statement-level (aka per task / per query) context. If you look hard, you can see it has a copy of the SessionConfig (buried in TaskProperties), or is maybe backed by KVPairs instead.

# Desire

I would like to have a clear configuration story that clearly separates the session-level config from the statement-level (task-level) config and makes all configuration values easy to introspect programmatically.

# Recommendations

This is a complicated issue and I don't have a magic answer.
However, I have some concrete suggestions.

Some suggested steps:

- [ ] Remove the `KVPairs` stuff in TaskProperties, as it is not used by Ballista (https://github.com/search?q=repo%3Aapache%2Farrow-ballista%20KVPairs&type=code) and has been superseded by ConfigOptions and the TaskProperties wrapper
- [ ] Consolidate SessionConfig / ConfigOptions (my preference would be to inline ConfigOptions in SessionConfig)

I think consolidating SessionConfig / ConfigOptions is likely to be the most controversial step and cause the most churn, but I think it will provide immense benefits. Then we can further improve from there.
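To make the consolidation idea a little more concrete, here is a minimal sketch of what a single session-level store with a separate statement-level snapshot might look like. Everything in it is hypothetical: `ConfigValue`, `SessionSettings`, `StatementSettings`, and their methods are illustrative names, not existing DataFusion types, the key strings are only examples, and the environment-variable naming rule (upper-case the key and replace `.` with `_`) is an assumption made for the sketch.

```rust
use std::collections::BTreeMap;
use std::fmt;

/// Hypothetical typed configuration value (not a DataFusion type).
#[derive(Debug, Clone, PartialEq)]
pub enum ConfigValue {
    Bool(bool),
    UInt(u64),
    Text(String),
}

impl fmt::Display for ConfigValue {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigValue::Bool(b) => write!(f, "{b}"),
            ConfigValue::UInt(n) => write!(f, "{n}"),
            ConfigValue::Text(s) => write!(f, "{s}"),
        }
    }
}

/// Hypothetical session-level settings: one store behind one set of accessors.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct SessionSettings {
    values: BTreeMap<String, ConfigValue>,
}

impl SessionSettings {
    /// Programmatic setter (e.g. called from a DataFrame-style API).
    pub fn set(&mut self, key: &str, value: ConfigValue) {
        self.values.insert(key.to_string(), value);
    }

    /// Setter from a user-provided string, as in `SET key = value`.
    /// An existing entry keeps its type; an unknown key is stored as text.
    pub fn set_from_str(&mut self, key: &str, raw: &str) -> Result<(), String> {
        let parsed = match self.values.get(key) {
            Some(ConfigValue::Bool(_)) => {
                ConfigValue::Bool(raw.parse::<bool>().map_err(|e| format!("{key}: {e}"))?)
            }
            Some(ConfigValue::UInt(_)) => {
                ConfigValue::UInt(raw.parse::<u64>().map_err(|e| format!("{key}: {e}"))?)
            }
            _ => ConfigValue::Text(raw.to_string()),
        };
        self.values.insert(key.to_string(), parsed);
        Ok(())
    }

    /// Override known keys from (name, value) pairs such as `std::env::vars()`.
    /// Assumed rule: key `a.b_c` is looked up under the name `A_B_C`.
    pub fn apply_env<I>(&mut self, vars: I) -> Result<(), String>
    where
        I: IntoIterator<Item = (String, String)>,
    {
        let env: BTreeMap<String, String> = vars.into_iter().collect();
        let keys: Vec<String> = self.values.keys().cloned().collect();
        for key in keys {
            let env_name = key.replace('.', "_").to_uppercase();
            if let Some(raw) = env.get(&env_name) {
                self.set_from_str(&key, raw)?;
            }
        }
        Ok(())
    }

    pub fn get(&self, key: &str) -> Option<&ConfigValue> {
        self.values.get(key)
    }

    /// All entries as (key, value) strings, sorted by key; usable for a
    /// `SHOW`-style listing or as a simple wire format.
    pub fn entries(&self) -> Vec<(String, String)> {
        self.values.iter().map(|(k, v)| (k.clone(), v.to_string())).collect()
    }
}

/// Hypothetical statement-level settings: a frozen copy of the session
/// settings plus values that are fixed for the duration of one statement.
#[derive(Debug, Clone)]
pub struct StatementSettings {
    session: SessionSettings,
    query_start_ms: u64,
}

impl StatementSettings {
    pub fn new(session: &SessionSettings, query_start_ms: u64) -> Self {
        Self { session: session.clone(), query_start_ms }
    }

    pub fn session(&self) -> &SessionSettings {
        &self.session
    }

    pub fn query_start_ms(&self) -> u64 {
        self.query_start_ms
    }
}
```

In this sketch, the string setter, the environment override, and the listing all go through the same map, so there is only one place to introspect. The statement-level struct takes a copy at planning time, so later `SET` commands on the session would not affect a statement that is already running. Whether a clone or a shared reference is the right trade-off is a separate design question.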
