alamb opened a new issue, #4349:
URL: https://github.com/apache/arrow-datafusion/issues/4349

   
   # Related
   https://github.com/apache/arrow-ballista/issues/479
   https://github.com/apache/arrow-datafusion/issues/4346
   
   # Introduction
   
   "Configuration" in DataFusion has a few use cases:
   * set a value from a string provided by the user (`set XX = YY` in datafusion-cli)
   * set a value from environment variables (See 
https://docs.rs/datafusion/14.0.0/datafusion/config/struct.ConfigOptions.html#method.from_env)
   * display as a string (e.g. `SHOW` in datafusion-cli)
   * documented in the datafusion docs
   * serialize / deserialize over the network (e.g. Ballista)
   * settable programmatically (e.g. via the dataframe API)
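
To make the use cases above concrete, here is a minimal sketch of a string-keyed option store that covers each of them. It is not DataFusion's actual `ConfigOptions`: the `Options` type, its methods, and the env-var naming rule (uppercase the key, replace `.` with `_`) are assumptions made for illustration.

```rust
use std::collections::BTreeMap;
use std::fmt;

/// Hypothetical flat option store (not DataFusion's real `ConfigOptions`).
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Options {
    values: BTreeMap<String, String>,
}

impl Options {
    /// Programmatic access, e.g. from a DataFrame-style API.
    pub fn set(&mut self, key: &str, value: &str) {
        self.values.insert(key.to_string(), value.to_string());
    }

    pub fn get(&self, key: &str) -> Option<&str> {
        self.values.get(key).map(String::as_str)
    }

    /// Parse a user-supplied `key = value` string, as in `set XX = YY`.
    pub fn set_from_str(&mut self, assignment: &str) -> Result<(), String> {
        let (key, value) = assignment
            .split_once('=')
            .ok_or_else(|| format!("expected `key = value`, got `{assignment}`"))?;
        let key = key.trim();
        if key.is_empty() {
            return Err("empty key".to_string());
        }
        self.set(key, value.trim());
        Ok(())
    }

    /// Overlay known keys with environment-style pairs. Assumed convention:
    /// `a.b.c` is looked up as `A_B_C`. Pass `std::env::vars()` in real use.
    pub fn apply_env<I>(&mut self, vars: I)
    where
        I: IntoIterator<Item = (String, String)>,
    {
        let env: BTreeMap<String, String> = vars.into_iter().collect();
        for (key, value) in self.values.iter_mut() {
            let env_key = key.to_uppercase().replace('.', "_");
            if let Some(v) = env.get(&env_key) {
                *value = v.clone();
            }
        }
    }

    /// Deserialize from the text produced by `Display` (one pair per line).
    pub fn parse(text: &str) -> Result<Self, String> {
        let mut opts = Options::default();
        for line in text.lines().filter(|l| !l.trim().is_empty()) {
            opts.set_from_str(line)?;
        }
        Ok(opts)
    }
}

/// Display doubles as the `SHOW` output and as a naive wire format.
impl fmt::Display for Options {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for (k, v) in &self.values {
            writeln!(f, "{k} = {v}")?;
        }
        Ok(())
    }
}
```

The point of the sketch is that a single representation can back all of the listed use cases. A real implementation would also need typed values and per-option documentation, and the line-based format here breaks on values that contain newlines.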
   
   There are effectively two overlapping levels of configuration:
   * Session level (e.g. that can be reused from one query execution to the 
next)
   * Statement level (e.g. that is needed to plan a query and doesn't change 
for the duration of a statement such as "the value of now()" and target batch 
sizes, etc)
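
As a rough illustration of the two levels, the sketch below snapshots session-level settings into a statement-level value when a statement begins. `SessionSettings`, `StatementSettings`, and `begin_statement` are hypothetical names, not existing DataFusion types.

```rust
use std::time::SystemTime;

/// Hypothetical session-level settings: they live across many queries and
/// may be changed between them.
#[derive(Debug, Clone)]
pub struct SessionSettings {
    pub target_batch_size: usize,
}

/// Hypothetical statement-level settings: fixed once when a statement starts.
#[derive(Debug, Clone)]
pub struct StatementSettings {
    pub target_batch_size: usize,
    /// Captured once so every `now()` in the statement sees the same value.
    pub query_start: SystemTime,
}

impl SessionSettings {
    /// Snapshot the session settings for the duration of one statement.
    pub fn begin_statement(&self) -> StatementSettings {
        StatementSettings {
            target_batch_size: self.target_batch_size,
            query_start: SystemTime::now(),
        }
    }
}
```

Because the statement-level value is a copy, changing the session afterwards does not affect a statement that is already being planned or executed.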
   
   
   # Current state of configuration in DataFusion
   
   The current state is ... inconsistent, to put it mildly.
   
   The core structure is "SessionContext", which is the final glue that holds a datafusion session together (e.g. tables provided, etc.) and serves as the high-level entry point into most DataFusion functionality.
   
   There is then some combination of SessionState, SessionConfig, and ConfigOptions:
   
   
   SessionContext
    -- Has a SessionState
    -- SessionStartTime
    -- SessionConfig
      -- ConfigOptions
   
   
   `SessionConfig` is effectively the Session level configuration I describe 
above.
   
   TaskContext is meant to be the statement level (aka per task / per query) context. If you look hard, you can see it has a copy of the SessionConfig (buried in TaskProperties), or it may instead be backed by KVPairs.
   
   # Desire
   I would like to have a clear configuration story that clearly separates the session level config from the statement (task) level config and makes all configuration values easy to introspect programmatically.
   
   # Recommendations
   This is a complicated issue and I don't have a magic answer. However, I have some concrete suggestions.
   
   Some suggested steps:
   - [ ] Remove the `KVPairs` stuff in TaskProperties, as that is not used by ballista 
https://github.com/search?q=repo%3Aapache%2Farrow-ballista%20KVPairs&type=code 
and has been superseded by ConfigOptions and the TaskProperties wrapper
   - [ ] Consolidate SessionConfig / ConfigOptions (my preference would be to inline ConfigOptions in SessionConfig)
   
   Consolidating SessionConfig / ConfigOptions is likely to be the most controversial step and to cause the most churn, but I think it will provide immense benefits.
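
One possible shape for such a consolidation is sketched below: typed fields inlined into a single struct, plus string-keyed `set` and `entries` methods so every value can still be set from text and introspected. `UnifiedSessionConfig`, its fields, its defaults, and the key names are all hypothetical and are not taken from DataFusion.

```rust
use std::str::FromStr;

/// Hypothetical consolidated session config: typed fields are inlined
/// instead of living in a separate nested options object.
#[derive(Debug, Clone, PartialEq)]
pub struct UnifiedSessionConfig {
    pub batch_size: usize,
    pub target_partitions: usize,
    pub information_schema: bool,
}

impl Default for UnifiedSessionConfig {
    fn default() -> Self {
        // Illustrative defaults only.
        Self { batch_size: 8192, target_partitions: 4, information_schema: false }
    }
}

impl UnifiedSessionConfig {
    /// Every setting as (key, value) text: usable for `SHOW`, generated
    /// docs, or a simple wire format.
    pub fn entries(&self) -> Vec<(&'static str, String)> {
        vec![
            ("execution.batch_size", self.batch_size.to_string()),
            ("execution.target_partitions", self.target_partitions.to_string()),
            ("catalog.information_schema", self.information_schema.to_string()),
        ]
    }

    /// Set one value from text, rejecting unknown keys and bad values.
    pub fn set(&mut self, key: &str, value: &str) -> Result<(), String> {
        match key {
            "execution.batch_size" => self.batch_size = parse_value(key, value)?,
            "execution.target_partitions" => {
                self.target_partitions = parse_value(key, value)?
            }
            "catalog.information_schema" => {
                self.information_schema = parse_value(key, value)?
            }
            other => return Err(format!("unknown option `{other}`")),
        }
        Ok(())
    }
}

/// Parse a textual value into the field's type with a readable error.
fn parse_value<T: FromStr>(key: &str, value: &str) -> Result<T, String> {
    value
        .trim()
        .parse()
        .map_err(|_| format!("invalid value `{value}` for `{key}`"))
}
```

The trade-off in this sketch is that the key list is written twice (in `entries` and in `set`). A macro or derive could generate both from one definition, at the cost of more machinery.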
   
   Then we can further improve from there.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
