In general I think it's better to have more detailed documents, but we don't have to force everyone to do it if the config name is structured. I would +1 to document the relationship of we can't tell it from the config names, e.g. spark.shuffle.service.enabled and spark.dynamicAllocation.enabled.
On Wed, Feb 12, 2020 at 7:54 PM Hyukjin Kwon <gurwls...@gmail.com> wrote: > Also, I would like to hear other people' thoughts on here. Could I ask > what you guys think about this in general? > > 2020년 2월 12일 (수) 오후 12:02, Hyukjin Kwon <gurwls...@gmail.com>님이 작성: > >> To do that, we should explicitly document such structured configuration >> and implicit effect, which is currently missing. >> I would be more than happy if we document such implied relationship, >> *and* if we are very sure all configurations are structured correctly >> coherently. >> Until that point, I think it might be more practical to simply document >> it for now. >> >> > Btw, maybe off-topic, `spark.dynamicAllocation` is having another issue >> on practice - whether to duplicate description between configuration code >> and doc. I have been asked to add description on configuration code >> regardlessly, and existing codebase doesn't. This configuration is >> widely-used one. >> This is actually something we should fix too. in SQL configuration, now >> we don't have such duplications as of >> https://github.com/apache/spark/pull/27459 as it generates. We should do >> it in other configurations. >> >> >> 2020년 2월 12일 (수) 오전 11:47, Jungtaek Lim <kabhwan.opensou...@gmail.com>님이 >> 작성: >> >>> I'm looking into the case of `spark.dynamicAllocation` and this seems to >>> be the thing to support my voice. >>> >>> >>> https://github.com/apache/spark/blob/master/docs/configuration.md#dynamic-allocation >>> >>> I don't disagree with adding "This requires >>> spark.shuffle.service.enabled to be set." in the description of >>> `spark.dynamicAllocation.enabled`. This cannot be inferred implicitly, >>> hence it should be better to have it. >>> >>> Why I'm in favor of structured configuration & implicit effect over >>> describing everything explicitly is there. >>> >>> 1. There're 10 configurations (if the doc doesn't miss any other >>> configuration) except `spark.dynamicAllocation.enabled`, and only 4 >>> configurations are referred in the description of >>> `spark.dynamicAllocation.enabled` - majority of config keys are missing. >>> 2. I think it's intentional, but the table starts >>> with `spark.dynamicAllocation.enabled` which talks implicitly but >>> intuitively that if you disable this then everything on dynamic allocation >>> won't work. Missing majority of references on config keys don't get it hard >>> to understand. >>> 3. Even `spark.dynamicAllocation` has bad case - see >>> `spark.dynamicAllocation.shuffleTracking.enabled` and >>> `spark.dynamicAllocation.shuffleTimeout`. It is not respecting the >>> structure of configuration. I think this is worse than not explicitly >>> mentioning the description. Let's assume the name has >>> been `spark.dynamicAllocation.shuffleTracking.timeout` - isn't it intuitive >>> that setting `spark.dynamicAllocation.shuffleTracking.enabled` to `false` >>> would effectively disable `spark.dynamicAllocation.shuffleTracking.timeout`? >>> >>> Btw, maybe off-topic, `spark.dynamicAllocation` is having another issue >>> on practice - whether to duplicate description between configuration code >>> and doc. I have been asked to add description on configuration code >>> regardlessly, and existing codebase doesn't. This configuration is >>> widely-used one. >>> >>> >>> On Wed, Feb 12, 2020 at 11:22 AM Hyukjin Kwon <gurwls...@gmail.com> >>> wrote: >>> >>>> Sure, adding "[DISCUSS]" is a good practice to label it. I had to do it >>>> although it might be "redundant" :-) since anyone can give feedback to any >>>> thread in Spark dev mailing list, and discuss. >>>> >>>> This is actually more prevailing given my rough reading of >>>> configuration files. I would like to see this missing relationship as a bad >>>> pattern, started from a personal preference. >>>> >>>> > Personally I'd rather not think someone won't understand setting >>>> `.enabled` to `false` means the functionality is disabled and effectively >>>> it disables all sub-configurations. >>>> > E.g. when `spark.sql.adaptive.enabled` is `false`, all the >>>> configurations for `spark.sql.adaptive.*` are implicitly no-op. For me this >>>> is pretty intuitive and the one of major >>>> > benefits of the structured configurations. >>>> >>>> I don't think this is a good idea we assume for users to know such >>>> contexts. One might think >>>> `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled` can >>>> partially enable the feature. It is better to be explicit to document >>>> since some of configurations are even difficult for users to confirm if it >>>> is working or not. >>>> For instance, one might think setting >>>> 'spark.eventLog.rolling.maxFileSize' automatically enables rolling. Then, >>>> they realise the log is not rolling later after the file >>>> size becomes bigger. >>>> >>>> >>>> 2020년 2월 12일 (수) 오전 10:47, Jungtaek Lim <kabhwan.opensou...@gmail.com>님이 >>>> 작성: >>>> >>>>> I'm sorry if I miss something, but this is ideally better to be >>>>> started as [DISCUSS] as I haven't seen any reference to have consensus on >>>>> this practice. >>>>> >>>>> For me it's just there're two different practices co-existing on the >>>>> codebase, meaning it's closer to the preference of individual (with >>>>> implicitly agreeing that others have different preferences), or it hasn't >>>>> been discussed thoughtfully. >>>>> >>>>> Personally I'd rather not think someone won't understand setting >>>>> `.enabled` to `false` means the functionality is disabled and effectively >>>>> it disables all sub-configurations. E.g. when `spark.sql.adaptive.enabled` >>>>> is `false`, all the configurations for `spark.sql.adaptive.*` are >>>>> implicitly no-op. For me this is pretty intuitive and the one of major >>>>> benefits of the structured configurations. >>>>> >>>>> If we want to make it explicit, "all" sub-configurations should have >>>>> redundant part of the doc. More redundant if the condition is nested. I >>>>> agree this is the good step of "be kind" but less pragmatic. >>>>> >>>>> I'd be happy to follow the consensus we would make in this thread. >>>>> Appreciate more voices. >>>>> >>>>> Thanks, >>>>> Jungtaek Lim (HeartSaVioR) >>>>> >>>>> >>>>> On Wed, Feb 12, 2020 at 10:36 AM Hyukjin Kwon <gurwls...@gmail.com> >>>>> wrote: >>>>> >>>>>> > I don't plan to document this officially yet >>>>>> Just to prevent confusion, I meant I don't yet plan to document the >>>>>> fact that we should write the relationships in configurations as a >>>>>> code/review guideline in https://spark.apache.org/contributing.html >>>>>> >>>>>> >>>>>> 2020년 2월 12일 (수) 오전 9:57, Hyukjin Kwon <gurwls...@gmail.com>님이 작성: >>>>>> >>>>>>> Hi all, >>>>>>> >>>>>>> I happened to review some PRs and I noticed that some configurations >>>>>>> don't have some information >>>>>>> necessary. >>>>>>> >>>>>>> To be explicit, I would like to make sure we document the direct >>>>>>> relationship between other configurations >>>>>>> in the documentation. For example, >>>>>>> `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled` >>>>>>> can be only enabled when `spark.sql.adaptive.enabled` is enabled. >>>>>>> That's clearly documented. >>>>>>> We're good in general given that we document them in general in >>>>>>> Apache Spark. >>>>>>> See 'spark.task.reaper.enabled', 'spark.dynamicAllocation.enabled', >>>>>>> 'spark.sql.parquet.filterPushdown', etc. >>>>>>> >>>>>>> However, I noticed such a pattern that such information is missing >>>>>>> in some components in general, for example, >>>>>>> `spark.history.fs.cleaner.*`, `spark.history.kerberos.*` and >>>>>>> `spark.history.ui.acls.* ` >>>>>>> >>>>>>> I hope we all start to document such information. Logically users >>>>>>> can't know the relationship and I myself >>>>>>> had to read the codes to confirm when I review. >>>>>>> I don't plan to document this officially yet because to me it looks >>>>>>> a pretty logical request to me; however, >>>>>>> let me know if you guys have some different opinions. >>>>>>> >>>>>>> Thanks. >>>>>>> >>>>>>> >>>>>>>