Also, I would like to hear other people's thoughts here. May I ask what you
all think about this in general?
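To make the pattern we are debating concrete, here is a toy sketch (plain Python, not Spark code) of the "structured configuration" semantics: a sub-configuration is effectively a no-op whenever an ancestor `<prefix>.enabled` flag is set to false.

```python
# Toy model (NOT Spark code) of the implicit gating discussed in this
# thread: a config key under some prefix is effectively a no-op whenever
# an ancestor '<prefix>.enabled' flag is false.

def is_effective(key: str, conf: dict) -> bool:
    """Return False if any ancestor '<prefix>.enabled' flag disables `key`."""
    parts = key.split(".")
    for i in range(1, len(parts)):
        gate = ".".join(parts[:i]) + ".enabled"
        if gate != key and conf.get(gate) == "false":
            return False
    return True

conf = {
    "spark.sql.adaptive.enabled": "false",
    "spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled": "true",
}

# The sub-configuration is set to "true", but the parent flag gates it off.
print(is_effective(
    "spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled", conf))
# prints: False
```

The question in this thread is whether that gating relationship should be left implicit in the key structure, or spelled out in every sub-configuration's description.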

On Wed, Feb 12, 2020 at 12:02 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> To do that, we would have to explicitly document such structured
> configuration and its implicit effects, which is currently missing.
> I would be more than happy if we documented such implied relationships, *and*
> if we were very sure all configurations are structured correctly and
> coherently. Until then, I think it is more practical to simply document the
> relationships explicitly for now.
>
> > Btw, maybe off-topic, `spark.dynamicAllocation` has another issue in
> practice - whether to duplicate descriptions between the configuration code
> and the docs. I have been asked to add descriptions in the configuration
> code regardless, while the existing codebase doesn't have them. This is a
> widely used configuration.
> This is actually something we should fix too. For SQL configurations, we
> no longer have such duplication as of
> https://github.com/apache/spark/pull/27459, since the documentation is now
> generated from the configuration code. We should do the same for other
> configurations.
>
>
> On Wed, Feb 12, 2020 at 11:47 AM, Jungtaek Lim
> <kabhwan.opensou...@gmail.com> wrote:
>
>> I'm looking into the case of `spark.dynamicAllocation`, and it seems to
>> support my point.
>>
>>
>> https://github.com/apache/spark/blob/master/docs/configuration.md#dynamic-allocation
>>
>> I don't disagree with adding "This requires spark.shuffle.service.enabled
>> to be set." to the description of `spark.dynamicAllocation.enabled`. This
>> cannot be inferred implicitly, so it is better to have it.
>>
>> Here is why I'm in favor of structured configuration and implicit effects
>> over describing everything explicitly:
>>
>> 1. There are 10 configurations (unless the doc misses some) besides
>> `spark.dynamicAllocation.enabled`, and only 4 of them are referred to in
>> the description of `spark.dynamicAllocation.enabled` - the majority of
>> config keys are missing.
>> 2. I think it's intentional, but the table starts
>> with `spark.dynamicAllocation.enabled`, which conveys implicitly but
>> intuitively that disabling it disables everything in dynamic allocation.
>> The missing references to the majority of config keys don't make it hard
>> to understand.
>> 3. Even `spark.dynamicAllocation` has a bad case - see
>> `spark.dynamicAllocation.shuffleTracking.enabled` and
>> `spark.dynamicAllocation.shuffleTimeout`. The latter does not respect the
>> structure of the configuration, which I think is worse than not
>> explicitly mentioning the relationship in the description. Suppose the
>> name had been `spark.dynamicAllocation.shuffleTracking.timeout` - isn't
>> it intuitive that setting `spark.dynamicAllocation.shuffleTracking.enabled`
>> to `false` would effectively disable
>> `spark.dynamicAllocation.shuffleTracking.timeout`?
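The naming point above can be sketched as a toy example (plain Python, not Spark code): with consistent nesting, every governing `.enabled` flag of a key is mechanically discoverable from its dotted prefix, while the actual `shuffleTimeout` name hides its dependency on shuffle tracking.

```python
# Toy sketch (NOT Spark code): find the governing '.enabled' flags of a
# config key purely from its dotted-prefix structure.

def governing_flags(key: str, known: set) -> list:
    """Return the ancestor '.enabled' keys that implicitly gate `key`."""
    parts = key.split(".")
    flags = []
    for i in range(1, len(parts)):
        gate = ".".join(parts[:i]) + ".enabled"
        if gate in known and gate != key:
            flags.append(gate)
    return flags

known = {
    "spark.dynamicAllocation.enabled",
    "spark.dynamicAllocation.shuffleTracking.enabled",
}

# Hypothetical consistent name: both gates are inferable from the prefix.
print(governing_flags(
    "spark.dynamicAllocation.shuffleTracking.timeout", known))
# prints: ['spark.dynamicAllocation.enabled',
#          'spark.dynamicAllocation.shuffleTracking.enabled']

# The actual name fails to surface the shuffleTracking gate.
print(governing_flags(
    "spark.dynamicAllocation.shuffleTimeout", known))
# prints: ['spark.dynamicAllocation.enabled']
```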
>>
>> Btw, maybe off-topic, `spark.dynamicAllocation` has another issue in
>> practice - whether to duplicate descriptions between the configuration
>> code and the docs. I have been asked to add descriptions in the
>> configuration code regardless, while the existing codebase doesn't have
>> them. This is a widely used configuration.
>>
>>
>> On Wed, Feb 12, 2020 at 11:22 AM Hyukjin Kwon <gurwls...@gmail.com>
>> wrote:
>>
>>> Sure, labeling the thread with "[DISCUSS]" is a good practice. I should
>>> have done it, although it might be "redundant" :-) since anyone can give
>>> feedback on, and discuss, any thread on the Spark dev mailing list.
>>>
>>> This pattern is actually quite prevalent, given my rough reading of the
>>> configuration files. I would treat these missing relationships as a bad
>>> pattern that started from personal preference.
>>>
>>> > Personally I'd rather not think someone won't understand setting
>>> `.enabled` to `false` means the functionality is disabled and effectively
>>> it disables all sub-configurations.
>>> > E.g. when `spark.sql.adaptive.enabled` is `false`, all the
>>> configurations for `spark.sql.adaptive.*` are implicitly no-op. For me this
>>> is pretty intuitive and the one of major
>>> > benefits of the structured configurations.
>>>
>>> I don't think it is a good idea to assume users know such context. One
>>> might think
>>> `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled` can
>>> partially enable the feature. It is better to document this explicitly,
>>> since some configurations are difficult for users to verify as working
>>> or not.
>>> For instance, one might think setting
>>> 'spark.eventLog.rolling.maxFileSize' automatically enables rolling, and
>>> then realize later, after the file size has grown, that the log is not
>>> rolling.
>>>
>>>
>>> On Wed, Feb 12, 2020 at 10:47 AM, Jungtaek Lim
>>> <kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> I'm sorry if I'm missing something, but ideally this should have
>>>> started as a [DISCUSS] thread, as I haven't seen any reference to a
>>>> consensus on this practice.
>>>>
>>>> For me, there are simply two different practices co-existing in the
>>>> codebase, which means this is closer to individual preference (with
>>>> implicit agreement that others have different preferences), or it
>>>> hasn't been discussed thoroughly.
>>>>
>>>> Personally, I'd rather not think someone won't understand that setting
>>>> `.enabled` to `false` means the functionality is disabled, effectively
>>>> disabling all sub-configurations. E.g. when `spark.sql.adaptive.enabled`
>>>> is `false`, all the `spark.sql.adaptive.*` configurations are implicitly
>>>> no-ops. To me this is pretty intuitive, and one of the major benefits
>>>> of structured configurations.
>>>>
>>>> If we wanted to make it explicit, "all" sub-configurations would need
>>>> a redundant part in their docs - more redundant if the condition is
>>>> nested. I agree this would be a good step toward "being kind", but it
>>>> is less pragmatic.
>>>>
>>>> I'd be happy to follow the consensus we would make in this thread.
>>>> Appreciate more voices.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>>
>>>> On Wed, Feb 12, 2020 at 10:36 AM Hyukjin Kwon <gurwls...@gmail.com>
>>>> wrote:
>>>>
>>>>> > I don't plan to document this officially yet
>>>>> Just to prevent confusion: I meant that I don't yet plan to document
>>>>> the guidance that we should describe the relationships between
>>>>> configurations as a code/review guideline in
>>>>> https://spark.apache.org/contributing.html
>>>>>
>>>>>
>>>>> On Wed, Feb 12, 2020 at 9:57 AM, Hyukjin Kwon <gurwls...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I happened to review some PRs, and I noticed that some configurations
>>>>>> are missing necessary information.
>>>>>>
>>>>>> To be explicit, I would like to make sure we document the direct
>>>>>> relationships with other configurations
>>>>>> in the documentation. For example,
>>>>>> `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled`
>>>>>> can only be enabled when `spark.sql.adaptive.enabled` is enabled.
>>>>>> That's clearly documented.
>>>>>> We're in good shape in general, given that we usually document these
>>>>>> in Apache Spark.
>>>>>> See 'spark.task.reaper.enabled', 'spark.dynamicAllocation.enabled',
>>>>>> 'spark.sql.parquet.filterPushdown', etc.
>>>>>>
>>>>>> However, I noticed a pattern where such information is missing in
>>>>>> some components, for example
>>>>>> `spark.history.fs.cleaner.*`, `spark.history.kerberos.*` and
>>>>>> `spark.history.ui.acls.*`
>>>>>>
>>>>>> I hope we all start to document such information. Users can't
>>>>>> logically know the relationships, and I myself
>>>>>> had to read the code to confirm them while reviewing.
>>>>>> I don't plan to document this officially yet because it looks like a
>>>>>> pretty reasonable request to me; however,
>>>>>> let me know if you have different opinions.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>
>>>>>>
