Thank you, Holden!

Yes, having everything live in the ConfigEntry is attractive.

The main reason I proposed an alternative where the groups are defined in YAML 
is that if the config groups are defined in ConfigEntry, then altering the 
groupings – which is relevant only to the display of config documentation – 
requires rebuilding Spark. This feels a bit off to me in terms of design.

For example, on the SQL performance tuning page there is some narrative 
documentation about caching 
<https://spark.apache.org/docs/3.5.0/sql-performance-tuning.html#caching-data-in-memory>,
 plus a table of relevant configs. If I want an additional config to show up in 
this table, I need to add it to the config group that backs the table.

With the ConfigEntry approach in #44755 
<https://github.com/apache/spark/pull/44755>, that means editing the 
appropriate ConfigEntry and rebuilding Spark before I can regenerate the config 
table.

val SOME_CONFIG = buildConf("spark.sql.someCachingRelatedConfig")
  .doc("some documentation")
  .version("2.1.0")
  .withDocumentationGroup("sql-tuning-caching-data")  // assign group to the config

With the YAML approach in #44756 <https://github.com/apache/spark/pull/44756>, 
that means editing the config group defined in the YAML file and regenerating 
the config table. No Spark rebuild required.

sql-tuning-caching-data:
- spark.sql.inMemoryColumnarStorage.compressed
- spark.sql.inMemoryColumnarStorage.batchSize
- spark.sql.someCachingRelatedConfig  # add config to the group

In both cases the config names, descriptions, defaults, etc. will be pulled 
from the ConfigEntry when building the HTML tables.
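
To make that table-building step concrete, here is a rough Python sketch of how a group could be turned into an HTML table. It is not the code from either PR. `CONFIG_GROUPS` stands in for the parsed YAML file, `CONFIG_METADATA` stands in for whatever is extracted from the `ConfigEntry` instances, and the function name, column headings, and metadata values are placeholders I made up for illustration.

```python
from html import escape

# Hypothetical stand-in for the parsed YAML file:
# group name -> ordered list of config names.
CONFIG_GROUPS = {
    "sql-tuning-caching-data": [
        "spark.sql.inMemoryColumnarStorage.compressed",
        "spark.sql.inMemoryColumnarStorage.batchSize",
        "spark.sql.someCachingRelatedConfig",
    ],
}

# Hypothetical stand-in for metadata read from the ConfigEntry instances.
# All values below are illustrative placeholders, not authoritative.
CONFIG_METADATA = {
    "spark.sql.inMemoryColumnarStorage.compressed": {
        "default": "true", "doc": "placeholder description", "version": "1.0.1",
    },
    "spark.sql.inMemoryColumnarStorage.batchSize": {
        "default": "10000", "doc": "placeholder description", "version": "1.1.1",
    },
    "spark.sql.someCachingRelatedConfig": {
        "default": "(none)", "doc": "some documentation", "version": "2.1.0",
    },
}

# Assumed column headings for the generated table.
HEADER = (
    "<tr><th>Property Name</th><th>Default</th>"
    "<th>Meaning</th><th>Since Version</th></tr>"
)


def render_config_table(group, groups, metadata):
    """Build an HTML table for one config group.

    The group only decides which configs appear and in what order;
    every displayed field comes from the metadata.
    """
    rows = []
    for name in groups[group]:
        # Raises KeyError if the group names a config with no metadata,
        # so a typo in the group file fails loudly instead of silently.
        entry = metadata[name]
        rows.append(
            "<tr>"
            f"<td><code>{escape(name)}</code></td>"
            f"<td>{escape(entry['default'])}</td>"
            f"<td>{escape(entry['doc'])}</td>"
            f"<td>{escape(entry['version'])}</td>"
            "</tr>"
        )
    return "<table>\n" + "\n".join([HEADER] + rows) + "\n</table>"


if __name__ == "__main__":
    print(render_config_table(
        "sql-tuning-caching-data", CONFIG_GROUPS, CONFIG_METADATA))
```

In this sketch, adding a config to the table is a one-line change to the group list. With the ConfigEntry approach the same rendering step would apply, except that the group membership would be read from the config metadata rather than from a separate file.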

I prefer the latter approach but I’m open to whatever committers are more 
comfortable with. If you prefer the former, then I’ll focus on that and ping 
you for reviews accordingly!


> On Feb 21, 2024, at 11:43 AM, Holden Karau <hol...@pigscanfly.ca> wrote:
> 
> I think this is a good idea. I like having everything in one source of truth 
> rather than two (so option 1 sounds like a good idea); but that’s just my 
> opinion. I'd be happy to help with reviews though.
> 
> On Wed, Feb 21, 2024 at 6:37 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>> I know config documentation is not the most exciting thing. If there is 
>> anything I can do to make this as easy as possible for a committer to 
>> shepherd, I’m all ears!
>> 
>> 
>>> On Feb 14, 2024, at 8:53 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>> 
>>> I’m interested in automating our config documentation and need input from a 
>>> committer who is interested in shepherding this work.
>>> 
>>> We have around 60 tables of configs across our documentation. Here’s a 
>>> typical example. 
>>> <https://github.com/apache/spark/blob/736d8ab3f00e7c5ba1b01c22f6398b636b8492ea/docs/sql-performance-tuning.md?plain=1#L65-L159>
>>> 
>>> These tables span several thousand lines of manually maintained HTML, which 
>>> poses a few problems:
>>> - The documentation for a given config is sometimes out of sync across the 
>>>   HTML table and its source `ConfigEntry`.
>>> - Internal configs that are not supposed to be documented publicly sometimes 
>>>   are.
>>> - Many config names and defaults are extremely long, posing formatting 
>>>   problems.
>>> 
>>> Contributors waste time dealing with these issues in a losing battle to 
>>> keep everything up-to-date and consistent.
>>> 
>>> I’d like to solve all these problems by generating HTML tables 
>>> automatically from the `ConfigEntry` instances where the configs are 
>>> defined.
>>> 
>>> I’ve proposed two alternative solutions:
>>> 1. #44755 <https://github.com/apache/spark/pull/44755>: Enhance `ConfigEntry` 
>>>    so a config can be associated with one or more groups, and use that new 
>>>    metadata to generate the tables we need.
>>> 2. #44756 <https://github.com/apache/spark/pull/44756>: Add a standalone YAML 
>>>    file where we define config groups, and use that to generate the tables we 
>>>    need.
>>> 
>>> If you’re a committer and are interested in this problem, please chime in 
>>> on whatever approach appeals to you. If you think this is a bad idea, I’m 
>>> also eager to hear your feedback.
>>> 
>>> Nick
>>> 
> 
> 
