andygrove opened a new issue, #4420: URL: https://github.com/apache/datafusion-comet/issues/4420
## Motivation Comet's docs are growing — user guide, contributor guide, compatibility tables, operator references, plus per-release changelogs. Quality varies, terminology drifts, and once the nomenclature style guide from #4419 lands, we need a way to keep new and existing docs aligned with it. PR reviewers shouldn't have to be the last line of defense for prose quality, and contributors shouldn't have to internalize a multi-page style guide before opening a docs PR. A sibling skill to `review-comet-pr`, focused on documentation, would let reviewers (and contributors during self-review) get a docs-expert pass over a PR or an existing doc. The skill produces advisory feedback — it does not edit files or post comments directly, mirroring `review-comet-pr`. ## What the skill should check ### Style guide adherence (post #4419) - Bare *"native"* used as a vague adjective (e.g. *"runs natively"*, *"the native path"*) — flag and suggest the specific axis: *Rust-implemented*, *Arrow-native*, *Comet pipeline*. - Permitted compounds (*native Rust*, *Arrow-native*, *native shuffle* in shuffle-pair context) recognized and not flagged. - *"Vectorized"* used as a synonym for *columnar* — flag. - *"Falls back to Spark"* / *"Spark fallback"* used consistently for the same concept. - Operator names match plan output exactly (e.g. `CometProject`, not `CometProjectExec` or `Comet Project`). ### Information architecture - Audience is identifiable from the first paragraph — user, contributor, or operator. - Heading hierarchy is correct (no jumps from H2 to H4, no multiple H1s). - Tables used for structured comparisons (operators, configs, support matrices) rather than bulleted prose. - Long pages have a brief overview / table of contents at the top. ### Completeness - New user-guide pages have: overview, prerequisites or version applicability, at least one runnable example, links to related docs. - New contributor-guide pages have: scope/audience, prerequisites, the procedure, how to verify it worked. - New configs in prose are also added to `configs.md` with key, default, type, version added. - New operators in prose are also added to the operator reference in `understanding-comet-plans.md` and `operators.md`. - New expressions are reflected in `spark_expressions_support.md`. ### Accuracy - Config keys mentioned exist in code (grep `spark.comet.*` against the repo). - Operator names mentioned exist as classes. - Spark version claims (*"Spark 3.4+"*, *"Spark 4.0 only"*) match the version profiles the surrounding code lives in. - Code samples are syntactically valid for the language they claim (Scala / Python / SQL / shell). - Links resolve (relative paths exist; external links return 200). ### Voice and tone - Present tense, active voice, second person where appropriate. - No marketing fluff (*"blazing fast"*, *"seamlessly"*, *"powerful"*). - No future-tense aspirations in user-facing docs (*"will support"*, *"is planned to"*) — those belong in the roadmap, not the user guide. - No stale TODOs or `XXX` markers in user-visible files. ### Apache project hygiene - New `.md` files have the ASF license header. - No copyrighted third-party content without attribution. - Diagrams/images referenced exist in the repo. ### Comet-specific conventions - User-guide content lives under `docs/source/user-guide/latest/`, contributor-guide content under `docs/source/contributor-guide/`. Flag misplacements. - Spark version-specific content notes the version profile clearly. - Plan-output examples use real plan formatting (tree-form with `+-` connectors), not invented prose. - Cross-references use relative links within `docs/source/`. ## Scope The skill should support two modes: 1. **PR review mode** — given a PR number or branch, surface docs changes (and prose changes in code comments / config descriptions / Scaladoc) and audit them. 2. **Standalone audit mode** — given a file or directory under `docs/source/`, audit it as-is. In both modes the skill returns a prioritized list (must-fix / should-fix / nit) suitable for a reviewer to paste or paraphrase into review comments. ## Output format Mirror `review-comet-pr`: produce structured advisory feedback for the human, not direct edits. Categorize findings by severity. For each finding: - File and line reference. - The rule that was violated (link to the style guide section). - A concrete suggested rewrite when one is obvious. The skill must not edit files, post PR comments, or push branches. ## Open questions - Should the skill enforce stricter rules on user-guide content (lower tolerance for jargon, mandatory examples) than contributor-guide content (where domain-specific vocabulary is fine)? - Should the skill check generated docs (e.g. expression support tables generated by `make`) or only hand-written prose? - Auto-detection of the style guide source — pin to a specific doc path (e.g. `docs/source/contributor-guide/style_guide.md` once #4419 lands), or read the latest committed version on each invocation? - Is there value in a separate `audit-comet-docs` skill for full-site audits versus a single skill that handles both PR review and standalone audit modes? ## Dependencies - Style guide doc landed (depends on #4419). - Skill author needs read access to the docs tree and the repo's source for accuracy checks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
