[PR] feat: generate expression reference doc from code [datafusion-comet]

via GitHub Wed, 03 Jun 2026 12:21:02 -0700


andygrove opened a new pull request, #4585:
URL: https://github.com/apache/datafusion-comet/pull/4585


   ## Which issue does this PR close?
   
   N/A. Follow-on to #4583. This reduces drift and maintenance friction in the 
expression reference doc by generating it from code.
   
   > Note: this PR is stacked on #4583. Until that merges, this diff also 
contains its 2 prettier-formatting commits; they will drop out once #4583 lands.
   
   ## Rationale for this change
   
   `docs/source/user-guide/latest/expressions.md` was hand-maintained: every PR 
that added or changed an expression edited the tables by hand. That let the doc 
drift from the code (a function supported in code but still listed as planned, 
or a new Spark built-in never added) and made the large aligned tables prone to 
merge conflicts.
   
   The Compatibility Guide is already generated by `GenerateDocs` from each 
serde's `getCompatibleNotes` / `getIncompatibleReasons` / 
`getUnsupportedReasons`. This PR extends the same generator to also produce the 
expression reference, so the overview is derived from the code that actually 
decides support, and stays complete and current.
   
   ## What changes are included in this PR?
   
   - New pure helper `org.apache.comet.ExpressionReference`: status model, row 
resolution, table rendering, and Spark `FunctionRegistry` enumeration 
(unit-tested in isolation).
   - `GenerateDocs` extended to: enumerate every Spark built-in (with its 
group), derive Supported status and a Compatibility Guide link from the serde 
maps, and fall back to a curated status list for planned / not-planned 
functions. The curated list lives in `GenerateDocs.scala` on purpose: that file 
is excluded from the heavy CI path filters in `dev/ci/compute-changes.py`, so 
editing the list (for example when an issue is filed) does not trigger the 
Spark SQL and Iceberg jobs.
   - `expressions.md` per-group tables are now generated between 
`<!--BEGIN:EXPR_TABLE[group]-->` markers; the prose was updated to drop the 
"Incorrect by default" status.
   - Doc generation pinned to the Spark 4.1 profile (newest `FunctionRegistry`) 
in `dev/generate-release-docs.sh` and `docs/build.sh`.
   - The reference is a concise overview: it carries a short summary plus a 
link into the Compatibility Guide for detail, with no duplicated note text.
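
   To illustrate the resolution order described above (serde maps first, then 
the curated list, otherwise unclassified), here is a minimal sketch. Every name 
in it (`Status`, `Row`, `Curated`, `RowResolution.resolve`, and the parameter 
names) is hypothetical and chosen for illustration; it is not the actual 
`ExpressionReference` API, and the real status model may differ.

```scala
// Illustrative sketch only: all names here are assumptions, not the PR's API.
sealed trait Status
object Status {
  case object Supported extends Status
  case object Planned extends Status
  case object NotPlanned extends Status
  case object Unclassified extends Status
}

// One row of a generated per-group table.
final case class Row(name: String, status: Status, link: Option[String])

// A curated entry for a function that has no serde (planned / not planned).
final case class Curated(status: Status, issue: Option[String])

object RowResolution {
  // Resolution order: serde-backed functions win, then the curated list,
  // and anything left over is reported as Unclassified.
  def resolve(
      name: String,
      serdeBacked: Set[String],
      guidePages: Map[String, String],
      curated: Map[String, Curated]): Row =
    if (serdeBacked.contains(name)) {
      // A serde exists: Supported, with a guide link only if a page exists.
      Row(name, Status.Supported, guidePages.get(name))
    } else {
      curated.get(name) match {
        case Some(c) => Row(name, c.status, c.issue)
        case None => Row(name, Status.Unclassified, None)
      }
    }
}
```

   In this sketch an `Unclassified` row is the signal that a Spark built-in has 
neither a serde nor a curated entry, which is the case the generated doc is 
expected never to contain.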
   
   Known follow-ups (not in this PR): populate per-expression summary notes via 
a new `getExpressionSummary` (currently `None`, so serde-backed rows have 
sparse notes); add a CI check that fails when the generated doc is stale; 
rename the curated `PlannedExpr` type now that it also holds Supported entries.
   
   ## How are these changes tested?
   
   - `ExpressionReferenceSuite` covers the status model, every branch of row 
resolution (serde + link, serde without page, planned + issue, not-planned, 
unclassified), and rendering.
   - `FunctionRegistryEnumerationSuite` verifies enumeration against real Spark 
built-ins.
   - Regeneration is idempotent (re-running the generator produces no diff), 
the generated doc has zero unclassified rows, and all tracking-issue links were 
verified to match the prior hand-written doc exactly.
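
   The idempotency property can be sketched as follows, assuming the generator 
replaces only the text between a group's markers. The `MarkerRegen.regenerate` 
helper is hypothetical, and the closing marker `<!--END:EXPR_TABLE-->` is an 
assumption: the description above names only the `BEGIN` marker.

```scala
// Illustrative sketch only: the helper name and the END marker spelling
// are assumptions, not taken from GenerateDocs.
object MarkerRegen {
  // Replace everything between the BEGIN marker for `group` and the next
  // END marker with `table`. Text outside the markers is kept verbatim.
  def regenerate(doc: String, group: String, table: String): String = {
    val begin = s"<!--BEGIN:EXPR_TABLE[$group]-->"
    val end = "<!--END:EXPR_TABLE-->" // assumed closing marker
    val b = doc.indexOf(begin)
    require(b >= 0, s"missing begin marker for group $group")
    val bodyStart = b + begin.length
    val e = doc.indexOf(end, bodyStart)
    require(e >= 0, s"missing end marker for group $group")
    doc.substring(0, bodyStart) + "\n" + table.trim + "\n" + doc.substring(e)
  }
}
```

   Because the output depends only on the marker positions and the rendered 
table, running the function a second time on its own output with the same table 
returns the same string, which is the "no diff on re-run" behaviour described 
above.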
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

