[I] Establish nomenclature style guide and audit operator names for clarity [datafusion-comet]

via GitHub Sun, 24 May 2026 16:15:45 -0700


andygrove opened a new issue, #4419:
URL: https://github.com/apache/datafusion-comet/issues/4419


   ## Problem
   
   Comet's documentation and operator names use **"native"** to mean three 
different things:
   
   1. **Rust-implemented** — *"`Comet*` nodes that run natively in Rust"* 
(`docs/source/user-guide/latest/understanding-comet-plans.md`)
   2. **On the Comet pipeline** — *"Comet executes Spark's Scala and Java UDFs 
on the native Comet path"* 
(`docs/source/user-guide/latest/scala_java_udfs.md`), even though the UDF runs 
as JVM bytecode produced by codegen
   3. **Fully-Rust scan** — `CometNativeScan` (fully Rust) vs `CometBatchScan` 
(DSv2 with JVM reader producing Arrow)
   
   The shuffle naming is similarly muddled: `CometExchange` ("native shuffle") 
and `CometColumnarExchange` ("JVM columnar shuffle") are *both* columnar and 
*both* serialize via Arrow IPC, but only one of the operator names says 
"Columnar."
   
   This will get worse, not better. The roadmap calls for Java/Scala UDFs that 
operate directly on Arrow columnar data; the Scala UDF codegen path already 
lands JVM bytecode inside what we call "the native Comet path." Comet's value 
proposition is shifting from *"we rewrote operators in Rust"* to *"we keep your 
data Arrow-native end-to-end"* — and the nomenclature hasn't caught up.
   
   ## Proposed value prop
   
   > Comet keeps Spark queries **Arrow-native end-to-end** — operators, 
expressions, shuffle, and broadcast all stay in Apache Arrow columnar format, 
avoiding the per-row overhead of Spark's row-based engine. Within the 
Arrow-native pipeline, operators and expressions execute in either native Rust 
code (via Apache DataFusion) or JVM code that operates directly on Arrow 
batches.
   
   ## Proposed vocabulary
   
   Three orthogonal concepts the docs currently conflate. We should name each 
axis distinctly.
   
   ### Axis 1 — Pipeline membership
   
   - **Comet pipeline** — operators and expressions Comet handles, regardless 
of implementation language. Replaces "native Comet path," "on Comet," 
"accelerated by Comet."
   - **Spark fallback** — drops out of the pipeline into row-based Spark 
execution. (Already used consistently; keep it.)
   
   ### Axis 2 — Implementation language
   
   Within the Arrow-native pipeline, expressions and operators may be:
   
   - **Rust-implemented** / **native Rust** — Rust code via DataFusion.
   - **JVM-implemented** — Scala/Java code, including codegen'd UDFs.
   - **Hybrid** — JVM codegen with native callouts, where the JVM kernel is the 
primary execution path and Rust is invoked via JNI for specific subexpressions.
   
   The implementation language is an internal detail; from the query's 
perspective the work is **Arrow-native** regardless. Hybrid impls trade JNI 
crossing cost for native compute speedups on subexpressions where the win is 
large enough to justify the boundary.
   
   ### Axis 3 — Data format
   
   - **Arrow-native** — the unifying property of the Comet pipeline. Data is in 
Apache Arrow columnar format throughout; no per-row materialization or 
transition cost.
   - **Arrow IPC** — the wire/disk format used by Arrow-native shuffle and 
broadcast.
   - **Spark rows** / **row-based** — `UnsafeRow`, the format outside the 
pipeline.
   - **Spark columnar** — non-Comet `ColumnarBatch` from a Spark vectorized 
reader; requires a transition into Arrow-native.
   - Reserve **"vectorized"** for SIMD-specific contexts (the Parquet reader). 
Don't use it as a synonym for columnar.
   
   ### A new category to name
   
   **Arrow-native JVM expression** — JVM code that operates directly on Arrow 
batches (the Scala UDF codegen path; future Arrow UDFs; future hybrid 
JVM/native impls). Distinct from existing **JVM-side plumbing operators** 
(`CometUnion`, `CometCoalesce`, `CometCollectLimit`) which coordinate Arrow 
batches but don't compute over column data.
   
   ### Rule for "native"
   
   Bare **"native"** as a vague adjective in prose is banned; replace with the 
specific axis it refers to. Compound forms where the binding word fixes the 
meaning are permitted:
   
   | Form | Meaning | OK? |
   |---|---|---|
   | *native Rust* / *Rust-native* | Rust implementation | yes |
   | *Arrow-native* | Arrow columnar throughout | yes |
   | *native shuffle* | Rust-implemented shuffle, paired with *JVM shuffle* | 
yes |
   | *native scan* (in `CometNativeScan` context) | Rust scan, paired with 
`CometBatchScan` | yes |
   | *runs natively* / *native execution* / *the native path* | ambiguous | no |
   
   ## Proposed plan-node taxonomy
   
   The "three kinds of nodes" framing in `understanding-comet-plans.md` should 
become **four**:
   
   | Category | What it is | Example |
   |---|---|---|
   | Arrow-native Rust operator | Rust compute over Arrow batches | 
`CometProject`, `CometHashAggregate`, `CometSort` |
   | Arrow-native JVM expression | JVM compute over Arrow batches; may include 
hybrid impls that call native code for subexpressions | Scala UDF codegen 
(today); Arrow UDFs and hybrid JVM/native expressions (future) |
   | Arrow-native JVM plumbing | JVM coordination of Arrow batches; no per-row 
compute | `CometUnion`, `CometCoalesce`, `CometBroadcastExchange` |
   | Spark fallback | Row-based Spark execution | `Project`, `HashAggregate`, 
plain `Exchange` |
   
   Hybrid JVM/native expressions fold into the "Arrow-native JVM expression" 
row rather than getting a fifth category — the user-visible property is the 
same (Arrow-native, executed via a JVM kernel), and treating the native callout 
as an implementation detail keeps the taxonomy from sprawling.
   
   ## Proposed operator renames
   
   Touches plan-stability goldens, external tooling, and user docs. Candidates, 
ranked by how misleading the current name is:
   
   | Current | Proposed | Reason |
   |---|---|---|
   | `CometExchange` | **`CometNativeShuffleExchange`** | Both shuffle 
implementations are columnar and both use Arrow IPC; the differentiator is 
implementation, not format |
   | `CometColumnarExchange` | **`CometJvmShuffleExchange`** | Symmetric pair 
with the above |
   | `CometNativeExec` | Keep (or rename to `CometRustExec`) | Internal 
wrapper, not user-visible in plans; lower priority |
   | `CometNativeScan` | Keep | "Native" = Rust here, paired with 
`CometBatchScan`; already unambiguous as a compound |
   | `CometIcebergNativeScan`, `CometCsvNativeScan` | Keep | Same |
   | `CometNativeColumnarToRow` | Keep | Distinguishes from JVM 
`CometColumnarToRow`; consistent compound |
   
   For renames that are user-visible in plans (`CometExchange` is the main 
one), provide a deprecation alias for one minor release and update 
plan-stability goldens in the same PR.
   
   ## Documentation rewrite scope
   
   Files that need substantive prose changes:
   
   - **`README.md`** — value-prop sentence using the Arrow-native framing
   - **`docs/source/index.md`** — same
   - **`docs/source/user-guide/latest/scala_java_udfs.md`** — replace "on the 
native Comet path" with "in the Comet pipeline"; explain the path is 
Arrow-native even though the UDF is JVM bytecode. Also clarify the existing 
line *"e.g. `myUdf(upper(s))` runs as one native unit"* — the word "native" 
here is exactly the kind of ambiguity this issue exists to fix.
   - **`docs/source/user-guide/latest/understanding-comet-plans.md`** — extend 
the three-category list to four; update the shuffle section to match the rename
   - **`docs/source/contributor-guide/plugin_overview.md`** — "native Comet 
operators" → "Rust-implemented operators"; explain that JVM Arrow expressions 
(and future hybrid impls) slot into the same pipeline
   - **`docs/source/contributor-guide/native_shuffle.md`** + 
**`jvm_shuffle.md`** — align with new operator names; the file titles are 
accurate so they can stay
   - **`docs/source/about/gluten_comparison.md`** — the JVM-on-Arrow path is 
now a Comet differentiator worth mentioning
   
   ## Migration plan
   
   1. Land the docs vocabulary first (no code changes, no plan goldens touched).
   2. Land operator renames with deprecation aliases in a single PR per pair.
   3. Update plan-stability goldens in the rename PRs.
   4. Remove deprecation aliases one minor release later.
   
   ## Open questions
   
   - Should `CometNativeExec` rename in the same pass, or stay since it's 
internal-only?
   - Should the issue describe a parallel rename for any internal class names 
(e.g. `CometNativeScanExec`) beyond what shows up in plan output?
   - Any external tools that grep `CometExchange` we should ping before the 
rename lands?
   - **Hybrid JVM/native expressions** — the codegen-dispatch design opens the 
door to JVM kernels that call into Rust via JNI for specific subexpressions 
where the speedup outweighs the boundary cost. This isn't on the immediate 
roadmap but the vocabulary should accommodate it without further naming churn. 
The proposal treats hybrid impls as a flavor of *Arrow-native JVM expression* 
rather than a separate category, on the principle that implementation-language 
mix is an internal detail and the externally relevant property remains "the 
work happens inside the Arrow-native pipeline." Worth confirming the community 
agrees with this categorization before it becomes a contested rename later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Establish nomenclature style guide and audit operator names for clarity [datafusion-comet]

Reply via email to