andygrove opened a new issue, #4419:
URL: https://github.com/apache/datafusion-comet/issues/4419
## Problem
Comet's documentation and operator names use **"native"** to mean three
different things:
1. **Rust-implemented** — *"`Comet*` nodes that run natively in Rust"*
(`docs/source/user-guide/latest/understanding-comet-plans.md`)
2. **On the Comet pipeline** — *"Comet executes Spark's Scala and Java UDFs
on the native Comet path"*
(`docs/source/user-guide/latest/scala_java_udfs.md`), even though the UDF runs
as JVM bytecode produced by codegen
3. **Fully-Rust scan** — `CometNativeScan` (fully Rust) vs `CometBatchScan`
(DSv2 with JVM reader producing Arrow)
The shuffle naming is similarly muddled: `CometExchange` ("native shuffle")
and `CometColumnarExchange` ("JVM columnar shuffle") are *both* columnar and
*both* serialize via Arrow IPC, but only one of the operator names says
"Columnar."
This will get worse, not better. The roadmap calls for Java/Scala UDFs that
operate directly on Arrow columnar data; the Scala UDF codegen path already
lands JVM bytecode inside what we call "the native Comet path." Comet's value
proposition is shifting from *"we rewrote operators in Rust"* to *"we keep your
data Arrow-native end-to-end"* — and the nomenclature hasn't caught up.
## Proposed value prop
> Comet keeps Spark queries **Arrow-native end-to-end** — operators,
expressions, shuffle, and broadcast all stay in Apache Arrow columnar format,
avoiding the per-row overhead of Spark's row-based engine. Within the
Arrow-native pipeline, operators and expressions execute in either native Rust
code (via Apache DataFusion) or JVM code that operates directly on Arrow
batches.
## Proposed vocabulary
Three orthogonal concepts the docs currently conflate. We should name each
axis distinctly.
### Axis 1 — Pipeline membership
- **Comet pipeline** — operators and expressions Comet handles, regardless
of implementation language. Replaces "native Comet path," "on Comet,"
"accelerated by Comet."
- **Spark fallback** — drops out of the pipeline into row-based Spark
execution. (Already used consistently; keep it.)
### Axis 2 — Implementation language
Within the Arrow-native pipeline, expressions and operators may be:
- **Rust-implemented** / **native Rust** — Rust code via DataFusion.
- **JVM-implemented** — Scala/Java code, including codegen'd UDFs.
- **Hybrid** — JVM codegen with native callouts, where the JVM kernel is the
primary execution path and Rust is invoked via JNI for specific subexpressions.
The implementation language is an internal detail; from the query's
perspective the work is **Arrow-native** regardless. Hybrid impls trade JNI
crossing cost for native compute speedups on subexpressions where the win is
large enough to justify the boundary.
### Axis 3 — Data format
- **Arrow-native** — the unifying property of the Comet pipeline. Data is in
Apache Arrow columnar format throughout; no per-row materialization or
transition cost.
- **Arrow IPC** — the wire/disk format used by Arrow-native shuffle and
broadcast.
- **Spark rows** / **row-based** — `UnsafeRow`, the format outside the
pipeline.
- **Spark columnar** — non-Comet `ColumnarBatch` from a Spark vectorized
reader; requires a transition into Arrow-native.
- Reserve **"vectorized"** for SIMD-specific contexts (the Parquet reader).
Don't use it as a synonym for columnar.
### A new category to name
**Arrow-native JVM expression** — JVM code that operates directly on Arrow
batches (the Scala UDF codegen path; future Arrow UDFs; future hybrid
JVM/native impls). Distinct from existing **JVM-side plumbing operators**
(`CometUnion`, `CometCoalesce`, `CometCollectLimit`) which coordinate Arrow
batches but don't compute over column data.
### Rule for "native"
Bare **"native"** as a vague adjective in prose is banned; replace with the
specific axis it refers to. Compound forms where the binding word fixes the
meaning are permitted:
| Form | Meaning | OK? |
|---|---|---|
| *native Rust* / *Rust-native* | Rust implementation | yes |
| *Arrow-native* | Arrow columnar throughout | yes |
| *native shuffle* | Rust-implemented shuffle, paired with *JVM shuffle* |
yes |
| *native scan* (in `CometNativeScan` context) | Rust scan, paired with
`CometBatchScan` | yes |
| *runs natively* / *native execution* / *the native path* | ambiguous | no |
## Proposed plan-node taxonomy
The "three kinds of nodes" framing in `understanding-comet-plans.md` should
become **four**:
| Category | What it is | Example |
|---|---|---|
| Arrow-native Rust operator | Rust compute over Arrow batches |
`CometProject`, `CometHashAggregate`, `CometSort` |
| Arrow-native JVM expression | JVM compute over Arrow batches; may include
hybrid impls that call native code for subexpressions | Scala UDF codegen
(today); Arrow UDFs and hybrid JVM/native expressions (future) |
| Arrow-native JVM plumbing | JVM coordination of Arrow batches; no per-row
compute | `CometUnion`, `CometCoalesce`, `CometBroadcastExchange` |
| Spark fallback | Row-based Spark execution | `Project`, `HashAggregate`,
plain `Exchange` |
Hybrid JVM/native expressions fold into the "Arrow-native JVM expression"
row rather than getting a fifth category — the user-visible property is the
same (Arrow-native, executed via a JVM kernel), and treating the native callout
as an implementation detail keeps the taxonomy from sprawling.
## Proposed operator renames
Touches plan-stability goldens, external tooling, and user docs. Candidates,
ranked by how misleading the current name is:
| Current | Proposed | Reason |
|---|---|---|
| `CometExchange` | **`CometNativeShuffleExchange`** | Both shuffle
implementations are columnar and both use Arrow IPC; the differentiator is
implementation, not format |
| `CometColumnarExchange` | **`CometJvmShuffleExchange`** | Symmetric pair
with the above |
| `CometNativeExec` | Keep (or rename to `CometRustExec`) | Internal
wrapper, not user-visible in plans; lower priority |
| `CometNativeScan` | Keep | "Native" = Rust here, paired with
`CometBatchScan`; already unambiguous as a compound |
| `CometIcebergNativeScan`, `CometCsvNativeScan` | Keep | Same |
| `CometNativeColumnarToRow` | Keep | Distinguishes from JVM
`CometColumnarToRow`; consistent compound |
For renames that are user-visible in plans (`CometExchange` is the main
one), provide a deprecation alias for one minor release and update
plan-stability goldens in the same PR.
## Documentation rewrite scope
Files that need substantive prose changes:
- **`README.md`** — value-prop sentence using the Arrow-native framing
- **`docs/source/index.md`** — same
- **`docs/source/user-guide/latest/scala_java_udfs.md`** — replace "on the
native Comet path" with "in the Comet pipeline"; explain the path is
Arrow-native even though the UDF is JVM bytecode. Also clarify the existing
line *"e.g. `myUdf(upper(s))` runs as one native unit"* — the word "native"
here is exactly the kind of ambiguity this issue exists to fix.
- **`docs/source/user-guide/latest/understanding-comet-plans.md`** — extend
the three-category list to four; update the shuffle section to match the rename
- **`docs/source/contributor-guide/plugin_overview.md`** — "native Comet
operators" → "Rust-implemented operators"; explain that JVM Arrow expressions
(and future hybrid impls) slot into the same pipeline
- **`docs/source/contributor-guide/native_shuffle.md`** +
**`jvm_shuffle.md`** — align with new operator names; the file titles are
accurate so they can stay
- **`docs/source/about/gluten_comparison.md`** — the JVM-on-Arrow path is
now a Comet differentiator worth mentioning
## Migration plan
1. Land the docs vocabulary first (no code changes, no plan goldens touched).
2. Land operator renames with deprecation aliases in a single PR per pair.
3. Update plan-stability goldens in the rename PRs.
4. Remove deprecation aliases one minor release later.
## Open questions
- Should `CometNativeExec` rename in the same pass, or stay since it's
internal-only?
- Should the issue describe a parallel rename for any internal class names
(e.g. `CometNativeScanExec`) beyond what shows up in plan output?
- Any external tools that grep `CometExchange` we should ping before the
rename lands?
- **Hybrid JVM/native expressions** — the codegen-dispatch design opens the
door to JVM kernels that call into Rust via JNI for specific subexpressions
where the speedup outweighs the boundary cost. This isn't on the immediate
roadmap but the vocabulary should accommodate it without further naming churn.
The proposal treats hybrid impls as a flavor of *Arrow-native JVM expression*
rather than a separate category, on the principle that implementation-language
mix is an internal detail and the externally relevant property remains "the
work happens inside the Arrow-native pipeline." Worth confirming the community
agrees with this categorization before it becomes a contested rename later.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]