aaron-ang commented on PR #19367:
URL: https://github.com/apache/datafusion/pull/19367#issuecomment-4416623759
hi @2010YOUY01 @alamb, i've pushed a rewrite following the `Documentation` /
`ScalarUDFImpl::documentation` pattern. One deliberate divergence I'd like to
check before extending to more operators.
**Mirror of UDF pipeline:**
| UDF | Metrics
|
| -------------------------------------------------- |
------------------------------------------------------ |
| `Documentation` / `DocumentationBuilder` |
`MetricsDocumentation` / `MetricsDocumentationBuilder` |
| `ScalarUDFImpl::documentation(&self)` |
`ExecutionPlan::metrics_documentation(&self)` |
| `#[user_doc(...)]` → `fn doc(&self)` | `#[metrics_doc(...)]`
→ `fn metrics_doc()` |
| `default_*_functions()` |
`default_metric_docs()` |
| `print_functions_docs` / `update_function_docs.sh` | `print_metric_docs` /
`update_metric_docs.sh` |
`inventory` / `build.rs` / attribute+derive layering are gone. Each operator
declares its docs via `#[metrics_doc]` next to the struct.
**Divergence: assoc fn vs `&self`**. UDF doc-gen iterates registered
instances (`Vec<Arc<ScalarUDF>>`) and dispatches via `&self`. `ExecutionPlan`s
aren't registered — to literally mirror, we'd need sample plans:
```rust
pub fn default_execution_plans() -> Vec<Arc<dyn ExecutionPlan>> {
let empty = Arc::new(EmptyExec::new(Arc::new(Schema::empty())));
let always_true =
Arc::new(Literal::new(ScalarValue::Boolean(Some(true))));
vec![
Arc::new(FilterExec::try_new(always_true,
Arc::clone(&empty)).unwrap()),
Arc::new(RepartitionExec::try_new(empty,
Partitioning::UnknownPartitioning(1)).unwrap()),
// ...one factory per operator...
]
}
```
While it is cheap for `FilterExec`/`RepartitionExec`, it will get complex
for `HashJoinExec` (matching schemas + `JoinFilter` + `JoinType`),
`AggregateExec` (grouping + aggregate exprs), etc.
The PoC has `#[metrics_doc]` emit an **assoc** `fn metrics_doc()` (no
`&self`); the trait method delegates. Trait method stays available for runtime
callers (`EXPLAIN ANALYZE` etc.); doc-gen skips instance-construction. Adding
an operator costs one attribute + one registry line.
| | Sample factories | Current PoC (assoc fn)
|
| ----------------- | -------------------------- |
------------------------------- |
| Trait API | `&self` only | `&self` + assoc
`metrics_doc()` |
| Per-operator cost | attribute + sample factory | attribute + 1 registry
line |
| Strict UDF parity | yes | no — two entry points
|
I leaned toward assoc-fn because the sample-factory cost scales with
operator complexity, and the trait method (the motivation in your original
suggestion) is still there. Happy to switch if strict parity matters more,
since PoC scope is currently small. Which way would you prefer before I extend
to the rest?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]