aaron-ang commented on PR #19367:
URL: https://github.com/apache/datafusion/pull/19367#issuecomment-4416623759

   hi @2010YOUY01 @alamb, i've pushed a rewrite following the `Documentation` / 
`ScalarUDFImpl::documentation` pattern. One deliberate divergence I'd like to 
check before extending to more operators.
   
   **Mirror of UDF pipeline:**
   
   | UDF                                                | Metrics               
                                 |
   | -------------------------------------------------- | 
------------------------------------------------------ |
   | `Documentation` / `DocumentationBuilder`           | 
`MetricsDocumentation` / `MetricsDocumentationBuilder` |
   | `ScalarUDFImpl::documentation(&self)`              | 
`ExecutionPlan::metrics_documentation(&self)`          |
   | `#[user_doc(...)]` → `fn doc(&self)`               | `#[metrics_doc(...)]` 
→ `fn metrics_doc()`             |
   | `default_*_functions()`                            | 
`default_metric_docs()`                                |
   | `print_functions_docs` / `update_function_docs.sh` | `print_metric_docs` / 
`update_metric_docs.sh`          |
   
   `inventory` / `build.rs` / attribute+derive layering are gone. Each operator 
declares its docs via `#[metrics_doc]` next to the struct.
   
   **Divergence: assoc fn vs `&self`**. UDF doc-gen iterates registered 
instances (`Vec<Arc<ScalarUDF>>`) and dispatches via `&self`. `ExecutionPlan`s 
aren't registered — to literally mirror, we'd need sample plans:
   
   ```rust
   pub fn default_execution_plans() -> Vec<Arc<dyn ExecutionPlan>> {
       let empty = Arc::new(EmptyExec::new(Arc::new(Schema::empty())));
       let always_true = 
Arc::new(Literal::new(ScalarValue::Boolean(Some(true))));
       vec![
           Arc::new(FilterExec::try_new(always_true, 
Arc::clone(&empty)).unwrap()),
           Arc::new(RepartitionExec::try_new(empty, 
Partitioning::UnknownPartitioning(1)).unwrap()),
           // ...one factory per operator...
       ]
   }
   ```
   
   While it is cheap for `FilterExec`/`RepartitionExec`, it will get complex 
for `HashJoinExec` (matching schemas + `JoinFilter` + `JoinType`), 
`AggregateExec` (grouping + aggregate exprs), etc.
   
   The PoC has `#[metrics_doc]` emit an **assoc** `fn metrics_doc()` (no 
`&self`); the trait method delegates. Trait method stays available for runtime 
callers (`EXPLAIN ANALYZE` etc.); doc-gen skips instance-construction. Adding 
an operator costs one attribute + one registry line.
   
   |                   | Sample factories           | Current PoC (assoc fn)    
      |
   | ----------------- | -------------------------- | 
------------------------------- |
   | Trait API         | `&self` only               | `&self` + assoc 
`metrics_doc()` |
   | Per-operator cost | attribute + sample factory | attribute + 1 registry 
line     |
   | Strict UDF parity | yes                        | no — two entry points     
      |
   
   I leaned toward assoc-fn because the sample-factory cost scales with 
operator complexity, and the trait method (the motivation in your original 
suggestion) is still there. Happy to switch if strict parity matters more, 
since PoC scope is currently small. Which way would you prefer before I extend 
to the rest?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to