Re: [PR] Adding node_id to ExecutionPlanProperties [datafusion]

via GitHub Thu, 19 Dec 2024 06:41:02 -0800


berkaysynnada commented on PR #12186:
URL: https://github.com/apache/datafusion/pull/12186#issuecomment-2554359314


   Hi @emgeee and @ameyc. I apologize again for the delay in providing 
feedback, but I believe we can iterate quickly now and align on the best design 
to suit every use case.
   
   Here is the approach we use, and I believe it could address your concerns 
while minimizing changes to the existing API's:
   
   We introduce generate_id and with_id methods to the ExecutionPlan trait:
   ```rust
   trait ExecutionPlan {
       ...
       fn generate_id(&self) -> Result<String> {
           let children = self.children();
           match children.as_slice() {
               [] => exec_err!("Source operator {:?} should implement the 
generate_id method", self.type_name()),
               [child] => {
                   let child_id = child.generate_id()?;
                   Ok(format!("[{child_id}]"))
               }
               children => children
                   .iter()
                   .map(|child| child.generate_id())
                   .collect::<Result<Vec<_>>>()
                   .map(|ids| {
                       let result = ids.join(", ");
                       format!("[{result}]")
                   }),
           }
       }
   
       fn with_id(self: Arc<Self>, id: String) -> Result<Option<Arc<dyn 
ExecutionPlan>>> {
           if self.children().is_empty() {
               not_impl_err!("Source operators must implement with_id() method")
           } else {
               Ok(None)
           }
       }
       ...
   }
   ```
   Source operators, such as CsvExec, implement these methods:
   ```rust
   impl ExecutionPlan for CsvExec {
       ...
       fn generate_id(&self) -> Result<String> {
           let Some(id) = self.id.as_ref() else {
               return exec_err!("Source ID is not set");
           };
           Ok(id.clone())
       }
   
       fn with_id(self: Arc<Self>, id: String) -> Result<Option<Arc<dyn 
ExecutionPlan>>> {
           let mut exec = self.as_ref().clone();
           exec.id = Some(id);
           Ok(Some(Arc::new(exec)))
       }
       ...
   }
   ```
   One significant advantage of this approach is that it introduces minimal API 
changes, focusing only on source operators. It also provides flexible usage 
through customizable nesting logic for ID's. For instance, relationships 
between partitioned inputs or child operators can be refined to align with use 
case needs.
   
   A potential limitation is the use of string for ID's. However, this choice 
offers flexibility for cases where human-readable identifiers are preferred. 
Additionally, it makes the child-parent relationships self-explanatory, which 
is beneficial for debugging and visualization purposes.
   
   In our use case, we construct a parallel tree to enrich operator metadata, 
which is for snapshotting. This secondary tree is built based on these ID's and 
is designed to remain isolated from the original execution plan. We have 
utility functions that streamline this process, and we’d be happy to contribute 
them upstream if needed.
   
   If this approach aligns with upstream requirements, I’d be glad to 
collaborate on refining and integrating it, along with the necessary utilities. 
Let me know if this sounds like a viable direction to proceed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Adding node_id to ExecutionPlanProperties [datafusion]

Reply via email to