adriangb opened a new issue, #18477:
URL: https://github.com/apache/datafusion/issues/18477

   Workin on https://github.com/apache/datafusion/pull/18192 it became clear to 
me that there are both shortcomings in the current design and competing 
interests in terms of what is supported / what state is passed around. I 
thought it would be worth sitting down for a bit and trying to come up with a 
new design that satisfies these concerns.
   
   My design goals were:
   - More flexible pre/post processing. The current `Codec` system only 
operates as a "fallback" and doesn't support hooking into the moment before 
trying to deserialize or the moment after a successful deserialization. I have 
a lot of example use cases for this that I can share.
   - Optionality of features. Some use cases demand more information / 
capabilities for serialization / deserialization (e.g. distributed systems want 
a function registry, a cache similar to #18192, etc. while something that is 
just looking to cross the FFI boundary might be okay with no registry/state and 
just failing if it encounters a UDF).
   
   The design I landed on is something like this:
   
   ```rust
   // DataFusion code
   trait Deserializer {
       fn deserialize_physical_expr(&mut self, proto: PhysicalExprNode) -> 
Result<Arc<dyn PhysicalExpr>> {
           // most methods have default implementations so you only have to 
implement those you want to override
           deserialize_physical_expr(self, proto)
       }
   
       fn deserialize_udf(&mut self, proto: ScalarUdfExprNode) -> Result<()> {
           // default implementation errors, we don't have a registry to call. 
you must override if you do.
           internal_datafusion_err!("you must implement deserialize_udf 
yourself and put the function into a function registry")
       }
   }
   
   /// Both the top level entry point (if what you have is a PhysicalExprNode)
   /// and the default implementation that will get delegated back to 
recursively
   fn deserialize_physical_expr<D: Deserializer + ?Sized>(deserializer: &mut D, 
proto: PhysicalExprNode) -> Result<Arc<dyn PhysicalExpr>> {
       let children: Vec<PhysicalExprNode> = vec![]; // Placeholder for child 
nodes, from proto.children()?
       let deserialized_children = children
           .into_iter()
           // it's important here that we call back into the trait!
           .map(|child_proto| 
deserializer.deserialize_physical_expr(child_proto))
           .collect::<Result<Vec<_>>>()?;
       // Reconstruct the node with it's children
       todo!()
   }
   
   
   fn deserialize_udf<D: Deserializer + ?Sized>(deserializer: &mut D, registry: 
&dyn FunctionRegistry, proto: ScalarUdfExprNode) -> Result<()> {
       todo!();
   }
   
   // Users code: wants to override deserialization for PhysicalExpr
   struct MySerializer {
       field: String,
       cache: HashMap<String, Arc<dyn PhysicalExpr>>,
   }
   
   impl Deserializer for MySerializer {
       fn deserialize_physical_expr(&mut self, proto: PhysicalExprNode) -> 
Result<Arc<dyn PhysicalExpr>> {
           // can run arbitrary stuff before, short circuit, etc.
           println!("Deserializing using MySerializer {}", self.field);
           let res = deserialize_physical_expr(&mut *self, proto)?;
           // can run arbitrary stuff after and replace the result if necessary
           println!("Deserialized result: {:?}", res);
           Ok(res)
       }
   }
   
   // Users code: wants to support deserialization of UDFs
   struct MyDeserializerWithUDF<'a> {
       registry: &'a dyn FunctionRegistry,
   }
   
   impl Deserializer for MyDeserializerWithUDF<'_> {
       fn deserialize_physical_expr(&mut self, proto: PhysicalExprNode) -> 
Result<Arc<dyn PhysicalExpr>> {
           deserialize_physical_expr(self, proto)
       }
   
       fn deserialize_udf(&mut self, proto: ScalarUdfExprNode) -> Result<()> {
           deserialize_udf(self, self.registry, proto)
       }
   }
   ```
   
   Before continuing down this path I want go gauge appetite for this change, 
and understand if it satisfies the use cases currently out there or falls short.
   
   @timsaucer @milenkovicm @Jefffrey any thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to