adriangb opened a new issue, #18477: URL: https://github.com/apache/datafusion/issues/18477
Workin on https://github.com/apache/datafusion/pull/18192 it became clear to me that there are both shortcomings in the current design and competing interests in terms of what is supported / what state is passed around. I thought it would be worth sitting down for a bit and trying to come up with a new design that satisfies these concerns. My design goals were: - More flexible pre/post processing. The current `Codec` system only operates as a "fallback" and doesn't support hooking into the moment before trying to deserialize or the moment after a successful deserialization. I have a lot of example use cases for this that I can share. - Optionality of features. Some use cases demand more information / capabilities for serialization / deserialization (e.g. distributed systems want a function registry, a cache similar to #18192, etc. while something that is just looking to cross the FFI boundary might be okay with no registry/state and just failing if it encounters a UDF). The design I landed on is something like this: ```rust // DataFusion code trait Deserializer { fn deserialize_physical_expr(&mut self, proto: PhysicalExprNode) -> Result<Arc<dyn PhysicalExpr>> { // most methods have default implementations so you only have to implement those you want to override deserialize_physical_expr(self, proto) } fn deserialize_udf(&mut self, proto: ScalarUdfExprNode) -> Result<()> { // default implementation errors, we don't have a registry to call. you must override if you do. internal_datafusion_err!("you must implement deserialize_udf yourself and put the function into a function registry") } } /// Both the top level entry point (if what you have is a PhysicalExprNode) /// and the default implementation that will get delegated back to recursively fn deserialize_physical_expr<D: Deserializer + ?Sized>(deserializer: &mut D, proto: PhysicalExprNode) -> Result<Arc<dyn PhysicalExpr>> { let children: Vec<PhysicalExprNode> = vec![]; // Placeholder for child nodes, from proto.children()? let deserialized_children = children .into_iter() // it's important here that we call back into the trait! .map(|child_proto| deserializer.deserialize_physical_expr(child_proto)) .collect::<Result<Vec<_>>>()?; // Reconstruct the node with it's children todo!() } fn deserialize_udf<D: Deserializer + ?Sized>(deserializer: &mut D, registry: &dyn FunctionRegistry, proto: ScalarUdfExprNode) -> Result<()> { todo!(); } // Users code: wants to override deserialization for PhysicalExpr struct MySerializer { field: String, cache: HashMap<String, Arc<dyn PhysicalExpr>>, } impl Deserializer for MySerializer { fn deserialize_physical_expr(&mut self, proto: PhysicalExprNode) -> Result<Arc<dyn PhysicalExpr>> { // can run arbitrary stuff before, short circuit, etc. println!("Deserializing using MySerializer {}", self.field); let res = deserialize_physical_expr(&mut *self, proto)?; // can run arbitrary stuff after and replace the result if necessary println!("Deserialized result: {:?}", res); Ok(res) } } // Users code: wants to support deserialization of UDFs struct MyDeserializerWithUDF<'a> { registry: &'a dyn FunctionRegistry, } impl Deserializer for MyDeserializerWithUDF<'_> { fn deserialize_physical_expr(&mut self, proto: PhysicalExprNode) -> Result<Arc<dyn PhysicalExpr>> { deserialize_physical_expr(self, proto) } fn deserialize_udf(&mut self, proto: ScalarUdfExprNode) -> Result<()> { deserialize_udf(self, self.registry, proto) } } ``` Before continuing down this path I want go gauge appetite for this change, and understand if it satisfies the use cases currently out there or falls short. @timsaucer @milenkovicm @Jefffrey any thoughts? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
