thinkharderdev commented on issue #8706:
URL: 
https://github.com/apache/arrow-datafusion/issues/8706#issuecomment-1910968023

   Hey @yyy1000, I think there is some work to do here. Currently the 
serialization for a UDF looks like this:
   
   ```
                       ScalarFunctionDefinition::UDF(fun) => Self {
                           expr_type: Some(ExprType::ScalarUdfExpr(
                               protobuf::ScalarUdfExprNode {
                                   fun_name: fun.name().to_string(),
                                   args,
                               },
                           )),
                       },
   ```
   
   That is, we serialize only the `name` and assume that whatever deserializes 
it has a registry where it can look up the scalar function definition by 
name. 
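
To make the name-only scheme concrete, here is a minimal standard-library sketch. It does not use the DataFusion types: `ScalarFn`, `Registry`, `AddOne`, `serialize` and `deserialize` are all hypothetical stand-ins invented for illustration. It only shows that the wire format carries the name and that decoding fails unless the receiving side already has a matching entry.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical stand-in for a scalar UDF; not the DataFusion trait.
trait ScalarFn {
    fn name(&self) -> &str;
    fn call(&self, input: i64) -> i64;
}

struct AddOne;

impl ScalarFn for AddOne {
    fn name(&self) -> &str {
        "add_one"
    }
    fn call(&self, input: i64) -> i64 {
        input + 1
    }
}

// Hypothetical registry that maps a function name to its definition.
#[derive(Default)]
struct Registry {
    udfs: HashMap<String, Arc<dyn ScalarFn>>,
}

impl Registry {
    fn register(&mut self, udf: Arc<dyn ScalarFn>) {
        self.udfs.insert(udf.name().to_string(), udf);
    }

    fn lookup(&self, name: &str) -> Result<Arc<dyn ScalarFn>, String> {
        self.udfs
            .get(name)
            .cloned()
            .ok_or_else(|| format!("no UDF named '{}' is registered", name))
    }
}

// "Serialization" keeps only the name; any state inside the UDF is dropped.
fn serialize(udf: &dyn ScalarFn) -> String {
    udf.name().to_string()
}

// "Deserialization" succeeds only if the receiver registered the same name.
fn deserialize(registry: &Registry, wire: &str) -> Result<Arc<dyn ScalarFn>, String> {
    registry.lookup(wire)
}
```

A stateless function such as `AddOne` survives this round trip, but nothing inside the UDF struct is ever written to the wire.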
   
   But ideally we would be able to serialize a custom scalar function that has 
some sort of associated state. For example, a regex scalar function that 
actually contains the compiled regex in its struct definition, like:
   
   ```
   struct MyRegexUdf {
      // just assume we have some serialization of the regex state machine here
      compiled_regex: Vec<u8>,
   }
   
   impl ScalarUDFImpl for MyRegexUdf {
     fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
        // do something with compiled regex here
     }
   }
   
   ```
   
   Currently the mechanism that is used for this kind of thing is to define a 
custom `LogicalExtensionCodec`. So here I think we would add methods to that 
trait like 
   
   ```
   impl LogicalExtensionCodec for DefaultLogicalExtensionCodec {
       // The default codec knows nothing about custom UDFs, so both methods fail.
       fn try_encode_scalar_udf(
           &self,
           _node: Arc<dyn ScalarUDFImpl>,
           _buf: &mut Vec<u8>,
       ) -> Result<()> {
           not_impl_err!("LogicalExtensionCodec is not provided")
       }

       fn try_decode_scalar_udf(
           &self,
           _buf: &[u8],
           _ctx: &SessionContext,
       ) -> Result<Arc<dyn ScalarUDFImpl>> {
           not_impl_err!("LogicalExtensionCodec is not provided")
       }
   }
   
   ```
   
   So then I would be able to define my own UDFs that contain internal state 
and then define an extension codec like
   ```
   struct MyLogicalExtensionCodec;
   
   impl LogicalExtensionCodec for MyLogicalExtensionCodec {
       fn try_encode_scalar_udf(
           &self,
           node: Arc<dyn ScalarUDFImpl>,
           buf: &mut Vec<u8>,
       ) -> Result<()> {
           // Only UDFs this codec recognizes can be encoded.
           if let Some(regex_udf) = node.as_any().downcast_ref::<MyRegexUdf>() {
               let proto = MyRegexUdfProto {
                   compiled_regex: regex_udf.compiled_regex.clone(),
               };

               proto.encode(buf)?;

               Ok(())
           } else {
               not_impl_err!("LogicalExtensionCodec is not provided")
           }
       }

       fn try_decode_scalar_udf(
           &self,
           buf: &[u8],
           _ctx: &SessionContext,
       ) -> Result<Arc<dyn ScalarUDFImpl>> {
           // Rebuild the UDF, including its state, from the encoded bytes.
           if let Ok(proto) = MyRegexUdfProto::decode(buf) {
               Ok(Arc::new(MyRegexUdf { compiled_regex: proto.compiled_regex }))
           } else {
               not_impl_err!("LogicalExtensionCodec is not provided")
           }
       }
   }
   
   ```
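
The downcast-and-encode pattern above can be tried without DataFusion or prost. The following is only a sketch under stated assumptions: `Udf`, `UdfCodec`, `MyCodec`, `Stateless` and the one-byte `REGEX_TAG` framing are hypothetical stand-ins for `ScalarUDFImpl`, `LogicalExtensionCodec` and the protobuf message, chosen so the example compiles with the standard library alone.

```rust
use std::any::Any;
use std::sync::Arc;

// Hypothetical stand-in for `ScalarUDFImpl`, reduced to what the sketch needs.
trait Udf {
    fn as_any(&self) -> &dyn Any;
    fn name(&self) -> &str;
}

struct MyRegexUdf {
    compiled_regex: Vec<u8>,
}

impl Udf for MyRegexUdf {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn name(&self) -> &str {
        "my_regex"
    }
}

// A UDF the codec below does not know about.
struct Stateless;

impl Udf for Stateless {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn name(&self) -> &str {
        "stateless"
    }
}

// Hypothetical codec trait mirroring the two proposed methods.
trait UdfCodec {
    fn try_encode_udf(&self, node: &dyn Udf, buf: &mut Vec<u8>) -> Result<(), String>;
    fn try_decode_udf(&self, buf: &[u8]) -> Result<Arc<dyn Udf>, String>;
}

// Assumed framing: one tag byte followed by the raw state bytes.
const REGEX_TAG: u8 = 1;

struct MyCodec;

impl UdfCodec for MyCodec {
    fn try_encode_udf(&self, node: &dyn Udf, buf: &mut Vec<u8>) -> Result<(), String> {
        // Downcast to the concrete type to reach its internal state.
        match node.as_any().downcast_ref::<MyRegexUdf>() {
            Some(regex_udf) => {
                buf.push(REGEX_TAG);
                buf.extend_from_slice(&regex_udf.compiled_regex);
                Ok(())
            }
            None => Err(format!("cannot encode UDF '{}'", node.name())),
        }
    }

    fn try_decode_udf(&self, buf: &[u8]) -> Result<Arc<dyn Udf>, String> {
        match buf.split_first() {
            Some((tag, state)) if *tag == REGEX_TAG => Ok(Arc::new(MyRegexUdf {
                compiled_regex: state.to_vec(),
            })),
            _ => Err("unrecognized UDF payload".to_string()),
        }
    }
}
```

Encoding a `MyRegexUdf` and decoding the resulting bytes yields a new `MyRegexUdf` with the same `compiled_regex`, while an unknown UDF such as `Stateless` is rejected at encode time.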
   
   However, this doesn't play very nicely with how the serde is currently 
defined, because we have no way to get a `LogicalExtensionCodec` into our `impl 
TryFrom<&Expr> for protobuf::LogicalExprNode`, which we would need.
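
The constraint comes from the shape of `TryFrom`: `try_from` takes only the value being converted, so there is no parameter through which a codec could be passed. The sketch below illustrates this with hypothetical, heavily simplified stand-ins (`Expr`, `ExprNode`, `Codec`, `CopyCodec`); none of them are the real DataFusion types. The free function `serialize_expr` is one possible way around the limitation, shown as an assumption rather than something proposed in this comment.

```rust
use std::convert::TryFrom;

// Hypothetical stand-in for `Expr`.
enum Expr {
    Literal(i64),
    Udf { name: String, state: Vec<u8> },
}

// Hypothetical stand-in for `protobuf::LogicalExprNode`.
#[derive(Debug, PartialEq)]
enum ExprNode {
    Literal(i64),
    Udf { name: String, payload: Vec<u8> },
}

// Hypothetical codec hook, reduced to the encode side.
trait Codec {
    fn encode_udf(&self, state: &[u8], buf: &mut Vec<u8>) -> Result<(), String>;
}

// A codec that copies the state bytes verbatim.
struct CopyCodec;

impl Codec for CopyCodec {
    fn encode_udf(&self, state: &[u8], buf: &mut Vec<u8>) -> Result<(), String> {
        buf.extend_from_slice(state);
        Ok(())
    }
}

// `try_from` receives only the expression, so the UDF state has nowhere to go.
impl TryFrom<&Expr> for ExprNode {
    type Error = String;

    fn try_from(expr: &Expr) -> Result<Self, Self::Error> {
        match expr {
            Expr::Literal(v) => Ok(ExprNode::Literal(*v)),
            Expr::Udf { name, .. } => Ok(ExprNode::Udf {
                name: name.clone(),
                payload: Vec::new(),
            }),
        }
    }
}

// A free function can accept the codec as an extra argument.
fn serialize_expr(expr: &Expr, codec: &dyn Codec) -> Result<ExprNode, String> {
    match expr {
        Expr::Literal(v) => Ok(ExprNode::Literal(*v)),
        Expr::Udf { name, state } => {
            let mut payload = Vec::new();
            codec.encode_udf(state, &mut payload)?;
            Ok(ExprNode::Udf {
                name: name.clone(),
                payload,
            })
        }
    }
}
```

With the `TryFrom` path the UDF node comes out with an empty payload, whereas `serialize_expr` with `CopyCodec` carries the state bytes through.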

