mpurins-coralogix opened a new issue, #578: URL: https://github.com/apache/arrow-ballista/issues/578
**Describe the bug** When physical plan includes datafusion builtin function `starts_with` then it is serialized as UDF. **To Reproduce** Following test (can be added in somewhere in https://github.com/apache/arrow-ballista/blob/master/ballista/core/src/serde/physical_plan/mod.rs) should pass but it fails with `DataFusionError(Plan("There is no UDF named \"startswith\" in the registry"))` ```rust #[test] fn roundtrip_builtin_startswith_function() -> Result<()> { let field_a = Field::new("a", DataType::Utf8, false); let field_b = Field::new("b", DataType::Utf8, false); let schema = Arc::new(Schema::new(vec![field_a, field_b])); let input = Arc::new(EmptyExec::new(false, schema.clone())); let execution_props = ExecutionProps::new(); let fun_expr = functions::create_physical_fun( &BuiltinScalarFunction::StartsWith, &execution_props, )?; let expr = ScalarFunctionExpr::new( &format!("{}", BuiltinScalarFunction::StartsWith), // Note that this is how ballista creates this name in https://github.com/apache/arrow-datafusion/blob/master/datafusion/physical-expr/src/functions.rs#L183 fun_expr, vec![col("a", &schema)?, col("b", &schema)?], &DataType::Boolean, ); let project = ProjectionExec::try_new(vec![(Arc::new(expr), "a".to_string())], input)?; roundtrip_test(Arc::new(project)) } ``` **Expected behavior** Above test should pass **Additional context** This is pretty much happening because UDFs and builtin functions are represented with the same `ScalarFunctionExpr` and ballista when serializing uses function name (which is `startswith`) and `BuiltinScalarFunction::from_str` (which matches only on `starts_with`) to differentiate between builtin and user defined functions. While simplest fix probably would be in datafusion `BuiltinScalarFunction::from_str` to handle `startswith` as function name as well I believe that it would be better to fix it in some other way where builtin and udf functions are thrown in the same bag. Current approach means that in some given project code could unexpectedly break on datafusion updates if datafusion introduces new builtin function and project already had user defined function under the same name. I wonder if it would be reasonable to separate builtin and user defined functions already in datafusion physical plan. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
