Hi,

I read online [1] that for the best UDF performance it is possible to implement UDFs using internal Spark expressions, and I also saw a couple of pull requests, such as [2] and [3], where this was put into practice (I am not sure whether that was done for performance or just to extend the API).

We have an algorithm that computes a score similar to the Levenshtein distance, and it accounts for about 30-40% of the overall time of our job. We are looking for ways to speed it up without adding more resources. Would it be advisable to implement it by extending BinaryExpression, as in [1], and would that result in any performance gains? A rough sketch of what I have in mind is included below, after the links.

Thanks for your help!

[1] https://hackernoon.com/apache-spark-tips-and-tricks-for-better-performance-cf2397cac11
[2] https://github.com/apache/spark/pull/7214
[3] https://github.com/apache/spark/pull/7236
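For concreteness, here is a minimal sketch modeled on Spark's built-in Levenshtein expression. The class name SimilarityScore and the use of UTF8String.levenshteinDistance as the scoring body are placeholders for our actual algorithm, and the similarityScore helper at the end is just my assumption about how such an expression could be exposed to the DataFrame API:

import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Expression, ImplicitCastInputTypes}
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
import org.apache.spark.sql.types.{AbstractDataType, DataType, IntegerType, StringType}
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical expression; modeled on the built-in Levenshtein expression.
case class SimilarityScore(left: Expression, right: Expression)
  extends BinaryExpression with ImplicitCastInputTypes {

  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType)
  override def dataType: DataType = IntegerType

  // Runs only when both inputs are non-null, and works directly on Spark's
  // internal UTF8String representation (no conversion to java.lang.String).
  override protected def nullSafeEval(l: Any, r: Any): Any = {
    // Placeholder scoring logic; our real algorithm would go here.
    l.asInstanceOf[UTF8String].levenshteinDistance(r.asInstanceOf[UTF8String])
  }

  // Emits Java source for code generation instead of falling back to
  // interpreted evaluation.
  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
    defineCodeGen(ctx, ev, (a, b) => s"$a.levenshteinDistance($b)")

  // Required on Spark 3.2+; drop this override on older versions.
  override protected def withNewChildrenInternal(
      newLeft: Expression, newRight: Expression): SimilarityScore =
    copy(left = newLeft, right = newRight)
}

// One way (I assume) to expose it to the DataFrame API:
// def similarityScore(a: Column, b: Column): Column =
//   new Column(SimilarityScore(a.expr, b.expr))

My (possibly wrong) understanding is that operating directly on UTF8String in nullSafeEval and participating in codegen through defineCodeGen is what avoids the conversion overhead a registered Scala UDF pays, but I would appreciate corrections if the time actually goes elsewhere.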
I read online[1] that for a best UDF performance it is possible to implement them using internal Spark expressions, and I also saw a couple of pull requests such as [2] and [3] where this was put to practice (not sure if for that reason or just to extend the API). We have an algorithm that computes a score similar to what the Levenshtein distance does and it takes about 30%-40% of the overall time of our job. We are looking for ways to improve it without adding more resources. I was wondering if it would be advisable to implement it extending BinaryExpression like[1] and if it would result in any performance gains. Thanks for your help! [1] https://hackernoon.com/apache-spark-tips-and-tricks-for-better-performance-c f2397cac11 [2] https://github.com/apache/spark/pull/7214 [3] https://github.com/apache/spark/pull/7236