If your UDF itself is very CPU intensive, it probably won't make that much of difference, because the UDF itself will dwarf the serialization/deserialization overhead.
If your UDF is cheap, it will help tremendously. On Mon, Jan 20, 2020 at 6:33 PM, < em...@yeikel.com > wrote: > > > > Hi, > > > > > > > > I read online[1] that for a best UDF performance it is possible to > implement them using internal Spark expressions, and I also saw a couple > of pull requests such as [2] and [3] where this was put to practice (not > sure if for that reason or just to extend the API). > > > > > > > > We have an algorithm that computes a score similar to what the Levenshtein > distance does and it takes about 30%-40% of the overall time of our job. > We are looking for ways to improve it without adding more resources. > > > > > > > > I was wondering if it would be advisable to implement it extending > BinaryExpression > like[1] and if i t would result in any performance gains. > > > > > > > > Thanks for your help! > > > > > > > > [1] https:/ / hackernoon. com/ > apache-spark-tips-and-tricks-for-better-performance-cf2397cac11 > ( > https://hackernoon.com/apache-spark-tips-and-tricks-for-better-performance-cf2397cac11 > ) > > > > [2] https:/ / github. com/ apache/ spark/ pull/ 7214 ( > https://github.com/apache/spark/pull/7214 ) > > > > [3] https:/ / github. com/ apache/ spark/ pull/ 7236 ( > https://github.com/apache/spark/pull/7236 ) > > >