If your UDF itself is very CPU intensive, it probably won't make that much of a
difference, because the UDF's own computation will dwarf the
serialization/deserialization overhead.

If your UDF is cheap, it will help tremendously.
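
As a rough way to reason about this, here is a minimal Amdahl's-law style sketch. The 0.35 share is the midpoint of the 30%-40% quoted in the question below; the overhead fractions are made-up values for illustration, not measurements, and the function name is hypothetical.

```python
def job_speedup(udf_share, overhead_fraction):
    """Upper bound on whole-job speedup if the UDF's
    serialization/deserialization overhead were removed entirely.

    udf_share: fraction of total job time spent in the UDF (0..1)
    overhead_fraction: assumed fraction of the UDF's time that is
        serialization overhead rather than the computation itself (0..1)
    """
    if not (0.0 <= udf_share <= 1.0 and 0.0 <= overhead_fraction <= 1.0):
        raise ValueError("both arguments must be between 0 and 1")
    saved = udf_share * overhead_fraction  # share of the job that disappears
    if saved >= 1.0:
        raise ValueError("cannot remove the entire job")
    return 1.0 / (1.0 - saved)


# 0.35 comes from the question; the overhead values are assumptions.
for overhead in (0.1, 0.5, 0.9):
    bound = job_speedup(0.35, overhead)
    print(f"overhead={overhead:.0%} -> job speedup at most {bound:.2f}x")
```

A CPU-heavy UDF corresponds to a small overhead fraction, so the bound stays close to 1x; a cheap UDF corresponds to a large one. Profiling the job is the only way to know which case applies.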

On Mon, Jan 20, 2020 at 6:33 PM, < em...@yeikel.com > wrote:

> Hi,
> 
> I read online [1] that, for the best UDF performance, it is possible to
> implement UDFs using internal Spark expressions, and I also saw a couple
> of pull requests, such as [2] and [3], where this was put into practice
> (not sure whether for that reason or just to extend the API).
> 
> We have an algorithm that computes a score similar to what the Levenshtein
> distance does and it takes about 30%-40% of the overall time of our job.
> We are looking for ways to improve it without adding more resources.
> 
> I was wondering if it would be advisable to implement it by extending
> BinaryExpression, as in [1], and if it would result in any performance
> gains.
> 
> Thanks for your help!
> 
> [1] https://hackernoon.com/apache-spark-tips-and-tricks-for-better-performance-cf2397cac11
> 
> [2] https://github.com/apache/spark/pull/7214
> 
> [3] https://github.com/apache/spark/pull/7236
