Hi, 

 

I read online[1] that for a best UDF performance it is possible to implement
them using internal Spark expressions, and I also saw a couple of pull
requests such as [2] and [3] where this was put to practice (not sure if for
that reason or just to extend the API). 

 

We have an algorithm that computes a score similar to what the Levenshtein
distance does and it takes about 30%-40% of the overall time of our job. We
are looking for ways to improve it without adding more resources.

 

I was wondering if it would be advisable to implement it extending
BinaryExpression like[1] and if it would result in any performance gains. 

 

Thanks for your help!

 

[1]
https://hackernoon.com/apache-spark-tips-and-tricks-for-better-performance-c
f2397cac11 

[2] https://github.com/apache/spark/pull/7214

[3] https://github.com/apache/spark/pull/7236 

 

Reply via email to