Hi,

At LinkedIn, we have benchmarks showing that UDFs written against the
Expression API outperform Hive Generic UDFs. (I am not sure which API
you used to implement your baseline, but I expect it was Scala UDFs or
Hive Generic UDFs.) In fact, we have built a full-fledged UDF API
(scalar only for now) on top of Spark expressions and internal rows.
You may take a look at it [1]. The same API is reusable for some other
engines and data formats.

[1] https://github.com/linkedin/transport
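
For reference, the kind of per-row kernel discussed in the quoted
message can be sketched in plain Java. This is a minimal, hypothetical
example of a standard two-row Levenshtein distance; the class name
LevenshteinSketch is made up for illustration, and it is neither the
poster's actual scoring algorithm (which is not shown in the thread)
nor Spark's own implementation. A kernel like this is independent of
the API that wraps it, so the same method could be called from a Scala
UDF, a Hive Generic UDF, or an expression extending BinaryExpression.

```java
// Hypothetical helper for illustration only; not the poster's algorithm
// and not code taken from Spark or from the Transport project.
final class LevenshteinSketch {
    private LevenshteinSketch() {}

    /**
     * Edit distance between a and b (insert, delete, substitute, each cost 1).
     * Compares UTF-16 code units, so characters outside the BMP count as two units.
     */
    static int distance(String a, String b) {
        // Make b the shorter string so the two rolling rows stay small.
        if (a.length() < b.length()) {
            String tmp = a;
            a = b;
            b = tmp;
        }
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Reuse the two arrays instead of allocating a new row per step.
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }
}
```

For example, LevenshteinSketch.distance("kitten", "sitting") evaluates
to 3. Whether wrapping such a kernel in an expression yields a gain for
a given job would still need to be measured against the existing
baseline.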

Thanks,
Walaa.




On Mon, Jan 20, 2020 at 6:34 PM <em...@yeikel.com> wrote:
>
> Hi,
>
> I read online [1] that, for the best UDF performance, it is possible to
> implement them using internal Spark expressions. I also saw a couple of
> pull requests, such as [2] and [3], where this was put into practice (I am
> not sure whether for that reason or just to extend the API).
>
> We have an algorithm that computes a score similar to the Levenshtein
> distance, and it takes about 30%-40% of the overall time of our job. We
> are looking for ways to improve it without adding more resources.
>
> I was wondering whether it would be advisable to implement it by extending
> BinaryExpression, as in [1], and whether that would result in any
> performance gains.
>
> Thanks for your help!
>
> [1] https://hackernoon.com/apache-spark-tips-and-tricks-for-better-performance-cf2397cac11
>
> [2] https://github.com/apache/spark/pull/7214
>
> [3] https://github.com/apache/spark/pull/7236
>
>
