Is there any documentation/sample about this besides the pull requests merged into Spark core?
It seems that I need to create my custom functions under the org.apache.spark.sql.* package in order to access some of the internal classes I saw in [1], such as Column [2]. Could you please confirm whether that is how it should be done? Thanks!

[1] https://github.com/apache/spark/pull/7214
[2] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L37

From: Reynold Xin <r...@databricks.com>
Sent: Wednesday, January 22, 2020 2:22 AM
To: em...@yeikel.com
Cc: dev@spark.apache.org
Subject: Re: [SQL] Is it worth it (and advisable) to implement native UDFs?

If your UDF itself is very CPU intensive, it probably won't make that much of a difference, because the UDF itself will dwarf the serialization/deserialization overhead. If your UDF is cheap, it will help tremendously.

On Mon, Jan 20, 2020 at 6:33 PM, <em...@yeikel.com> wrote:

Hi,

I read online [1] that for the best UDF performance it is possible to implement UDFs using internal Spark expressions, and I also saw a couple of pull requests, such as [2] and [3], where this was put into practice (I am not sure whether that was the reason or whether they were simply extending the API).

We have an algorithm that computes a score similar to the Levenshtein distance, and it takes about 30%-40% of the overall time of our job. We are looking for ways to improve it without adding more resources.

I was wondering whether it would be advisable to implement it by extending BinaryExpression as described in [1], and whether that would result in any performance gains.

Thanks for your help!

[1] https://hackernoon.com/apache-spark-tips-and-tricks-for-better-performance-cf2397cac11
[2] https://github.com/apache/spark/pull/7214
[3] https://github.com/apache/spark/pull/7236
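A rough sketch of what such a native expression could look like on a 2020-era Spark (2.4/3.0). The names SimilarityScore, MyFunctions and similarity_score are made up for illustration, and the body simply delegates to UTF8String.levenshteinDistance as a stand-in for the real scoring algorithm. Newer releases (3.2+) may also require overriding withNewChildrenInternal, and whether the Column(Expression) constructor is reachable from outside org.apache.spark.sql can vary by version, which is exactly the packaging question asked above.

import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Expression, ImplicitCastInputTypes}
import org.apache.spark.sql.types.{AbstractDataType, DataType, IntegerType, StringType}
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical native expression computing a score over two string columns.
case class SimilarityScore(left: Expression, right: Expression)
  extends BinaryExpression with ImplicitCastInputTypes {

  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType)
  override def dataType: DataType = IntegerType

  // Called only when both inputs are non-null. It operates directly on Spark's
  // internal UTF8String, so there is no UTF8String <-> java.lang.String
  // conversion the way there is around a registered Scala UDF.
  override protected def nullSafeEval(l: Any, r: Any): Any = {
    val a = l.asInstanceOf[UTF8String]
    val b = r.asInstanceOf[UTF8String]
    a.levenshteinDistance(b) // placeholder for the real scoring algorithm
  }

  // Whole-stage codegen hook: emits a plain Java method call instead of
  // falling back to interpreted evaluation.
  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
    defineCodeGen(ctx, ev, (a, b) => s"$a.levenshteinDistance($b)")
}

// Thin wrapper exposing the expression as a Column-returning function,
// mirroring the style of org.apache.spark.sql.functions.
object MyFunctions {
  def similarity_score(a: Column, b: Column): Column =
    new Column(SimilarityScore(a.expr, b.expr))
}

With that in place, the call site looks like any built-in function, e.g. df.withColumn("score", MyFunctions.similarity_score(col("a"), col("b"))). As Reynold notes above, the win comes from skipping the per-row serialization/deserialization around the UDF boundary, so a cheap function benefits far more than one that is CPU-bound in its own right.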