For a general-purpose code example, you may take a look at the class we defined in Transport UDFs to express all Expression UDFs [1]. This is an internal class, though, not a user-facing API. A user-facing UDF example is in [2]; it leverages [1] behind the scenes.
[1] https://github.com/linkedin/transport/blob/master/transportable-udfs-spark/src/main/scala/com/linkedin/transport/spark/StdUdfWrapper.scala
[2] https://github.com/linkedin/transport/blob/master/transportable-udfs-examples/transportable-udfs-example-udfs/src/main/java/com/linkedin/transport/examples/MapFromTwoArraysFunction.java

Thanks,
Walaa.

On Wed, Feb 5, 2020 at 12:06 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> This is a hack, really, and we don't recommend that users access internal
> classes directly. That's why there is no public documentation.
>
> If you really need to do it and are aware of the risks, you can read the
> source code. All expressions (the so-called "native UDFs") extend the
> base class `Expression`. You can read the code comments and look at some
> implementations.
>
> On Wed, Feb 5, 2020 at 11:11 AM <em...@yeikel.com> wrote:
>
>> Is there any documentation or sample about this besides the pull requests
>> merged into Spark core?
>>
>> It seems that I need to create my custom functions under the package
>> org.apache.spark.sql.* in order to be able to access some of the
>> internal classes I saw in [1], such as Column [2].
>>
>> Could you please confirm whether that's how it should be done?
>>
>> Thanks!
>>
>> [1] https://github.com/apache/spark/pull/7214
>> [2] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L37
>>
>> *From:* Reynold Xin <r...@databricks.com>
>> *Sent:* Wednesday, January 22, 2020 2:22 AM
>> *To:* em...@yeikel.com
>> *Cc:* dev@spark.apache.org
>> *Subject:* Re: [SQL] Is it worth it (and advisable) to implement native UDFs?
>>
>> If your UDF itself is very CPU-intensive, it probably won't make much of a
>> difference, because the UDF itself will dwarf the
>> serialization/deserialization overhead.
>>
>> If your UDF is cheap, it will help tremendously.
>>
>> On Mon, Jan 20, 2020 at 6:33 PM, <em...@yeikel.com> wrote:
>>
>> Hi,
>>
>> I read online [1] that for best UDF performance it is possible to
>> implement them using internal Spark expressions, and I also saw a couple
>> of pull requests, such as [2] and [3], where this was put into practice
>> (not sure if for that reason or just to extend the API).
>>
>> We have an algorithm that computes a score similar to the Levenshtein
>> distance, and it takes about 30%-40% of the overall time of our job. We
>> are looking for ways to improve it without adding more resources.
>>
>> I was wondering whether it would be advisable to implement it by
>> extending BinaryExpression like [1], and whether it would result in any
>> performance gains.
>>
>> Thanks for your help!
>>
>> [1] https://hackernoon.com/apache-spark-tips-and-tricks-for-better-performance-cf2397cac11
>> [2] https://github.com/apache/spark/pull/7214
>> [3] https://github.com/apache/spark/pull/7236
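Since the thread centers on a score similar to the Levenshtein distance, a minimal baseline of that computation may be useful context. The sketch below is plain Java with no Spark dependency; the class name `EditScore` is hypothetical. A native implementation would wrap logic like this in a class extending `BinaryExpression` (overriding `nullSafeEval`, and ideally `doGenCode` for codegen), as discussed above. Note also that for the exact Levenshtein case Spark already ships a built-in `levenshtein` SQL function, so a custom expression is only needed for a modified score.

```java
// Illustrative only: the classic two-row dynamic-programming Levenshtein
// distance, standing in for the custom CPU-bound score from the thread.
// The class name EditScore is hypothetical, not a Spark API.
public class EditScore {
    public static int levenshtein(String a, String b) {
        if (a.isEmpty()) return b.length();
        if (b.isEmpty()) return a.length();
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // prev holds row i-1 of the DP table; row 0 is 0..b.length().
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                // Minimum of insertion, deletion, and substitution.
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Logic of this shape (tight loops over string pairs) is CPU-bound, which matches Reynold's point earlier in the thread: a native expression mainly removes serialization/deserialization overhead, so for an expensive score the gains would likely be modest.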