For a general-purpose code example, you can take a look at the class we
defined in Transport UDFs to express all Expression UDFs [1]. Note that this
is an internal class, not a user-facing API. A user-facing UDF example is in
[2]; it leverages [1] behind the scenes.

[1]
https://github.com/linkedin/transport/blob/master/transportable-udfs-spark/src/main/scala/com/linkedin/transport/spark/StdUdfWrapper.scala
[2]
https://github.com/linkedin/transport/blob/master/transportable-udfs-examples/transportable-udfs-example-udfs/src/main/java/com/linkedin/transport/examples/MapFromTwoArraysFunction.java
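Whichever route you take (Transport or a raw Catalyst expression), the per-row scoring logic itself is ordinary code. Below is a minimal, hypothetical sketch of a Levenshtein-style scorer in plain Java (the class and method names are my own, not from either repo above); in a native expression, this is the kind of logic that would run inside the expression's eval path rather than behind UDF (de)serialization:

```java
// Hypothetical sketch (not from the linked repos): the kind of per-row
// scoring logic discussed in this thread. A native Spark expression would
// run logic like this directly in its eval path instead of paying the
// Scala/Java UDF serialization/deserialization overhead.
public final class LevenshteinSketch {

    // Classic dynamic-programming edit distance: O(|a| * |b|) time,
    // O(min(|a|, |b|)) space using a single rolling row.
    public static int distance(String a, String b) {
        // Keep b as the shorter string so the rolling row stays small.
        if (a.length() < b.length()) { String t = a; a = b; b = t; }
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int diag = prev[0];      // dp[i-1][j-1] before overwriting
            prev[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int tmp = prev[j];   // dp[i-1][j], becomes next diagonal
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                prev[j] = Math.min(Math.min(prev[j] + 1,      // deletion
                                            prev[j - 1] + 1), // insertion
                                   diag + cost);              // substitution
                diag = tmp;
            }
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // expect 3
    }
}
```

Note that Spark already ships a built-in `levenshtein` function, so a custom native expression is only worth writing when the scoring differs from the built-ins, as in the original question.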

Thanks,
Walaa.

On Wed, Feb 5, 2020 at 12:06 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> This is really a hack, and we don't recommend that users access internal
> classes directly. That's why there is no public documentation.
>
> If you really need to do this and are aware of the risks, you can read the
> source code. All expressions (the so-called "native UDFs") extend the
> base class `Expression`. You can read the code comments and look at some
> existing implementations.
>
> On Wed, Feb 5, 2020 at 11:11 AM <em...@yeikel.com> wrote:
>
>> Is there any documentation or sample about this besides the pull requests
>> merged into Spark core?
>>
>>
>>
>> It seems that I need to create my custom functions under the package
>> *org.apache.spark.sql.** in order to access some of the internal classes
>> I saw in [1], such as Column [2].
>>
>>
>>
>> Could you please confirm if that’s how it should be?
>>
>>
>>
>> Thanks!
>>
>>
>>
>> [1] https://github.com/apache/spark/pull/7214
>>
>> [2]
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L37
>>
>>
>>
>> *From:* Reynold Xin <r...@databricks.com>
>> *Sent:* Wednesday, January 22, 2020 2:22 AM
>> *To:* em...@yeikel.com
>> *Cc:* dev@spark.apache.org
>> *Subject:* Re: [SQL] Is it worth it (and advisable) to implement native
>> UDFs?
>>
>>
>>
>> If your UDF itself is very CPU-intensive, it probably won't make much of
>> a difference, because the UDF itself will dwarf the
>> serialization/deserialization overhead.
>>
>>
>>
>> If your UDF is cheap, it will help tremendously.
>>
>>
>>
>>
>>
>> On Mon, Jan 20, 2020 at 6:33 PM, <em...@yeikel.com> wrote:
>>
>> Hi,
>>
>>
>>
>> I read online [1] that for the best UDF performance it is possible to
>> implement UDFs using internal Spark expressions, and I also saw a couple
>> of pull requests, such as [2] and [3], where this was put into practice
>> (not sure if for that reason or just to extend the API).
>>
>>
>>
>> We have an algorithm that computes a score similar to what the Levenshtein
>> distance does, and it takes about 30%-40% of the overall runtime of our
>> job. We are looking for ways to improve it without adding more resources.
>>
>>
>>
>> I was wondering if it would be advisable to implement it by extending
>> BinaryExpression, as in [1], and whether that would result in any
>> performance gains.
>>
>>
>>
>> Thanks for your help!
>>
>>
>>
>> [1]
>> https://hackernoon.com/apache-spark-tips-and-tricks-for-better-performance-cf2397cac11
>>
>> [2] https://github.com/apache/spark/pull/7214
>>
>> [3] https://github.com/apache/spark/pull/7236
>>
>>
>>
>
