This is a hack, really, and we don't recommend that users access internal
classes directly. That's why there is no public documentation.

If you really need to do it and are aware of the risks, you can read the
source code. All expressions (the so-called "native UDFs") extend the
base class `Expression`. You can read the code comments and look at some
existing implementations.
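
As a rough illustration only (this is written against the internals, so it is
not a stable API; the MyScore name and the scoring logic are placeholders, and
on newer Spark versions you may also need to implement methods such as
withNewChildrenInternal), a binary expression could look something like this:

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Expression, ImplicitCastInputTypes}
    import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
    import org.apache.spark.sql.types.{AbstractDataType, DataType, IntegerType, StringType}
    import org.apache.spark.unsafe.types.UTF8String

    // Placeholder expression: takes two string columns and returns an int score.
    case class MyScore(left: Expression, right: Expression)
      extends BinaryExpression with ImplicitCastInputTypes with CodegenFallback {

      override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType)
      override def dataType: DataType = IntegerType

      // Called only when both children evaluate to non-null values.
      override protected def nullSafeEval(l: Any, r: Any): Any = {
        val s1 = l.asInstanceOf[UTF8String].toString
        val s2 = r.asInstanceOf[UTF8String].toString
        // Replace with the real scoring logic; this placeholder only compares lengths.
        math.abs(s1.length - s2.length)
      }
    }

    // To call it from the DataFrame API, wrap the expression in a Column, e.g.:
    // def myScore(a: Column, b: Column): Column = new Column(MyScore(a.expr, b.expr))

Note that CodegenFallback skips codegen and simply calls eval at runtime; for
the best performance you would implement doGenCode instead (look at an existing
expression such as Levenshtein in the catalyst string expressions for a real
example). Also, as far as I can tell, new Column(expr) and col.expr are public,
so the wrapper does not have to live under org.apache.spark.sql, but please
verify that against the Spark version you use.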

On Wed, Feb 5, 2020 at 11:11 AM <em...@yeikel.com> wrote:

> Is there any documentation or sample about this besides the pull requests
> merged into Spark core?
>
>
>
> It seems that I need to create my custom functions under the package
> *org.apache.spark.sql.** in order to be able to access some of the
> internal classes I saw in [1], such as Column [2].
>
>
>
> Could you please confirm if that’s how it should be?
>
>
>
> Thanks!
>
>
>
> [1] https://github.com/apache/spark/pull/7214
>
> [2]
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L37
>
>
>
> *From:* Reynold Xin <r...@databricks.com>
> *Sent:* Wednesday, January 22, 2020 2:22 AM
> *To:* em...@yeikel.com
> *Cc:* dev@spark.apache.org
> *Subject:* Re: [SQL] Is it worth it (and advisable) to implement native
> UDFs?
>
>
>
> If your UDF itself is very CPU intensive, it probably won't make that much
> of a difference, because the UDF itself will dwarf the
> serialization/deserialization overhead.
>
>
>
> If your UDF is cheap, it will help tremendously.
>
>
>
>
>
> On Mon, Jan 20, 2020 at 6:33 PM, <em...@yeikel.com> wrote:
>
> Hi,
>
>
>
> I read online [1] that for the best UDF performance it is possible to
> implement them using internal Spark expressions, and I also saw a couple of
> pull requests, such as [2] and [3], where this was put into practice (not
> sure if for that reason or just to extend the API).
>
>
>
> We have an algorithm that computes a score similar to the Levenshtein
> distance, and it takes about 30%-40% of the overall runtime of our job. We
> are looking for ways to improve it without adding more resources.
>
>
>
> I was wondering whether it would be advisable to implement it by extending
> BinaryExpression, as in [1], and whether it would result in any performance
> gains.
>
>
>
> Thanks for your help!
>
>
>
> [1]
> https://hackernoon.com/apache-spark-tips-and-tricks-for-better-performance-cf2397cac11
>
> [2] https://github.com/apache/spark/pull/7214
>
> [3] https://github.com/apache/spark/pull/7236
>
>
>
