Is there any documentation/ sample about this besides the pull requests merged 
to spark core?

 

It seems that I need to create my custom functions under the package 
org.apache.spark.sql.* in order to be able to access some of the internal 
classes I saw in[1] such as Column[2]

 

Could you please confirm if that’s how it should be?

 

Thanks!

 

[1] https://github.com/apache/spark/pull/7214

[2] 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L37
 

 

From: Reynold Xin <r...@databricks.com> 
Sent: Wednesday, January 22, 2020 2:22 AM
To: em...@yeikel.com
Cc: dev@spark.apache.org
Subject: Re: [SQL] Is it worth it (and advisable) to implement native UDFs?

 

  
<https://r.superhuman.com/Dd8uXfQcJohyMvhLOn3aqVsTZa3RdFBJPsUUr_dlog2VG11E1e82IkBbF3kBBymivY9nQTEl6YyZ75qkdkrNKIab-ZiQZnpFKxBMbCD68X_aZP0tZFX2aKKjwD8BxV1YeeNquiifnXHyGLyK6BWyf37y0KtR1f6B03NV5eWY9Vh6iK8t-MwvNQ.gif>
 

If your UDF itself is very CPU intensive, it probably won't make that much of 
difference, because the UDF itself will dwarf the serialization/deserialization 
overhead.

 

If your UDF is cheap, it will help tremendously.

 

 

On Mon, Jan 20, 2020 at 6:33 PM, <em...@yeikel.com <mailto:em...@yeikel.com> > 
wrote:

Hi, 

 

I read online[1] that for a best UDF performance it is possible to implement 
them using internal Spark expressions, and I also saw a couple of pull requests 
such as [2] and [3] where this was put to practice (not sure if for that reason 
or just to extend the API). 

 

We have an algorithm that computes a score similar to what the Levenshtein 
distance does and it takes about 30%-40% of the overall time of our job. We are 
looking for ways to improve it without adding more resources.

 

I was wondering if it would be advisable to implement it extending 
BinaryExpression like[1] and if it would result in any performance gains. 

 

Thanks for your help!

 

[1] 
https://hackernoon.com/apache-spark-tips-and-tricks-for-better-performance-cf2397cac11
 

[2] https://github.com/apache/spark/pull/7214

[3] https://github.com/apache/spark/pull/7236

 

Reply via email to