[jira] [Resolved] (SPARK-16954) UDFs should allow output type to be specified in terms of the input type

Hyukjin Kwon (JIRA) Mon, 20 May 2019 22:10:09 -0700


     [ https://issues.apache.org/jira/browse/SPARK-16954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Hyukjin Kwon resolved SPARK-16954.
----------------------------------
    Resolution: Incomplete

> UDFs should allow output type to be specified in terms of the input type
> ------------------------------------------------------------------------
>
>                 Key: SPARK-16954
>                 URL: https://issues.apache.org/jira/browse/SPARK-16954
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Simeon Simeonov
>            Priority: Major
>              Labels: SQL, UDF, bulk-closed, types
>
> Consider an {{array_compact}} UDF that removes {{null}} values from an array. 
> There is no easy way to implement this UDF because an explicit return type 
> with a {{TypeTag}} is required by code generation.
> The interesting observation here is that the output type of {{array_compact}} 
> is the same as its input type. In general, there is a broad class of UDFs, 
> especially collection-oriented ones, whose output types are functions of the 
> input types. In our Spark work we have found collection manipulation UDFs to 
> be very powerful for cleaning up data and substantially improving 
> performance, in particular, avoiding {{explode}} followed by {{groupBy}}. It 
> would be nice if Spark made adding these types of UDFs very easy.
> I won't go into possible ways to implement this under the covers as there are 
> many options but I do want to point out that it is possible to communicate 
> the right type information to Spark without changing the signature for UDF 
> registration using placeholder types, e.g.,
> {code}
> import org.apache.spark.sql.Row
> import scala.reflect.runtime.universe.TypeTag
> // Placeholder types that refer to a UDF argument by its position
> sealed trait UDFArgumentAtPosition
> case class ArgPos1() extends UDFArgumentAtPosition
> case class ArgPos2() extends UDFArgumentAtPosition
> // ...
> // Wrappers that describe the output as a part of the argument at ArgPos
> case class Struct[ArgPos <: UDFArgumentAtPosition](value: Row)
> case class ArrayElement[ArgPos <: UDFArgumentAtPosition, A : TypeTag](value: A)
> case class MapKey[ArgPos <: UDFArgumentAtPosition, A : TypeTag](value: A)
> case class MapValue[ArgPos <: UDFArgumentAtPosition, A : TypeTag](value: A)
> // Functions are stubbed just to show compilation succeeds
> def arrayCompact[A : TypeTag](xs: Seq[A]): ArgPos1 = null
> def arraySum[A : Numeric : TypeTag](xs: Seq[A]): ArrayElement[ArgPos1, A] =
>   ArrayElement(implicitly[Numeric[A]].zero)
> {code}
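
The placeholder idea quoted above leaves open how an engine would turn a placeholder into a concrete output type. The following is a minimal, Spark-free sketch of one possible resolution step, under the assumption that the engine knows the types of the UDF's arguments at analysis time. Every name in it (PlaceholderSketch, DType, OutputSpec, resolve, and the case classes) is hypothetical and chosen for illustration; none of them is a Spark API, and Spark would use its own type hierarchy instead.

```scala
// Hypothetical, Spark-free model of the placeholder idea; none of these names are Spark APIs.
object PlaceholderSketch {
  // A tiny stand-in for a SQL type system.
  sealed trait DType
  case object IntT extends DType
  case object StringT extends DType
  final case class ArrayT(element: DType) extends DType
  final case class MapT(key: DType, value: DType) extends DType

  // Output type expressed relative to the argument at a 1-based position.
  sealed trait OutputSpec
  final case class SameAsArg(pos: Int) extends OutputSpec    // cf. ArgPos1
  final case class ElementOfArg(pos: Int) extends OutputSpec // cf. ArrayElement[ArgPos1, A]
  final case class KeyOfArg(pos: Int) extends OutputSpec     // cf. MapKey[ArgPos1, A]
  final case class ValueOfArg(pos: Int) extends OutputSpec   // cf. MapValue[ArgPos1, A]

  // Resolve a placeholder against the actual argument types, or explain why it cannot be resolved.
  def resolve(spec: OutputSpec, args: Seq[DType]): Either[String, DType] = {
    def arg(pos: Int): Either[String, DType] =
      args.lift(pos - 1).toRight(s"no argument at position $pos")

    spec match {
      case SameAsArg(p) => arg(p)
      case ElementOfArg(p) => arg(p) match {
        case Right(ArrayT(e)) => Right(e)
        case Right(other)     => Left(s"argument $p is not an array: $other")
        case Left(err)        => Left(err)
      }
      case KeyOfArg(p) => arg(p) match {
        case Right(MapT(k, _)) => Right(k)
        case Right(other)      => Left(s"argument $p is not a map: $other")
        case Left(err)         => Left(err)
      }
      case ValueOfArg(p) => arg(p) match {
        case Right(MapT(_, v)) => Right(v)
        case Right(other)      => Left(s"argument $p is not a map: $other")
        case Left(err)         => Left(err)
      }
    }
  }
}
```

In this model, a UDF like array_compact would declare SameAsArg(1) as its output and an array sum would declare ElementOfArg(1); calling resolve with the argument types seen in a query then yields the concrete output type, or an error message when the argument has the wrong shape.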



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

