Just to clarify, this is possible via UDF1/2/3 etc. and registering those
with the desired return schema. It just felt wrong that the only way to do
this in Scala was to use these classes, which live in the Java package.
Maybe the relevant question is: why are these in a Java package?
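
For reference, a minimal sketch of the Java-API route from Scala (the
schema, the "payload" column, and the extraction logic are placeholders;
assumes a SQLContext named sqlContext and a DataFrame named df are in
scope):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.api.java.UDF1
    import org.apache.spark.sql.types._

    // Hypothetical schema of the extracted sub-row.
    val outSchema = StructType(Seq(
      StructField("id", LongType),
      StructField("name", StringType)))

    // Register the UDF with an explicit return schema, so it can
    // return a Row directly.
    sqlContext.udf.register("extract", new UDF1[Row, Row] {
      override def call(in: Row): Row = in.getStruct(0)
    }, outSchema)

    df.selectExpr("extract(payload)")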

On Wed, Mar 30, 2016 at 11:47 AM Hamel Kothari <hamelkoth...@gmail.com>
wrote:

> Hi all,
>
> I've been trying for the last couple of days to define a UDF which takes
> in a deeply nested Row object and performs some extraction to pull out a
> portion of the Row and return it. This Row object is nested not just with
> StructTypes but also with a bunch of ArrayTypes and MapTypes. From this
> complex Row object I pull out what is essentially still a nested row with
> multiple levels of depth; I'm just partially extracting the nested row.
>
> After trying a variety of things, I've teased out the following
> constraints:
>
> - UDFs can pretty much only take in primitives, Seqs, Maps and Row objects
> as parameters. I cannot take in a case class instance in place of the
> corresponding Row object, even if the schema matches, because the Row
> object will always be passed in at runtime and it will yield a
> ClassCastException.
> - UDFs cannot return a Row object (understandably, because Spark has no
> idea of the schema being output in that case), but they can return a case
> class instance, which I can only assume is converted back into a
> StructType after being returned. A sketch of the shape that does work
> follows this list.
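>
> For concreteness, a minimal sketch (Inner, its fields, and the "payload"
> column are placeholders):
>
>     import org.apache.spark.sql.Row
>     import org.apache.spark.sql.functions.udf
>
>     case class Inner(id: Long, tags: Seq[String])
>
>     // Takes the nested struct in as a Row, manually rebuilds a case
>     // class instance, and returns it; Spark converts the result back
>     // into a StructType column.
>     val extractInner = udf { (r: Row) =>
>       Inner(r.getLong(0), r.getSeq[String](1))
>     }
>
>     df.select(extractInner(df("payload")))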
>
> This results in a really weird/non-performant workflow when writing my
> UDF, because I'd have to perform all of my extraction on the Row objects,
> then manually convert that Row object to a case class instance (extracting
> everything from the row once I've extracted the sub-objects and building
> the deeply nested case class instance), only to return it and have Spark
> convert it straight back to a Row object. Before I go ahead and do that,
> I have a few questions:
>
> 1) Is there any way to return a Row object in Scala from a UDF and specify
> the known schema that would be returned at UDF registration time? In
> Python/Java this seems to be the case, because you need to explicitly
> specify the return DataType of your UDF, but with plain Scala functions
> this isn't possible. I guess I could use the Java UDF1/2/3... API, but I
> wanted to see if there was a first-class Scala way to do this.
>
> 2) Is Spark actually converting the returned case class instance when the
> UDF is called, or does it use the fact that it's essentially a "Product"
> to efficiently coerce it to a Row in some way?
>     2.1) If this is the case, we could just take in a case class instance
> as a parameter (rather than a Row), perform manipulation on that, and
> return it. Is this explicitly something we avoided?
>
> Thanks,
> Hamel
>
