Some answers and more questions inline.

> UDFs can pretty much only take in Primitives, Seqs, Maps and Row objects
> as parameters. I cannot take in a case class object in place of the
> corresponding Row object, even if the schema matches, because the Row object
> will always be passed in at runtime and it will yield a ClassCastException.

This is true today, but could be improved using the new encoder framework. Out of curiosity, have you looked at that? If so, what is missing that's leading you back to UDFs?

> Is there any way to return a Row object in scala from a UDF and specify the
> known schema that would be returned at UDF registration time? In
> python/java this seems to be the case because you need to explicitly
> specify the return DataType of your UDF, but using scala functions this isn't
> possible. I guess I could use the Java UDF1/2/3... API but I wanted to see
> if there was a first-class scala way to do this.

I think UDF1/2/3 are the only way to do this today. Is the problem here that you are only changing a subset of the nested data and you want to preserve the structure? What kind of changes are you doing?

> 2) Is Spark actually converting the returned case class object when the UDF
> is called, or does it use the fact that it's essentially "Product" to
> efficiently coerce it to a Row in some way?

We use reflection to figure out the schema and extract the data into the internal row format. We actually build bytecode for this at runtime in many cases (though not all yet), so it can be pretty fast.

> 2.1) If this is the case, we could just take in a case object as a
> parameter (rather than a Row) and perform manipulation on that and return
> it. Is this explicitly something we avoided?

You can do this with Datasets: df.as[CaseClass].map(o => do stuff)
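For concreteness, here is a minimal sketch of that Dataset approach, assuming Spark 2.x; the Person case class, column names, and the age-bumping logic are made up for illustration and stand in for whatever nested structure you actually have:

import org.apache.spark.sql.SparkSession

// Hypothetical case class standing in for your real schema.
case class Person(name: String, age: Int)

object DatasetMapSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-map-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(Person("Alice", 29), Person("Bob", 31)).toDF()

    // as[Person] uses the encoder framework to hand you typed objects;
    // the case class returned from map is converted back to rows for you.
    val bumped = df.as[Person].map(p => p.copy(age = p.age + 1))
    bumped.show()

    spark.stop()
  }
}

The map runs over typed objects, so you never touch Row at all, which is essentially the 2.1 scenario above.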
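And for the UDF1/2/3 route mentioned earlier, a rough sketch of registering a Java-style UDF from Scala with an explicit return schema might look like the following. Field and column names are made up, and it assumes a SparkSession named spark is already in scope:

import org.apache.spark.sql.Row
import org.apache.spark.sql.api.java.UDF1
import org.apache.spark.sql.types._

// Schema of the Row the UDF returns, declared up front at registration time.
val personSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)
))

// Takes a struct (Row) and returns a struct (Row) with the age incremented.
val bumpAge = new UDF1[Row, Row] {
  override def call(r: Row): Row = Row(r.getString(0), r.getInt(1) + 1)
}

spark.udf.register("bumpAge", bumpAge, personSchema)

// Then usable from SQL / selectExpr, e.g.:
// df.selectExpr("bumpAge(person) AS person")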
This is true today, but could be improved using the new encoder framework. Out of curiosity, have you look at that? If so, what is missing thats leading you back to UDFs. Is there any way to return a Row object in scala from a UDF and specify the > known schema that would be returned at UDF registration time? In > python/java this seems to be the case because you need to explicitly > specify return DataType of your UDF but using scala functions this isn't > possible. I guess I could use the Java UDF1/2/3... API but I wanted to see > if there was a first class scala way to do this. I think UDF1/2/3 are the only way to do this today. Is the problem here that you are only changing a subset of the nested data and you want to preserve the structure. What kind of changes are you doing? 2) Is Spark actually converting the returned case class object when the UDF >> is called, or does it use the fact that it's essentially "Product" to >> efficiently coerce it to a Row in some way? >> > We use reflection to figure out the schema and extract the data into the internal row format. We actually runtime build bytecode for this in many cases (though not all yet) so it can be pretty fast. > 2.1) If this is the case, we could just take in a case object as a > parameter (rather than a Row) and perform manipulation on that and return > it. Is this explicitly something we avoided? You can do this with Datasets: df.as[CaseClass].map(o => do stuff)