Spark SQL UDF Returning Rows

Hamel Kothari Wed, 30 Mar 2016 08:48:06 -0700

Hi all,

I've been trying for the last couple of days to define a UDF which takes in
a deeply nested Row object and performs some extraction to pull out a
portion of of the Row and return it. This row object is nested not just
with StructTypes but a bunch of ArrayTypes and MapTypes. From this complex
Row object I pull out what is essentially still a nested row with multiple
levels of depth. I'm just partially extracting the nested row.


After trying a variety of things, I've teased out the following constraints:

- UDFs can pretty much only take in Primitives, Seqs, Maps and Row objects
as parameters. I cannot take in a case class object in place of the
corresponding Row object, even if the schema matches because the Row object
will always be passed in at Runtime and it will yield a ClassCastException.
- UDFs cannot return a Row object (understandably because you have no idea
of the schema being output in that case) but they can return a case object
which I can only assume will be restructured as a StructType after being
returned.

This results in a really weird/non-performant workflow when writing my UDF
because I'd have to perform all of my extraction on the Row objects. Then
convert that Row object to a case object manually (extracting everything
from the row once I've extracted the subobjects and building the deeply
nested case object) only to return it and have Spark convert that case
object to a Row object. Before I go ahead and do that I have a few
questions:

1) Is there any way to return a Row object in scala from a UDF and specify
the known schema that would be returned at UDF registration time? In
python/java this seems to be the case because you need to explicitly
specify return DataType of your UDF but using scala functions this isn't
possible. I guess I could use the Java UDF1/2/3... API but I wanted to see
if there was a first class scala way to do this.

2) Is Spark actually converting the returned case class object when the UDF
is called, or does it use the fact that it's essentially "Product" to
efficiently coerce it to a Row in some way?
    2.1) If this is the case, we could just take in a case object as a
parameter (rather than a Row) and perform manipulation on that and return
it. Is this explicitly something we avoided?

Thanks,
Hamel

Spark SQL UDF Returning Rows

Reply via email to