Hi all, I've been trying for the last couple of days to define a UDF which takes in a deeply nested Row object and performs some extraction to pull out a portion of of the Row and return it. This row object is nested not just with StructTypes but a bunch of ArrayTypes and MapTypes. From this complex Row object I pull out what is essentially still a nested row with multiple levels of depth. I'm just partially extracting the nested row.
After trying a variety of things, I've teased out the following constraints: - UDFs can pretty much only take in Primitives, Seqs, Maps and Row objects as parameters. I cannot take in a case class object in place of the corresponding Row object, even if the schema matches because the Row object will always be passed in at Runtime and it will yield a ClassCastException. - UDFs cannot return a Row object (understandably because you have no idea of the schema being output in that case) but they can return a case object which I can only assume will be restructured as a StructType after being returned. This results in a really weird/non-performant workflow when writing my UDF because I'd have to perform all of my extraction on the Row objects. Then convert that Row object to a case object manually (extracting everything from the row once I've extracted the subobjects and building the deeply nested case object) only to return it and have Spark convert that case object to a Row object. Before I go ahead and do that I have a few questions: 1) Is there any way to return a Row object in scala from a UDF and specify the known schema that would be returned at UDF registration time? In python/java this seems to be the case because you need to explicitly specify return DataType of your UDF but using scala functions this isn't possible. I guess I could use the Java UDF1/2/3... API but I wanted to see if there was a first class scala way to do this. 2) Is Spark actually converting the returned case class object when the UDF is called, or does it use the fact that it's essentially "Product" to efficiently coerce it to a Row in some way? 2.1) If this is the case, we could just take in a case object as a parameter (rather than a Row) and perform manipulation on that and return it. Is this explicitly something we avoided? Thanks, Hamel