Hi Michael,

Thanks for the response. I am just extracting part of the nested structure and returning only a piece of that same structure.
I haven't looked at Encoders or Datasets since we're bound to 1.6 for now, but I'll look at Encoders to see if that covers it. Datasets seems like it would solve this problem for sure.

I avoided returning a case class instance because, even if we use reflection to build bytecode and do it efficiently, I still need to convert my Row to a case class manually within my UDF, just to have it converted back to a Row again. Even if it's fast, it's still fairly unnecessary.

The thing that threw me off was that UDF1/2/3 lives in a "java"-prefixed package, although there is nothing Java-specific about it, and in fact it was the only way to do what I wanted in Scala. For things like JavaRDD that makes sense, but for generic things like UDF, is there a reason they get put into a package with "java" in the name?

Regards,
Hamel

On Wed, Mar 30, 2016 at 3:47 PM Michael Armbrust <mich...@databricks.com> wrote:
> Some answers and more questions inline
>
>> - UDFs can pretty much only take in primitives, Seqs, Maps, and Row objects
>> as parameters. I cannot take in a case class instance in place of the
>> corresponding Row object, even if the schema matches, because the Row object
>> will always be passed in at runtime and it will yield a ClassCastException.
>
> This is true today, but could be improved using the new encoder
> framework. Out of curiosity, have you looked at that? If so, what is
> missing that's leading you back to UDFs?
>
>> Is there any way to return a Row object in Scala from a UDF and specify
>> the known schema that would be returned at UDF registration time? In
>> Python/Java this seems to be the case because you need to explicitly
>> specify the return DataType of your UDF, but using Scala functions this isn't
>> possible. I guess I could use the Java UDF1/2/3... API, but I wanted to see
>> if there was a first-class Scala way to do this.
>
> I think UDF1/2/3 are the only way to do this today. Is the problem here
> that you are only changing a subset of the nested data and want to
> preserve the structure? What kind of changes are you making?
>
>>> 2) Is Spark actually converting the returned case class instance when the
>>> UDF is called, or does it use the fact that it's essentially a "Product" to
>>> efficiently coerce it to a Row in some way?
>
> We use reflection to figure out the schema and extract the data into the
> internal row format. We actually build bytecode for this at runtime in many
> cases (though not all yet), so it can be pretty fast.
>
>> 2.1) If this is the case, we could just take in a case class instance as a
>> parameter (rather than a Row), perform the manipulation on that, and return
>> it. Is this explicitly something we avoided?
>
> You can do this with Datasets:
>
> df.as[CaseClass].map(o => do stuff)
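For the archive: a minimal sketch of the UDF1/2/3 route discussed above, assuming Spark 1.6; the schema, field names, and UDF name here are invented for illustration, not taken from the thread:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.api.java.UDF1
    import org.apache.spark.sql.types._

    // Hypothetical schema of the nested struct the UDF will return
    val innerSchema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("name", StringType, nullable = true)
    ))

    // Registering a Java-style UDF from Scala is what allows the return
    // DataType to be declared explicitly; a plain Scala function UDF would
    // have to infer it from the return type, which is impossible for Row.
    sqlContext.udf.register("extractInner", new UDF1[Row, Row] {
      override def call(outer: Row): Row =
        // Pull the nested struct (assumed to sit at field 0) out of the
        // outer Row; the declared innerSchema tells Spark what comes back.
        outer.getStruct(0)
    }, innerSchema)

The registered UDF can then be used from SQL or selectExpr, e.g. df.selectExpr("extractInner(outerCol)"), where "outerCol" stands in for the real struct column.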
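And to make the Dataset one-liner at the end concrete, a small sketch assuming 1.6's experimental Dataset API and an invented case class:

    import sqlContext.implicits._

    // Invented case class standing in for the real nested structure
    case class Event(id: Long, name: String)

    // as[Event] derives an Encoder for the case class, so the manual
    // Row <-> case class conversion inside a UDF goes away entirely.
    val updated = df.as[Event].map(e => e.copy(id = e.id + 1))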