These APIs predate Datasets and encoders, which is why they take a Row instead of typed objects. We should probably rethink that.

Honestly, I usually end up using the column expression version of explode now that it exists (i.e. explode($"arrayCol").as("Item")). It would be great to understand more why you are using these instead.

On Wed, May 25, 2016 at 8:49 AM, Koert Kuipers <ko...@tresata.com> wrote:

> we currently have 2 explode definitions in Dataset:
>
> def explode[A <: Product : TypeTag](input: Column*)(f: Row => TraversableOnce[A]): DataFrame
>
> def explode[A, B : TypeTag](inputColumn: String, outputColumn: String)(f: A => TraversableOnce[B]): DataFrame
>
> 1) the separation of the functions into their own argument lists is nice,
> but unfortunately scala's type inference doesn't handle this well, meaning
> that the generic types always have to be explicitly provided. i assume this
> was done to allow the "input" to be a varargs in the first method, and then
> kept the same in the second for reasons of symmetry.
>
> 2) i am surprised the first definition returns a DataFrame. this seems to
> suggest DataFrame usage (so DataFrame to DataFrame), but there is no way to
> specify the output column names, which limits its usability for DataFrames.
> i frequently end up using the first definition for DataFrames anyhow
> because of the need to return more than 1 column (and the data has columns
> unknown at compile time that i need to carry along making flatMap on
> Dataset clumsy/unusable), but relying on the output columns being called _1
> and _2 and renaming then afterwards seems like an anti-pattern.
>
> 3) using Row objects isn't very pretty. why not f: A => TraversableOnce[B]
> or something like that for the first definition? how about:
> def explode[A: TypeTag, B: TypeTag](input: Seq[Column], output: Seq[Column])(f: A => TraversableOnce[B]): DataFrame
>
> best,
> koert
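
For reference, here is a minimal sketch of the column-expression form of explode mentioned in the reply. It assumes Spark 2.x running with a local SparkSession. The sample data, the "id" column, and the object name are made up for illustration; only explode($"arrayCol").as("Item") comes from the message itself.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

// Hypothetical object name, used only for this illustration.
object ExplodeColumnSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for a quick experiment.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-column-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up input: an id plus an array column to be exploded.
    val df = Seq(
      (1, Seq("a", "b")),
      (2, Seq("c"))
    ).toDF("id", "arrayCol")

    // explode yields one output row per array element; the other selected
    // columns are carried along, and the new column is named explicitly.
    val exploded = df.select($"id", explode($"arrayCol").as("Item"))
    exploded.show()

    spark.stop()
  }
}
```

Because the output column is named with .as(...), there is no need to rely on generated names such as _1 and _2 and rename them afterwards.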