Created JIRA ticket: https://issues.apache.org/jira/browse/SPARK-15533

@Koert - Please keep the API feedback coming. One thing: in the future, can
you send API feedback to the dev@ list instead of user@?



On Wed, May 25, 2016 at 1:05 PM, Cheng Lian <l...@databricks.com> wrote:

> Agreed, since they can easily be replaced by .flatMap (to do the explosion)
> and .select (to rename the output columns)
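A minimal sketch of the replacement Cheng describes: the typed explode expressed as flatMap followed by a rename. The case class, data, and column names here are illustrative, not from the thread.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

case class Doc(id: Int, text: String)
val docs = Seq(Doc(1, "hello world"), Doc(2, "spark")).toDS()

// Instead of docs.explode("text", "word")((s: String) => s.split(" ")):
val words = docs
  .flatMap(d => d.text.split(" ").map(w => (d.id, w)))
  .toDF("id", "word") // rename the default _1/_2 columns
```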
>
> Cheng
>
>
> On 5/25/16 12:30 PM, Reynold Xin wrote:
>
> Based on this discussion I'm thinking we should deprecate the two explode
> functions.
>
> On Wednesday, May 25, 2016, Koert Kuipers <ko...@tresata.com> wrote:
>
>> wenchen,
>> that definition of explode seems identical to flatMap, so you don't need
>> it either?
>>
>> michael,
>> i didn't know about the column-expression version of explode; that makes
>> sense. i will experiment with that instead.
>>
>> On Wed, May 25, 2016 at 3:03 PM, Wenchen Fan <wenc...@databricks.com>
>> wrote:
>>
>>> I think we only need this version:  `def explode[B : Encoder](f: A
>>> => TraversableOnce[B]): Dataset[B]`
>>>
>>> For untyped one, `df.select(explode($"arrayCol").as("item"))` should be
>>> the best choice.
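A sketch of the untyped form Wenchen suggests, assuming a DataFrame with an array column named "arrayCol" (names illustrative, and `spark.implicits._` in scope for the `$` syntax):

```scala
import org.apache.spark.sql.functions.explode

val df = Seq(Seq(1, 2), Seq(3)).toDF("arrayCol")
// One output row per array element, with a caller-chosen column name:
val items = df.select(explode($"arrayCol").as("item"))
```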
>>>
>>> On Wed, May 25, 2016 at 11:55 AM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> These APIs predate Datasets / encoders, so that is why they are Row
>>>> instead of objects.  We should probably rethink that.
>>>>
>>>> Honestly, I usually end up using the column expression version of
>>>> explode now that it exists (i.e. explode($"arrayCol").as("Item")).  It
>>>> would be great to understand more why you are using these instead.
>>>>
>>>> On Wed, May 25, 2016 at 8:49 AM, Koert Kuipers <ko...@tresata.com>
>>>> wrote:
>>>>
>>>>> we currently have 2 explode definitions in Dataset:
>>>>>
>>>>>  def explode[A <: Product : TypeTag](input: Column*)(f: Row =>
>>>>> TraversableOnce[A]): DataFrame
>>>>>
>>>>>  def explode[A, B : TypeTag](inputColumn: String, outputColumn:
>>>>> String)(f: A => TraversableOnce[B]): DataFrame
>>>>>
>>>>> 1) the separation of the functions into their own argument lists is
>>>>> nice, but unfortunately scala's type inference doesn't handle this well,
>>>>> meaning that the generic types always have to be provided explicitly. i
>>>>> assume this was done to allow "input" to be varargs in the first
>>>>> method, and then kept the same in the second for symmetry.
>>>>>
>>>>> 2) i am surprised the first definition returns a DataFrame. this seems
>>>>> to suggest DataFrame usage (so DataFrame to DataFrame), but there is no
>>>>> way to specify the output column names, which limits its usability for
>>>>> DataFrames. i frequently end up using the first definition for DataFrames
>>>>> anyhow because i need to return more than one column (and the data has
>>>>> columns unknown at compile time that i need to carry along, making
>>>>> flatMap on Dataset clumsy/unusable), but relying on the output columns
>>>>> being called _1 and _2 and renaming them afterwards seems like an
>>>>> anti-pattern.
>>>>>
>>>>> 3) using Row objects isn't very pretty. why not f: A =>
>>>>> TraversableOnce[B] or something like that for the first definition? how
>>>>> about:
>>>>>  def explode[A: TypeTag, B: TypeTag](input: Seq[Column], output:
>>>>> Seq[Column])(f: A => TraversableOnce[B]): DataFrame
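One possible workaround for point 2 without the proposed overload: the column-expression explode keeps all other columns, including ones unknown at compile time, and lets the caller name the output. Column names here are illustrative:

```scala
import org.apache.spark.sql.functions.{col, explode}

// col("*") carries every existing column along; the exploded column
// gets an explicit name instead of _1/_2:
val exploded = df.select(col("*"), explode(col("arrayCol")).as("item"))
```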
>>>>>
>>>>> best,
>>>>> koert
>>>>>
>>>>
>>>>
>>>
>>
>
