[ 
https://issues.apache.org/jira/browse/SPARK-15769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-15769:
---------------------------------
    Labels: bulk-closed  (was: )

> Add Encoder for input type to Aggregator
> ----------------------------------------
>
>                 Key: SPARK-15769
>                 URL: https://issues.apache.org/jira/browse/SPARK-15769
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: koert kuipers
>            Priority: Minor
>              Labels: bulk-closed
>
> Currently org.apache.spark.sql.expressions.Aggregator has Encoders for its 
> buffer and output type, but not for its input type. The thought is that the 
> input type is known from the Dataset it operates on and hence can be inserted 
> later.
> However i think there are compelling reasons to have Aggregator carry an 
> Encoder for its input type:
> * Generally transformations on Dataset only require the Encoder for the 
> result type since the input type is exactly known and it's Encoder is already 
> available within the Dataset. However this is not the case for an Aggregator: 
> an Aggregator is defined independently of a Dataset, and i think it should be 
> generally desirable that an Aggregator work on any type that can safely be 
> cast to the Aggregator's input type (for example an Aggregator that has Long 
> as input should work on a Dataset of Ints).
> * Aggregators should also work on DataFrames, because its a much nicer API to 
> use than UserDefinedAggregateFunction. And when operating on DataFrames you 
> should not have to use Row objects, which means your input type is not equal 
> to the type of the Dataset you operate on (so the Encoder of the Dataset that 
> is operated on should not be used as input Encoder for the Aggregator).
> * Adding an input Encoder is not a big burden, since it can typically be 
> created implicitly
> * It removes TypedColumn.withInputType and its usage in Dataset, 
> KeyValueGroupedDataset and RelationalGroupedDataset, which always felt 
> somewhat ad-hoc to me
> * Once an Aggregator has an Encoder for it's input type it is a small change 
> to make the Aggregator also work on a subset of columns in a DataFrame, which 
> facilitates Aggregator re-use since you don't have to write a custom 
> Aggregator to extract the columns from a specific DataFrame. This also 
> enables a usage that is more typical within a DataFrame context, very similar 
> to how a UserDefinedAggregateFunction is used.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to