koert kuipers created SPARK-15769:
-------------------------------------

             Summary: Add Encoder for input type to Aggregator
                 Key: SPARK-15769
                 URL: https://issues.apache.org/jira/browse/SPARK-15769
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: koert kuipers
            Priority: Minor


Currently org.apache.spark.sql.expressions.Aggregator has Encoders for its 
buffer and output type, but not for its input type. The thought is that the 
input type is known from the Dataset it operates on and hence can be inserted 
later.

However, I think there are compelling reasons to have Aggregator carry an 
Encoder for its input type:
* Generally, transformations on a Dataset only require the Encoder for the 
result type, since the input type is known exactly and its Encoder is already 
available within the Dataset. This is not the case for an Aggregator: an 
Aggregator is defined independently of a Dataset, and I think it is generally 
desirable that an Aggregator work on any type that can safely be cast to the 
Aggregator's input type (for example, an Aggregator that has Long as input 
should work on a Dataset of Ints).
* Aggregators should also work on DataFrames, because Aggregator is a much 
nicer API to use than UserDefinedAggregateFunction. When operating on 
DataFrames you should not have to use Row objects, which means the input type 
is not equal to the type of the Dataset you operate on (so the Encoder of that 
Dataset should not be used as the input Encoder for the Aggregator).
* Adding an input Encoder is not a big burden, since it can typically be 
created implicitly.
* It removes TypedColumn.withInputType and its usage in Dataset, 
KeyValueGroupedDataset and RelationalGroupedDataset, which always felt somewhat 
ad hoc to me.
* Once an Aggregator has an Encoder for its input type, it is a small change 
to make the Aggregator also work on a subset of columns in a DataFrame. This 
facilitates Aggregator reuse, since you don't have to write a custom Aggregator 
to extract the columns from a specific DataFrame. It also enables a usage that 
is more typical within a DataFrame context, very similar to how a 
UserDefinedAggregateFunction is used.
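
To make the points above concrete, here is a minimal sketch of what an 
Aggregator carrying an input Encoder might look like. The zero, reduce, merge, 
finish, bufferEncoder and outputEncoder members are the existing Aggregator 
API; the inputEncoder member and the LongSum name are assumptions made for 
illustration only and are not part of Spark.

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Sums Long values. IN = Long, BUF = Long, OUT = Long.
object LongSum extends Aggregator[Long, Long, Long] {
  // Existing API: the empty buffer and how values are folded into it.
  def zero: Long = 0L
  def reduce(buffer: Long, input: Long): Long = buffer + input
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(reduction: Long): Long = reduction

  // Existing API: Encoders for the buffer and output types only.
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong

  // HYPOTHETICAL: the input Encoder proposed in this issue. Spark's
  // Aggregator does not declare this member; here it is just an extra
  // method on the object so that the sketch compiles against Spark as is.
  def inputEncoder: Encoder[Long] = Encoders.scalaLong
}
```

With the existing API this Aggregator is applied to a Dataset[Long] as 
ds.select(LongSum.toColumn). Under the proposal, the declared input Encoder 
would be what allows the same object to be applied to a Dataset of Ints or to 
a column of a DataFrame without writing a new Aggregator.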



