[ 
https://issues.apache.org/jira/browse/FLINK-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14591580#comment-14591580
 ] 

Fabian Hueske commented on FLINK-2237:
--------------------------------------

Looking at the code of the other hash tables is a very good idea. Before 
starting with a Driver implementation, I would recommend building the hash 
table stand-alone and adding extensive unit tests to verify that it works 
correctly.

I don't think that a two-pass hash algorithm for combining and reducing would 
be very beneficial performance-wise. As [~StephanEwen] said on the mailing 
list, a hash-based combiner is expected to give the best performance 
improvement. Such a hash-based combiner works by incrementally updating the 
(pre-)aggregated result in a hash table in a single pass. There are basically 
two implementation designs for a combiner hash table that is incrementally 
updated.
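To illustrate the single-pass idea in isolation (this is a hypothetical sketch using a plain Java HashMap, not Flink's managed-memory hash table; all names are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a single-pass hash-based combiner: each input
// record is merged into the running pre-aggregate for its key instead of
// being sorted and grouped first.
public class HashCombinerSketch {

    // Pre-aggregates (key, value) pairs by summing the values per key.
    static Map<String, Integer> combine(String[] keys, int[] values) {
        Map<String, Integer> table = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            // Incrementally update the aggregate in a single pass.
            table.merge(keys[i], values[i], Integer::sum);
        }
        return table;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "a", "a", "b"};
        int[] values = {1, 2, 3, 4, 5};
        Map<String, Integer> result = combine(keys, values);
        System.out.println(result.get("a") + " " + result.get("b")); // prints 8 7
    }
}
```

The sort-based path instead sorts all records by key and reduces adjacent runs; the hash-based path emits pre-aggregates directly, touching each input record once.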

1. A hash table for records with a fixed-length binary representation. If the 
length of an aggregated record does not change in the Combine operation (either 
because it is inherently fixed-length, like a Tuple2<Integer, Integer>, or 
because we know that the Combine function only updates fixed-length fields of 
a record), the hash table can update the record in place.
2. A hash table for records with a variable-length binary representation. If 
the length of the binary representation of a record changes in the Combine 
operation, the hash table cannot update the record in place. Instead, a new 
record needs to be appended and the old one invalidated. Periodically, the 
table needs to be compacted.
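The in-place update of design 1 can be sketched on a raw byte buffer (again a hypothetical illustration, not Flink's memory-segment code; the 8-byte record layout is an assumption):

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of design 1: a fixed-length record (a pair of ints,
// 8 bytes total) lives in a binary buffer, and the Combine step updates the
// aggregate field in place, since the record length never changes.
public class FixedLengthUpdateSketch {

    // Adds 'value' into the 4-byte aggregate field of the record at 'offset'.
    // No append/invalidate/compaction is needed because the record keeps
    // its fixed 8-byte footprint.
    static void combineInPlace(ByteBuffer table, int offset, int value) {
        int aggPos = offset + 4; // key occupies bytes [0,4), aggregate [4,8)
        table.putInt(aggPos, table.getInt(aggPos) + value);
    }

    public static void main(String[] args) {
        ByteBuffer table = ByteBuffer.allocate(8);
        table.putInt(0, 7);   // key field
        table.putInt(4, 10);  // aggregate field
        combineInPlace(table, 0, 5);
        System.out.println(table.getInt(4)); // prints 15
    }
}
```

A variable-length record (design 2) could not be patched this way: a longer serialized value would overrun its slot, forcing an append of the new record, invalidation of the old slot, and periodic compaction of the table.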

The hash table for fixed-length records is certainly easier to realize and 
more efficient. Hence, I would recommend working on that one first.

> Add hash-based Aggregation
> --------------------------
>
>                 Key: FLINK-2237
>                 URL: https://issues.apache.org/jira/browse/FLINK-2237
>             Project: Flink
>          Issue Type: New Feature
>            Reporter: Rafiullah Momand
>            Priority: Minor
>              Labels: github-import
>             Fix For: pre-apache
>
>
> Aggregation functions at the moment are implemented in a sort-based way.
> How can we implement hash based Aggregation for Flink?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
