[
https://issues.apache.org/jira/browse/FLINK-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14591580#comment-14591580
]
Fabian Hueske commented on FLINK-2237:
--------------------------------------
Looking at the code of the other hash tables is a very good idea. Before
starting with a Driver implementation, I would recommend building the hash
table stand-alone and adding extensive unit tests to verify that it works
correctly.
I don't think that a two-pass hash algorithm for combining and reducing would
be very beneficial performance-wise. As [~StephanEwen] said on the mailing
list, a hash-based combiner is expected to give the best performance
improvement. Such a hash-based combiner works by incrementally updating
the (pre-)aggregated result in a hash table in a single pass. There are
basically two implementation designs for a combiner hash table that is
incrementally updated.
1. A hash table for records with a fixed-length binary representation. If the
length of an aggregated record does not change in the Combine operation (either
because it is inherently fixed-length, like a Tuple2<Integer, Integer>, or
because we know that the Combine function only updates fixed-length fields of
a record), the hash table can update the record in place.
2. A hash table for records with a variable-length binary representation. If the
length of the binary representation of a record changes in the Combine
operation, the hash table cannot be updated in place. Instead, the new version
of the record needs to be appended and the old one invalidated. Periodically,
the table needs to be compacted; a rough sketch of this update path follows
after this list.
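For the variable-length case (2.), the append/invalidate/compact cycle could look
roughly like the following stand-alone sketch. This is not existing Flink code:
all names (VariableLengthCombinerTable, insertOrCombine, compact) are made up for
illustration, the record area is a plain heap byte array instead of Flink's managed
MemorySegments, and the combine function is a simple string concatenation.
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class VariableLengthCombinerTable {

    private byte[] data = new byte[1024];   // append-only record area
    private int writePos = 0;               // next free position in the record area
    private int liveBytes = 0;              // bytes occupied by current record versions

    // key -> {offset, length} of the current (live) version of the record
    private final Map<Integer, int[]> index = new HashMap<>();

    public void insertOrCombine(int key, String value) {
        int[] entry = index.get(key);
        String combined = value;
        if (entry != null) {
            // combine with the previous version; here simply by concatenation
            String old = new String(data, entry[0], entry[1], StandardCharsets.UTF_8);
            combined = old + "," + value;
            liveBytes -= entry[1];          // the old version becomes garbage
        }
        byte[] bytes = combined.getBytes(StandardCharsets.UTF_8);
        if (writePos + bytes.length > data.length) {
            compact();                      // reclaim the space of invalidated records
        }
        if (writePos + bytes.length > data.length) {
            // still not enough room; grow (a real combiner would flush/spill instead)
            data = Arrays.copyOf(data, Math.max(data.length * 2, writePos + bytes.length));
        }
        // append the new version and re-point the index to it
        System.arraycopy(bytes, 0, data, writePos, bytes.length);
        index.put(key, new int[] {writePos, bytes.length});
        writePos += bytes.length;
        liveBytes += bytes.length;
    }

    /** Copies only the live record versions into a fresh area, dropping invalidated ones. */
    private void compact() {
        byte[] fresh = new byte[data.length];
        int pos = 0;
        for (int[] e : index.values()) {
            System.arraycopy(data, e[0], fresh, pos, e[1]);
            e[0] = pos;
            pos += e[1];
        }
        data = fresh;
        writePos = pos;
    }
}
{code}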
The hash table for fixed-length records is certainly easier to realize and
more efficient. Hence, I would recommend working on that; a minimal sketch of
such a table follows below.
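To make that concrete, here is a minimal stand-alone sketch of a fixed-length,
in-place updating combiner table. Again, nothing here is existing Flink API: it
works on a heap ByteBuffer rather than MemorySegments, uses simple linear probing,
assumes int keys and values (with Integer.MIN_VALUE reserved as the "empty bucket"
marker), and hard-codes the combine function to a sum.
{code:java}
import java.nio.ByteBuffer;

public class FixedLengthCombinerTable {

    private static final int RECORD_LEN = 8;            // 4 bytes key + 4 bytes value
    private static final int EMPTY = Integer.MIN_VALUE; // reserved marker for a free bucket

    private final ByteBuffer table;
    private final int numBuckets;

    public FixedLengthCombinerTable(int numBuckets) {
        this.numBuckets = numBuckets;
        this.table = ByteBuffer.allocate(numBuckets * RECORD_LEN);
        for (int i = 0; i < numBuckets; i++) {
            table.putInt(i * RECORD_LEN, EMPTY);         // mark all buckets as free
        }
    }

    /** Combines (key, value) into the table; the combine function here is a sum. */
    public void insertOrCombine(int key, int value) {
        int bucket = (key & 0x7fffffff) % numBuckets;
        // linear probing; a real combiner would flush or spill when the table fills up
        for (int i = 0; i < numBuckets; i++) {
            int pos = ((bucket + i) % numBuckets) * RECORD_LEN;
            int storedKey = table.getInt(pos);
            if (storedKey == EMPTY) {                    // free bucket: insert the new record
                table.putInt(pos, key);
                table.putInt(pos + 4, value);
                return;
            }
            if (storedKey == key) {                      // same key: update the value in place
                table.putInt(pos + 4, table.getInt(pos + 4) + value);
                return;
            }
        }
        throw new IllegalStateException("Table full - a real combiner would flush here");
    }
}
{code}
A combiner Driver would presumably call insertOrCombine for every input record
and emit the table's contents once the input is exhausted or memory runs out.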
> Add hash-based Aggregation
> --------------------------
>
> Key: FLINK-2237
> URL: https://issues.apache.org/jira/browse/FLINK-2237
> Project: Flink
> Issue Type: New Feature
> Reporter: Rafiullah Momand
> Priority: Minor
> Labels: github-import
> Fix For: pre-apache
>
>
> Aggregation functions at the moment are implemented in a sort-based way.
> How can we implement hash-based Aggregation for Flink?