Github user fhueske commented on the pull request:
https://github.com/apache/flink/pull/1517#issuecomment-172884984
Hi @ggevay, thanks a lot for this PR! A hash-based combiner is a long-desired
feature on our list!
I haven't looked at the code yet, but I have a few comments already.
You said that the combiner flushes the full hash table when it runs out of
memory. Do you think it would be possible to track the update frequency of
buckets and flush only the bucket with the fewest updates (or the n buckets
with the fewest updates)? This could improve performance for skewed input data.
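To make the idea concrete, here is a minimal sketch of such a partial flush. All names and the capacity/batch-size mechanics are mine, not the PR's actual hash table code; the point is just that evicting the least-updated buckets keeps hot keys cached on skewed input:

```java
import java.util.*;
import java.util.stream.Collectors;

// Hypothetical sketch (not the PR's implementation): a combiner that tracks
// per-bucket update counts and, when it runs out of room, flushes only the
// n least-updated buckets instead of the whole table.
public class PartialFlushCombiner {
    private final Map<String, Long> aggregates = new HashMap<>();
    private final Map<String, Integer> updateCounts = new HashMap<>();
    private final List<Map.Entry<String, Long>> emitted = new ArrayList<>();
    private final int capacity;       // stand-in for "out of memory"
    private final int flushBatchSize; // n buckets to evict per flush

    public PartialFlushCombiner(int capacity, int flushBatchSize) {
        this.capacity = capacity;
        this.flushBatchSize = flushBatchSize;
    }

    public void insert(String key, long value) {
        if (!aggregates.containsKey(key) && aggregates.size() >= capacity) {
            flushColdBuckets();
        }
        aggregates.merge(key, value, Long::sum);   // the reduce step (sum here)
        updateCounts.merge(key, 1, Integer::sum);
    }

    // Evict the buckets with the fewest updates so far; hot keys stay cached.
    private void flushColdBuckets() {
        List<String> coldest = aggregates.keySet().stream()
                .sorted(Comparator.comparingInt(updateCounts::get))
                .limit(flushBatchSize)
                .collect(Collectors.toList());
        for (String k : coldest) {
            emitted.add(Map.entry(k, aggregates.remove(k)));
            updateCounts.remove(k);
        }
    }

    public List<String> emittedKeys() {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, Long> e : emitted) {
            keys.add(e.getKey());
        }
        return keys;
    }

    public long get(String key) {
        return aggregates.get(key);
    }

    public static void main(String[] args) {
        PartialFlushCombiner c = new PartialFlushCombiner(2, 1);
        c.insert("hot", 1L);
        c.insert("hot", 1L);
        c.insert("cold", 5L);
        c.insert("new", 7L); // table full: "cold" (1 update) is flushed, "hot" (2 updates) stays
        System.out.println("flushed: " + c.emittedKeys()); // flushed: [cold]
    }
}
```

Tracking exact counts and sorting on every flush would be too expensive in practice; an approximation (e.g. sampling, or a clock-style second-chance bit per bucket) would likely be needed.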
Regarding your future plans:
- I agree that the sort-based strategy for reduce combiners can be removed
eventually, when the hash-based strategy proves to work well. I would like to
give it a bit more exposure though, before we drop the code.
- Porting the built-in aggregation functions, distinct, etc. from
`GroupReduceFunction`s to `ReduceFunction`s sounds good. I think the reason for
the original design decision was that, for the sort-based strategies,
`ReduceFunction`s did not have a benefit over `GroupReduceFunction`s. Instead,
they caused more method invocations (once per input record instead of once per
key).
- It would be great if you could open a JIRA and add a short design
document / API proposal for the changes on the serializers that you talked
about. This would allow the community to review and discuss your proposal.
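On the `ReduceFunction` point above: the relevant difference is that a binary `ReduceFunction` can fold each incoming record into a stored aggregate immediately, which is exactly the access pattern a hash-based combiner needs, whereas a `GroupReduceFunction` requires all values of a key at once. A minimal sketch, with the two interface shapes paraphrased (no Flink dependency):

```java
import java.util.*;

// Simplified shapes of the two UDF styles (signatures paraphrased, not the
// real Flink interfaces) to illustrate why only the binary reduce allows
// incremental, hash-based combining.
public class ReduceShapes {
    // ReduceFunction-style: combine two values, callable once per record.
    interface Reduce<T> { T reduce(T a, T b); }
    // GroupReduceFunction-style: needs all values of a key at once.
    interface GroupReduce<T> { T reduce(Iterable<T> values); }

    // Hash combiner: each record is folded into the stored aggregate as it
    // arrives; no sorting or buffering of whole groups is required.
    static <K> Map<K, Long> hashCombine(List<Map.Entry<K, Long>> records,
                                        Reduce<Long> f) {
        Map<K, Long> table = new HashMap<>();
        for (Map.Entry<K, Long> r : records) {
            table.merge(r.getKey(), r.getValue(), f::reduce);
        }
        return table;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> in = List.of(
                Map.entry("a", 1L), Map.entry("b", 2L), Map.entry("a", 3L));
        System.out.println(hashCombine(in, Long::sum)); // a=4, b=2 (order may vary)
    }
}
```

With the `GroupReduce` shape, the combiner would have to buffer every value of a key before calling the function, which defeats the purpose of combining in a fixed memory budget.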
I will shepherd this PR and hope to review it soon.
Thanks again, Fabian