Github user fhueske commented on the pull request:
https://github.com/apache/flink/pull/1517#issuecomment-172884984
Hi @ggevay, thanks a lot for this PR! A hash-based combiner is a long-desired
feature on our list!
I haven't looked at the code yet, but I have a few comments already.
You said that the combiner flushes the full hash table when it runs out of
memory. Do you think it would be possible to track the update frequency of
buckets and flush only the bucket with the fewest updates (or the n buckets
with the fewest updates)? This could improve performance for skewed input data.
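To make the idea concrete, here is a minimal sketch of such a partial flush. All names and the capacity/batch-size mechanics are mine, not the PR's actual hash table code; the point is just that evicting the least-updated buckets keeps hot keys cached on skewed input:

```java
import java.util.*;
import java.util.stream.Collectors;

// Hypothetical sketch (not the PR's implementation): a combiner that tracks
// per-bucket update counts and, when it runs out of room, flushes only the
// n least-updated buckets instead of the whole table.
public class PartialFlushCombiner {
    private final Map<String, Long> aggregates = new HashMap<>();
    private final Map<String, Integer> updateCounts = new HashMap<>();
    private final List<Map.Entry<String, Long>> emitted = new ArrayList<>();
    private final int capacity;       // stand-in for "out of memory"
    private final int flushBatchSize; // n buckets to evict per flush

    public PartialFlushCombiner(int capacity, int flushBatchSize) {
        this.capacity = capacity;
        this.flushBatchSize = flushBatchSize;
    }

    public void insert(String key, long value) {
        if (!aggregates.containsKey(key) && aggregates.size() >= capacity) {
            flushColdBuckets();
        }
        aggregates.merge(key, value, Long::sum);   // the reduce step (sum here)
        updateCounts.merge(key, 1, Integer::sum);
    }

    // Evict the buckets with the fewest updates so far; hot keys stay cached.
    private void flushColdBuckets() {
        List<String> coldest = aggregates.keySet().stream()
                .sorted(Comparator.comparingInt(updateCounts::get))
                .limit(flushBatchSize)
                .collect(Collectors.toList());
        for (String k : coldest) {
            emitted.add(Map.entry(k, aggregates.remove(k)));
            updateCounts.remove(k);
        }
    }

    public List<String> emittedKeys() {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, Long> e : emitted) {
            keys.add(e.getKey());
        }
        return keys;
    }

    public long get(String key) {
        return aggregates.get(key);
    }

    public static void main(String[] args) {
        PartialFlushCombiner c = new PartialFlushCombiner(2, 1);
        c.insert("hot", 1L);
        c.insert("hot", 1L);
        c.insert("cold", 5L);
        c.insert("new", 7L); // table full: "cold" (1 update) is flushed, "hot" (2 updates) stays
        System.out.println("flushed: " + c.emittedKeys()); // flushed: [cold]
    }
}
```

Tracking exact counts and sorting on every flush would be too expensive in practice; an approximation (e.g. sampling, or a clock-style second-chance bit per bucket) would likely be needed.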
Regarding your future plans:
- I agree that the sort-based strategy for reduce combiners can be removed
eventually, when the hash-based strategy proves to work well. I would like to
give it a bit more exposure though, before we drop the code.
- Porting the built-in aggregation functions, distinct, etc. from
`GroupReduceFunction`s to `ReduceFunction`s sounds good. I think the reason for
the original design decision was that, for the sort-based strategies,
`ReduceFunction`s did not have a benefit over `GroupReduceFunction`s. Instead,
they caused more method invocations (once per input record instead of once per
key).
- It would be great if you could open a JIRA and add a short design
document / API proposal for the changes on the serializers that you talked
about. This would allow the community to review and discuss your proposal.
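On the `ReduceFunction` point above: the relevant difference is that a binary `ReduceFunction` can fold each incoming record into a stored aggregate immediately, which is exactly the access pattern a hash-based combiner needs, whereas a `GroupReduceFunction` requires all values of a key at once. A minimal sketch, with the two interface shapes paraphrased (no Flink dependency):

```java
import java.util.*;

// Simplified shapes of the two UDF styles (signatures paraphrased, not the
// real Flink interfaces) to illustrate why only the binary reduce allows
// incremental, hash-based combining.
public class ReduceShapes {
    // ReduceFunction-style: combine two values, callable once per record.
    interface Reduce<T> { T reduce(T a, T b); }
    // GroupReduceFunction-style: needs all values of a key at once.
    interface GroupReduce<T> { T reduce(Iterable<T> values); }

    // Hash combiner: each record is folded into the stored aggregate as it
    // arrives; no sorting or buffering of whole groups is required.
    static <K> Map<K, Long> hashCombine(List<Map.Entry<K, Long>> records,
                                        Reduce<Long> f) {
        Map<K, Long> table = new HashMap<>();
        for (Map.Entry<K, Long> r : records) {
            table.merge(r.getKey(), r.getValue(), f::reduce);
        }
        return table;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> in = List.of(
                Map.entry("a", 1L), Map.entry("b", 2L), Map.entry("a", 3L));
        System.out.println(hashCombine(in, Long::sum)); // a=4, b=2 (order may vary)
    }
}
```

With the `GroupReduce` shape, the combiner would have to buffer every value of a key before calling the function, which defeats the purpose of combining in a fixed memory budget.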
I will shepherd this PR and hope to review it soon.
Thanks again, Fabian