GitHub user mbalassi opened a pull request:

    https://github.com/apache/flink/pull/1155

    [FLINK-2283] [streaming] Make grouped reduce/fold/aggregations stateful

    There is an open discussion at the related ticket [1] about fully removing 
the operators that I touch and partially remove here. I can accept both 
conclusions of the discussion, but even in the scenario when the operators get 
removed from the API afterwards the PR has certain merit to it:
    
    1. Cleans up the unused `StreamReduce` and `StreamFold` operators which 
should be removed either way.
    2. Adds an integration test for ensuring that not only user defined 
functions, but internal streaming operators can properly rely on the 
`OperatorState` interface. To do this it currently relies on the grouped 
reduce/fold aggregations, but this is just as important for windowing states, 
which are not properly checkpointed yet.
    3. Makes the grouped fold/reduce operators stateful, so that the previous 
test can be written.
    
    Some justification for the implementation choices:
    
    1. @gyfora has suggested to use the partitioned state at the ticket [1] 
instead of the manual map creation. In this scenario the grouped operators 
would not be unit testable any more as they would be dependent on the state 
partitioner information found in the keyed datastream. I decided against it.
    
    2. @StephanEwen has recently advised against adding unnecessary integration 
tests. [2] This is a feature that can only be tested as an integration test. I 
personally feel the need to cover internal operators with a checkpointing test 
despite the fact they currently use exactly the same mechanism as the UDFs as 
this implementation might be subject to slight changes.
    
    3. Elaborating on the previous point the `OperatorState` currently storing 
the internal state is also accessible for the user. This is an undesirable 
feature and might lead to accidental overwrite of the state. I am opening a 
Jira ticket for this. 
    
    [1] https://issues.apache.org/jira/browse/FLINK-2283
    [2] 
https://mail-archives.apache.org/mod_mbox/flink-dev/201509.mbox/%3CCANC1h_vvekciNVDzqCb8N4E5Kfzu4e1Mosnse1%3DV11HXnD2PBQ%40mail.gmail.com%3E

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mbalassi/flink aggregator-states

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1155.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1155
    
----
commit 29ca808ccb8a1d705927eabb492e70df5e5af06c
Author: mbalassi <[email protected]>
Date:   2015-09-11T14:32:09Z

    [streaming] Removed unused StreamReduce
    
    Refactored corresponding tests, some minor cleanups.

commit 4bd1dd035780402919bb5257274e9258457dadf3
Author: mbalassi <[email protected]>
Date:   2015-09-13T06:19:07Z

    [FLINK-2283] [streaming] grouped reduce and fold operators checkpoint state

commit 3688a7c98500179f454e1641aedd7758b1fdc644
Author: mbalassi <[email protected]>
Date:   2015-09-20T20:27:11Z

    [FLINK-2283] [streaming] Test for checkpointing in internal operators

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

Reply via email to