[ https://issues.apache.org/jira/browse/FLINK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16355606#comment-16355606 ]
ASF GitHub Bot commented on FLINK-8571: --------------------------------------- GitHub user StefanRRichter opened a pull request: https://github.com/apache/flink/pull/5424 FLINK-8571] [DataStream] Introduce utility function that reinterprets a data stream as keyed stream ## What is the purpose of the change This change introduces a utility function (`DataStreamUtils#reinterpretAsKeyedStream(...)`) that re-interprets any data stream as a keyed stream. Currently, the intended use-case are source that are already pre-partitioned w.r.t. the key group partitioning of Flink's keyBy. With this, for materialized shuffles that are picked up through a re-interpreted source, a job that uses keyed state can become embarrassingly parallel. This, in turn, allows for fine-grained recovery options where tasks can still make progress if other tasks fail. ## Brief change log - Moved `DataStreamUtils` to a better package. - Introduced utility functions to re-interpret data streams as keyed in `DataStreamUtils` and the counterpart from the Scala API. ## Verifying this change See `DataStreamUtilsTest#testReinterpretAsKeyedStream`. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (no) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes) - The serializers: (no) - The runtime per-record code paths (performance sensitive): (no) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no) - The S3 file system connector: (no) ## Documentation - Does this pull request introduce a new feature? (yes) - If yes, how is the feature documented? (Docs + JavaDocs) You can merge this pull request into a Git repository by running: $ git pull https://github.com/StefanRRichter/flink key-partitioned-source Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/5424.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5424 ---- commit 45919ffc7ca064ba612c0493b93694111d18d430 Author: Stefan Richter <s.richter@...> Date: 2018-02-07T10:04:14Z Move DataStreamUtils to the datastream API package so that we can actually use it to expose package-private constructors or methods for experimental features. commit 8f9a2d78fbe0d97cf7c4997eba539a963ba7aee4 Author: Stefan Richter <s.richter@...> Date: 2018-02-02T17:39:51Z [FLINK-8571] [DataStream] Introduce utility function that reinterprets a data stream as keyed stream ---- > Provide an enhanced KeyedStream implementation to use ForwardPartitioner > ------------------------------------------------------------------------ > > Key: FLINK-8571 > URL: https://issues.apache.org/jira/browse/FLINK-8571 > Project: Flink > Issue Type: Improvement > Reporter: Nagarjun Guraja > Assignee: Stefan Richter > Priority: Major > > This enhancement would help in modeling problems with pre partitioned input > sources(for e.g. Kafka with Keyed topics). This would help in making the job > graph embarrassingly parallel while leveraging rocksdb state backend and also > the fine grained recovery semantics. -- This message was sent by Atlassian JIRA (v7.6.3#76005)