[GitHub] [flink] twalthr opened a new pull request #16793: Flink 20897

GitBox Thu, 12 Aug 2021 05:49:19 -0700


twalthr opened a new pull request #16793:
URL: https://github.com/apache/flink/pull/16793



   ## What is the purpose of the change
   
   This enables batch mode for `StreamTableEnvironment`.
   
   Both `StreamExecutionEnvironment`, `TableEnvironment`, and 
`StreamTableEnvironment` use `StreamGraphGenerator` with the same 
configuration. Previous work ensured that when `execution.runtime-mode` is set 
to `BATCH` all batch properties are either set consistently (e.g. shuffle mode) 
or have no impact on the pipeline (e.g. auto watermark interval, state 
backends).
   
   Most of the changes are removing checks and ensuring that internal (e.g. 
values) and external (e.g. data stream, table source) source transformations 
are set to `BOUNDED`. The latter is a complex topic as we currently use 4 
different ways of expressing external sources:
   
   - `InputFormat`: Boundedness needs to be explicitly set by the planner due 
to custom formats that don't extend from `FileInputFormat`.
   - `SourceFunctionProvider`: Boundedness needs to be explicitly set by the 
planner via custom transformation to also disable progressive watermarks.
   - `DataStreamScanProvider`: Boundedness needs to be explicitly set by the 
planner to ensure old behavior again. New source interfaces + `FileInputFormat` 
are fine.
   - `TransformationScanProvider`: Boundedness can be derived automatically and 
will only work with new source interfaces + `FileInputFormat`.
   
   ## Brief change log
   
   - Fully rely on `StreamGraphGenerator` to enable batch mode and batch 
properties
   - Set boundedness of sources to satisfy checks in `StreamGraphGenerator`
   
   ## Verifying this change
   
   This change added tests and can be verified as follows: 
`DataStreamJavaITCase` and manual tests and experiments
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: yes
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): yes
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? yes
     - If yes, how is the feature documented? not documented yet
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [flink] twalthr opened a new pull request #16793: Flink 20897

Reply via email to