morsapaes commented on a change in pull request #397:
URL: https://github.com/apache/flink-web/pull/397#discussion_r538129304
##########
File path: _posts/2020-12-04-release-1.12.0.md
##########
@@ -0,0 +1,332 @@
+---
+layout: post
+title: "Apache Flink 1.12.0 Release Announcement"
+date: 2020-12-04T08:00:00.000Z
+categories: news
+authors:
+- morsapaes:
+ name: "Marta Paes"
+ twitter: "morsapaes"
+- aljoscha:
+ name: "Aljoscha Krettek"
+ twitter: "aljoscha"
+
+excerpt: The Apache Flink community is excited to announce the release of
Flink 1.12.0! Close to 300 contributors worked on over 1k tickets to bring
significant improvements to usability as well as new features to Flink users
across the whole API stack. We're particularly excited about adding efficient
batch execution to the DataStream API, Kubernetes HA as an alternative to
ZooKeeper, support for upsert mode in the Kafka SQL connector and the new
Python DataStream API! Read on for all major new features and improvements,
important changes to be aware of and what to expect moving forward!
+---
+
+The Apache Flink community is excited to announce the release of Flink 1.12.0!
Close to 300 contributors worked on over 1k tickets to bring significant
improvements to usability as well as new features that simplify (and unify)
Flink handling across the API stack.
+
+**Release Highlights**
+
+* The community has added support for **efficient batch execution** in the
DataStream API. This is the next major milestone towards achieving a truly
unified runtime for both batch and stream processing.
+
+* **Kubernetes-based High Availability (HA)** was implemented as an
alternative to ZooKeeper for highly available production setups.
+
+* The Kafka SQL connector has been extended to work in **upsert mode**,
supported by the ability to handle **connector metadata** in SQL DDL.
**Temporal table joins** can now also be fully expressed in SQL, no longer
depending on the Table API.
+
+* Support for the **DataStream API in PyFlink** expands its usage to more
complex scenarios that require fine-grained control over state and time, and
it’s now possible to deploy PyFlink jobs natively on **Kubernetes**.
+
+This blog post describes all major new features and improvements, important
changes to be aware of and what to expect moving forward.
+
+{% toc %}
+
+The binary distribution and source artifacts are now available on the updated
[Downloads page]({{ site.baseurl }}/downloads.html) of the Flink website, and
the most recent distribution of PyFlink is available on
[PyPI](https://pypi.org/project/apache-flink/). Please review the [release
notes]({{ site.DOCS_BASE_URL
}}flink-docs-release-1.12/release-notes/flink-1.12.html) carefully, and check
the complete [release
changelog](https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12348263&styleName=Html&projectId=12315522)
and [updated documentation]({{ site.DOCS_BASE_URL }}flink-docs-release-1.12/)
for more details.
+
+We encourage you to download the release and share your feedback with the
community through the [Flink mailing
lists](https://flink.apache.org/community.html#mailing-lists) or
[JIRA](https://issues.apache.org/jira/projects/FLINK/summary).
+
+## New Features and Improvements
+
+### Batch Execution Mode in the DataStream API
+
+Flink’s core APIs have developed organically over the lifetime of the project,
and were initially designed with specific use cases in mind. And while the
Table API/SQL already has unified operators, using lower-level abstractions
still requires you to choose between two semantically different APIs for batch
(DataSet API) and streaming (DataStream API). Since _a batch is a subset of an
unbounded stream_, there are some clear advantages to consolidating them under
a single API:
+
+* **Reusability:** efficient batch and stream processing under the same API
would allow you to easily switch between both execution modes without rewriting
any code. So, a job could be easily reused to process real-time and historical
data.
+
+* **Operational simplicity:** providing a unified API would mean using a
single set of connectors, maintaining a single codebase and being able to
easily implement mixed execution pipelines _e.g._ for use cases like
backfilling.
+
+With these advantages in mind, the community has taken the first step towards
the unification of the DataStream API: supporting efficient batch execution
([FLIP-134](https://cwiki.apache.org/confluence/display/FLINK/FLIP-134%3A+Batch+execution+for+the+DataStream+API)).
This means that, in the long run, the DataSet API will be deprecated and
subsumed by the DataStream API and the Table API/SQL
([FLIP-131](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866741)).
+
+**Batch for Bounded Streams**
+
+You could already use the DataStream API to process bounded streams (_e.g._
files), with the limitation that the runtime was not “aware” that the job was
bounded. To optimize the runtime for bounded input, the new `BATCH` execution
mode uses sort-based shuffles with purely in-memory aggregations and an
improved scheduling strategy (_see [Pipelined Region
Scheduling](#pipelined-region-scheduling-flip-119)_). As a result, `BATCH` mode
execution in the DataStream API already comes very close to the performance of
the DataSet API in Flink 1.12. For more details on the performance benchmark,
check the original proposal
([FLIP-140](https://cwiki.apache.org/confluence/display/FLINK/FLIP-140%3A+Introduce+batch-style+execution+for+bounded+keyed+streams)).
+
+<center>
+ <figure>
+ <img src="{{ site.baseurl }}/img/blog/2020-12-04-release-1.12.0/1.png"
width="600px"/>
+ </figure>
+</center>
+
+<div style="line-height:60%;">
+ <br>
+</div>
+
+In Flink 1.12, the default execution mode is `STREAMING`. To configure a job
to run in `BATCH` mode, you can set the configuration when submitting a job:
+
+```bash
+bin/flink run -Dexecution.runtime-mode=BATCH examples/streaming/WordCount.jar
+```
+
+Alternatively, you can set it programmatically:
+
+```java
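+import org.apache.flink.api.common.RuntimeExecutionMode;
+import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
+
+StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
+
+// Run this DataStream job with batch-style execution (the default is STREAMING).
+env.setRuntimeMode(RuntimeExecutionMode.BATCH);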
+```
+
+<div style="line-height:150%;">
+ <br>
+</div>
+
+<span class="label label-info">Note</span> <font size="3">Although the DataSet
API has not been deprecated yet, we recommend that users give preference to the
DataStream API with <code>BATCH</code> execution mode for new batch jobs, and
consider migrating existing DataSet jobs.</font>
+
+### New Data Sink API (Beta)
+
+Ensuring that connectors can work for both execution modes has already been
covered for data sources in the [previous
release](https://flink.apache.org/news/2020/07/06/release-1.11.0.html#new-data-source-api-beta),
so in Flink 1.12 the community focused on implementing a unified Data Sink API
([FLIP-143](https://cwiki.apache.org/confluence/display/FLINK/FLIP-143%3A+Unified+Sink+API)).
The new abstraction introduces a write/commit protocol and a more modular
interface where the individual components are transparently exposed to the
framework.
+
+A _Sink_ implementor will have to provide the **what** and **how**: a
[_SinkWriter_]({{ site.DOCS_BASE_URL
}}flink-docs-release-1.12/api/java/org/apache/flink/api/connector/sink/SinkWriter.html)
that writes data and outputs what needs to be committed (i.e. committables);
and a [_Committer_]({{ site.DOCS_BASE_URL
}}flink-docs-release-1.12/api/java/org/apache/flink/api/connector/sink/Committer.html)
and [_GlobalCommitter_]({{ site.DOCS_BASE_URL
}}flink-docs-release-1.12/api/java/org/apache/flink/api/connector/sink/GlobalCommitter.html)
that encapsulate how to handle the committables. The framework is responsible
for the **when** and **where**: at what time and on which machine or process to
commit.
+
+<center>
+ <figure>
+ <img src="{{ site.baseurl }}/img/blog/2020-12-04-release-1.12.0/2.png"
width="700px"/>
+ </figure>
+</center>
+
+<div style="line-height:150%;">
+ <br>
+</div>
+
+This more modular abstraction makes it possible to support different runtime
implementations for the `BATCH` and `STREAMING` execution modes that are
efficient for their intended purpose, while using just one unified sink
implementation. In Flink 1.12, the [FileSink connector]({{ site.DOCS_BASE_URL
}}flink-docs-release-1.12/dev/connectors/file_sink.html) is the unified drop-in
replacement for StreamingFileSink
([FLINK-19758](https://issues.apache.org/jira/browse/FLINK-19758)). The
remaining connectors will be ported to the new interfaces in future releases.
+
+### Kubernetes High Availability (HA) Service
+
+Kubernetes provides built-in functionalities that Flink can leverage for
JobManager failover, instead of relying on
[ZooKeeper](https://zookeeper.apache.org/). To enable a “ZooKeeperless” HA
setup, the community implemented a Kubernetes HA service in Flink 1.12
([FLIP-144](https://cwiki.apache.org/confluence/x/H0V4CQ)). The service is
built on the same [base interface]({{ site.DOCS_BASE_URL
}}flink-docs-release-1.12/api/java/org/apache/flink/runtime/highavailability/HighAvailabilityServices.html)
as the ZooKeeper implementation and uses Kubernetes’
[ConfigMap](https://kubernetes.io/docs/concepts/configuration/configmap/)
objects to handle all the metadata needed to recover from a JobManager failure.
For more details and examples on how to configure a highly available Kubernetes
cluster, check out the [documentation]({{ site.DOCS_BASE_URL
}}flink-docs-release-1.12/deployment/ha/kubernetes_ha.html).
+
+<div class="alert alert-info small" markdown="1">
+<b>Note:</b> This does not mean that the ZooKeeper dependency will be dropped,
just that there will be an alternative for users of Flink on Kubernetes.
+</div>
+
+<hr>
+
+### Other Improvements
+
+**Migration of existing connectors to the new Data Source API**
+
+The previous release introduced a new Data Source API
([FLIP-27](https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface)),
making it possible to implement connectors that work both as bounded (batch) and
unbounded (streaming) sources. In Flink 1.12, the community started porting
existing source connectors to the new interfaces, starting with the FileSystem
connector ([FLINK-19161](https://issues.apache.org/jira/browse/FLINK-19161)).
+
+<div class="alert alert-danger small" markdown="1">
+<b>Attention:</b> The unified source implementations will be completely
separate connectors that are not snapshot-compatible with their legacy
counterparts.
+</div>
+
+**Pipelined Region Scheduling
([FLIP-119](https://cwiki.apache.org/confluence/display/FLINK/FLIP-119+Pipelined+Region+Scheduling#FLIP119PipelinedRegionScheduling-BulkSlotAllocation))**
+
+Flink’s scheduler has been largely designed to address batch and streaming
workloads separately. This release introduces a **unified** scheduling strategy
that identifies sets of tasks connected via blocking data exchanges to break
down the execution graph into _pipelined regions_. This allows each region to
be scheduled only when there’s data to process, deployed only once all the
required resources are available, and restarted independently on failure.
In particular for batch jobs, the new strategy leads to more
efficient resource utilization and eliminates deadlocks.
+
+**Support for Sort-Merge Shuffles
([FLIP-148](https://cwiki.apache.org/confluence/display/FLINK/FLIP-148%3A+Introduce+Sort-Merge+Based+Blocking+Shuffle+to+Flink))**
+
+To improve the stability, performance and resource utilization of large-scale
batch jobs, the community introduced sort-merge shuffle as an alternative to
the original shuffle implementation that Flink already used. This approach can
reduce shuffle time
[significantly](https://www.mail-archive.com/[email protected]/msg42472.html),
and uses fewer file handles and file write buffers (whose heavy use is
problematic for large-scale jobs). Further optimizations will be implemented in upcoming
releases ([FLINK-19614](https://issues.apache.org/jira/browse/FLINK-19614)).
+
+<div class="alert alert-danger small" markdown="1">
+<b>Attention:</b> This feature is experimental and not enabled by default. To
enable sort-merge shuffles, you can configure a reasonable minimum parallelism
threshold in the <a
href="https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#taskmanager-network-sort-shuffle-min-parallelism">TaskManager
network configuration options</a>.
Review comment:
Thanks for catching that, @HuangXingBo.