MarkSfik commented on a change in pull request #403: URL: https://github.com/apache/flink-web/pull/403#discussion_r551875081
########## File path: _posts/2020-12-22-pulsar-flink-connector-270.md ##########
@@ -0,0 +1,171 @@

---
layout: post
title: "What's New in the Pulsar Flink Connector 2.7.0"
date: 2020-12-22T08:00:00.000Z
categories: news
authors:
- jianyun:
  name: "Jianyun Zhao"
  twitter: "yihy8023"
- jennifer:
  name: "Jennifer Huang"
  twitter: "Jennife06125739"

excerpt: With the unification of batch and streaming regarded as the future of data processing, the Pulsar Flink Connector provides an ideal solution for unified batch and stream processing with Apache Pulsar and Apache Flink. The Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12 and is fully compatible with Flink's data format. The Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository; the contribution process is ongoing.
---

## About the Pulsar Flink Connector

In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into a single computing engine with "streams" as the unified data representation. While developers have done extensive work at the computing and API layers, very little has been done at the messaging and storage layers: in reality, data remains segregated in silos created by various storage and messaging technologies. As a result, there is still no single source of truth, and day-to-day operations remain messy for developer teams. To address these messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (the source of truth) and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a unified data architecture for real-time, data-driven businesses.

The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector lets you concentrate on your business logic without worrying about the storage details.

## Challenges

From its first release, the Pulsar Flink Connector was widely adopted by both the Flink and Pulsar communities. Leveraging the connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the connector's fit for a real-time computing system.

As more users adopted the Pulsar Flink Connector, we heard a common issue from the community: serialization and deserialization are hard to get right. While the Pulsar Flink connector leverages Pulsar serialization, previous versions did not support the Flink data format. As a result, users had to do a lot of configuration before they could use the connector for real-time computing.

To make the Pulsar Flink connector easier to use, we decided to fully support the Flink data format, so that users do not need to spend time on configuration.

## What's New in Pulsar Flink Connector 2.7.0?

The Pulsar Flink Connector 2.7.0 supports features in Apache Pulsar 2.7.0 and Apache Flink 1.12, and is fully compatible with the Flink connector and Flink message format. Now you can use important Flink features, such as the exactly-once sink, the upsert Pulsar mechanism, Data Definition Language (DDL) computed columns, watermarks, and metadata. You can also leverage the Key-Shared subscription in Pulsar and perform serialization and deserialization without much configuration. Additionally, you can easily customize the configuration to fit your business needs.
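To get a feel for what native Flink data-format support means in practice, below is a minimal sketch of declaring a Pulsar-backed table through the Table API, assuming the option keys exposed by the connector's SQL integration (`'connector' = 'pulsar'`, `'service-url'`, `'admin-url'`, `'format'`). The topic, addresses, and schema are hypothetical placeholders, not part of the connector's documentation:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PulsarTableExample {
    public static void main(String[] args) {
        TableEnvironment tableEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Declare a Pulsar-backed table using a standard Flink format ('json').
        // Topic, service addresses, and schema are illustrative placeholders.
        tableEnv.executeSql(
            "CREATE TABLE user_events (\n"
          + "  user_id BIGINT,\n"
          + "  action STRING,\n"
          + "  event_time TIMESTAMP(3),\n"
          + "  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND\n"
          + ") WITH (\n"
          + "  'connector' = 'pulsar',\n"
          + "  'topic' = 'persistent://public/default/user-events',\n"
          + "  'service-url' = 'pulsar://localhost:6650',\n"
          + "  'admin-url' = 'http://localhost:8080',\n"
          + "  'format' = 'json'\n"
          + ")");

        // The table can now be queried like any other Flink table.
        tableEnv.executeSql("SELECT user_id, action FROM user_events").print();
    }
}
```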
Below, we introduce the key features in Pulsar Flink Connector 2.7.0 in detail.

### High-performance ordered message queue

Previously, when users needed to strictly guarantee message ordering, only one consumer was allowed to consume messages, which had a severe impact on throughput. To address this, we designed the Key_Shared subscription model in Pulsar. It guarantees message ordering and improves throughput by attaching a key to each message and routing messages with the same key hash to the same consumer.

<br>
<div class="row front-graphic">
  <img src="{{ site.baseurl }}/img/blog/pulsar-flink/pulsar-key-shared.png" width="640px" alt="Apache Pulsar Key-Shared Subscription"/>
</div>

Pulsar Flink Connector 2.7.0 supports the Key_Shared subscription model. You can enable this feature by setting `enable-key-hash-range` to `true`. The key hash range processed by each consumer is determined by the parallelism of the tasks.

### Introducing exactly-once semantics for the Pulsar sink (based on Pulsar transactions)

In previous versions, sink operators only supported at-least-once semantics, which could not fully meet requirements for end-to-end consistency. To deduplicate messages, users had to do some dirty work, which was not user-friendly.

Transactions are supported in Pulsar 2.7.0, which greatly improves the fault-tolerance capability of the Flink sink. In Pulsar Flink Connector 2.7.0, we designed exactly-once semantics for sink operators based on Pulsar transactions. Flink uses the two-phase commit protocol to implement its `TwoPhaseCommitSinkFunction`; the main life cycle methods are `beginTransaction()`, `preCommit()`, `commit()`, `abort()`, `recoverAndCommit()`, and `recoverAndAbort()`.
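To illustrate how these hooks line up with Pulsar transactions, here is a minimal, hypothetical skeleton built on Flink's `TwoPhaseCommitSinkFunction`. The `PulsarTxn` handle and the stubbed method bodies are illustrative stand-ins, not the connector's actual implementation:

```java
import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeutils.base.VoidSerializer;
import org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer;
import org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction;

import java.io.Serializable;
import java.util.UUID;

/** Hypothetical handle for a Pulsar transaction; the real connector tracks the TID. */
class PulsarTxn implements Serializable {
    final String tid = UUID.randomUUID().toString();
}

public class ExactlyOncePulsarSink
        extends TwoPhaseCommitSinkFunction<String, PulsarTxn, Void> {

    public ExactlyOncePulsarSink() {
        // Serializers for the transaction handle and the (unused) context.
        super(new KryoSerializer<>(PulsarTxn.class, new ExecutionConfig()),
              VoidSerializer.INSTANCE);
    }

    @Override
    protected PulsarTxn beginTransaction() {
        // Start a new Pulsar transaction; its TID is persisted with the
        // checkpoint state so it can be recovered after a failure.
        return new PulsarTxn();
    }

    @Override
    protected void invoke(PulsarTxn txn, String value, Context context) {
        // Write the record to Pulsar inside the open transaction (stubbed here).
    }

    @Override
    protected void preCommit(PulsarTxn txn) {
        // Phase one: flush all pending messages so they are durable in Pulsar.
        // Anything pre-committed must remain committable later.
    }

    @Override
    protected void commit(PulsarTxn txn) {
        // Phase two: commit the transaction; pre-committed messages become visible.
    }

    @Override
    protected void abort(PulsarTxn txn) {
        // Roll back the transaction and discard its messages on failure.
    }
}
```

Flink invokes `preCommit()` when a checkpoint barrier arrives and `commit()` once the checkpoint completes; `recoverAndCommit()` and `recoverAndAbort()` have default implementations that simply delegate to `commit()` and `abort()` for transactions restored from checkpointed state.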
You can select semantics flexibly when creating a sink operator, and the internal logic changes are transparent. Pulsar transactions are similar to the two-phase commit protocol in Flink, which will greatly improve the reliability of the connector sink.

It’s easy to implement beginTransaction and preCommit. You only need to start a Pulsar transaction, and persist the TID of the transaction after the checkpoint. In the preCommit phase, you need to ensure that all messages are flushed to Pulsar, and messages pre-committed will be committed eventually.

Review comment:

```suggestion
It’s easy to implement beginTransaction and preCommit. You only need to start a Pulsar transaction and persist the TID of the transaction after the checkpoint. In the preCommit phase, you need to ensure that all messages are flushed to Pulsar, while any pre-committed messages will be committed eventually.
```