echauchot commented on code in PR #641:
URL: https://github.com/apache/flink-web/pull/641#discussion_r1171427168


##########
docs/content/posts/2023-04-13-howto-create-batch-source.md:
##########
@@ -0,0 +1,221 @@
+---
+title:  "Howto create a batch source with the new Source framework"
+date: "2023-04-13T08:00:00.000Z"
+authors:
+- echauchot:
+  name: "Etienne Chauchot"
+  twitter: "echauchot"
+aliases:
+- /2023/04/13/howto-create-batch-source.html
+---
+
+## Introduction
+
+Flink community has
+designed [a new Source 
framework](https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/datastream/sources/)
+based
+on 
[FLIP-27](https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface)
+lately. Some connectors have migrated to this new framework. This article is 
an howto create a batch
+source using this new framework. It was built while implementing
+the [Flink batch 
source](https://github.com/apache/flink-connector-cassandra/commit/72e3bef1fb9ee6042955b5e9871a9f70a8837cca)
+for [Cassandra](https://cassandra.apache.org/_/index.html). I felt it could be 
useful to people
+interested in contributing or migrating connectors.
+
+## Implementing the source components
+
+The aim here is not to duplicate the official documentation. For details you 
should read the
+documentation, the javadocs or the Cassandra connector code. The links are 
above. The goal here is
+to give field feedback on how to implement the different components.
+
+The source architecture is depicted in the diagrams below:
+
+![](/img/blog/2023-04-13-howto-create-batch-source/source_components.svg)
+
+![](/img/blog/2023-04-13-howto-create-batch-source/source_reader.svg)
+
+### Source
+[example Cassandra 
Source](https://github.com/apache/flink-connector-cassandra/blob/d92dc8d891098a9ca6a7de6062b4630079beaaef/flink-connector-cassandra/src/main/java/org/apache/flink/connector/cassandra/source/CassandraSource.java)
+
+The source interface only does the "glue" between all the other components. 
Its role is to
+instantiate all of them and to define the source 
[Boundedness](https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/api/connector/source/Boundedness.html).
 We also do the source configuration
+here along with user configuration validation.
+
+### SourceReader
+[example Cassandra 
SourceReader](https://github.com/apache/flink-connector-cassandra/blob/d92dc8d891098a9ca6a7de6062b4630079beaaef/flink-connector-cassandra/src/main/java/org/apache/flink/connector/cassandra/source/reader/CassandraSourceReader.java)
+
+As shown in the graphic above, the instances of the 
[SourceReader](https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/api/connector/source/SourceReader.html)
 (which we will call simply readers
+in the continuation of this article) run in parallel in task managers to read 
the actual data which
+is divided into [Splits](#split-and-splitstate). Readers request splits from 
the [SplitEnumerator](#splitenumerator-and-splitenumeratorstate) and the 
resulting splits are
+assigned to them in return.
+
+Luckily for us Flink provides the 
[SourceReaderBase](https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/connector/base/source/reader/SourceReaderBase.html)
 implementation that takes care of the
+synchronization between the readers and the main thread. Flink also provides a 
useful extension to
+this class for most cases: 
[SingleThreadMultiplexSourceReaderBase](https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/connector/base/source/reader/SingleThreadMultiplexSourceReaderBase.html).
 This class avoids having to
+specify the threading model as it is already configured to read splits with 
one thread using one
+[SplitReader](https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/connector/base/source/reader/splitreader/SplitReader.html).
 This is not harmful to the performance of the source as the SourceReaders run 
in
+parallel among the task managers.

Review Comment:
   I agree it needs clearer phrasing. I propose "each reader instance reads 
splits using one thread (but there are several reader instances that live among 
task managers)". 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to