echauchot commented on code in PR #641: URL: https://github.com/apache/flink-web/pull/641#discussion_r1171427168
########## docs/content/posts/2023-04-13-howto-create-batch-source.md: ########## @@ -0,0 +1,221 @@ +--- +title: "Howto create a batch source with the new Source framework" +date: "2023-04-13T08:00:00.000Z" +authors: +- echauchot: + name: "Etienne Chauchot" + twitter: "echauchot" +aliases: +- /2023/04/13/howto-create-batch-source.html +--- + +## Introduction + +Flink community has +designed [a new Source framework](https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/datastream/sources/) +based +on [FLIP-27](https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface) +lately. Some connectors have migrated to this new framework. This article is an howto create a batch +source using this new framework. It was built while implementing +the [Flink batch source](https://github.com/apache/flink-connector-cassandra/commit/72e3bef1fb9ee6042955b5e9871a9f70a8837cca) +for [Cassandra](https://cassandra.apache.org/_/index.html). I felt it could be useful to people +interested in contributing or migrating connectors. + +## Implementing the source components + +The aim here is not to duplicate the official documentation. For details you should read the +documentation, the javadocs or the Cassandra connector code. The links are above. The goal here is +to give field feedback on how to implement the different components. + +The source architecture is depicted in the diagrams below: + + + + + +### Source +[example Cassandra Source](https://github.com/apache/flink-connector-cassandra/blob/d92dc8d891098a9ca6a7de6062b4630079beaaef/flink-connector-cassandra/src/main/java/org/apache/flink/connector/cassandra/source/CassandraSource.java) + +The source interface only does the "glue" between all the other components. Its role is to +instantiate all of them and to define the source [Boundedness](https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/api/connector/source/Boundedness.html). We also do the source configuration +here along with user configuration validation. + +### SourceReader +[example Cassandra SourceReader](https://github.com/apache/flink-connector-cassandra/blob/d92dc8d891098a9ca6a7de6062b4630079beaaef/flink-connector-cassandra/src/main/java/org/apache/flink/connector/cassandra/source/reader/CassandraSourceReader.java) + +As shown in the graphic above, the instances of the [SourceReader](https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/api/connector/source/SourceReader.html) (which we will call simply readers +in the continuation of this article) run in parallel in task managers to read the actual data which +is divided into [Splits](#split-and-splitstate). Readers request splits from the [SplitEnumerator](#splitenumerator-and-splitenumeratorstate) and the resulting splits are +assigned to them in return. + +Luckily for us Flink provides the [SourceReaderBase](https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/connector/base/source/reader/SourceReaderBase.html) implementation that takes care of the +synchronization between the readers and the main thread. Flink also provides a useful extension to +this class for most cases: [SingleThreadMultiplexSourceReaderBase](https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/connector/base/source/reader/SingleThreadMultiplexSourceReaderBase.html). This class avoids having to +specify the threading model as it is already configured to read splits with one thread using one +[SplitReader](https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/connector/base/source/reader/splitreader/SplitReader.html). This is not harmful to the performance of the source as the SourceReaders run in +parallel among the task managers. Review Comment: I agree it needs clearer phrasing. I propose "each reader instance reads splits using one thread but there are several reader instances that live among task managers". ########## docs/content/posts/2023-04-13-howto-create-batch-source.md: ########## @@ -0,0 +1,221 @@ +--- +title: "Howto create a batch source with the new Source framework" +date: "2023-04-13T08:00:00.000Z" +authors: +- echauchot: + name: "Etienne Chauchot" + twitter: "echauchot" +aliases: +- /2023/04/13/howto-create-batch-source.html +--- + +## Introduction + +Flink community has +designed [a new Source framework](https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/datastream/sources/) +based +on [FLIP-27](https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface) +lately. Some connectors have migrated to this new framework. This article is an howto create a batch +source using this new framework. It was built while implementing +the [Flink batch source](https://github.com/apache/flink-connector-cassandra/commit/72e3bef1fb9ee6042955b5e9871a9f70a8837cca) +for [Cassandra](https://cassandra.apache.org/_/index.html). I felt it could be useful to people +interested in contributing or migrating connectors. + +## Implementing the source components + +The aim here is not to duplicate the official documentation. For details you should read the +documentation, the javadocs or the Cassandra connector code. The links are above. The goal here is +to give field feedback on how to implement the different components. + +The source architecture is depicted in the diagrams below: + + + + + +### Source +[example Cassandra Source](https://github.com/apache/flink-connector-cassandra/blob/d92dc8d891098a9ca6a7de6062b4630079beaaef/flink-connector-cassandra/src/main/java/org/apache/flink/connector/cassandra/source/CassandraSource.java) + +The source interface only does the "glue" between all the other components. Its role is to +instantiate all of them and to define the source [Boundedness](https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/api/connector/source/Boundedness.html). We also do the source configuration +here along with user configuration validation. + +### SourceReader +[example Cassandra SourceReader](https://github.com/apache/flink-connector-cassandra/blob/d92dc8d891098a9ca6a7de6062b4630079beaaef/flink-connector-cassandra/src/main/java/org/apache/flink/connector/cassandra/source/reader/CassandraSourceReader.java) + +As shown in the graphic above, the instances of the [SourceReader](https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/api/connector/source/SourceReader.html) (which we will call simply readers +in the continuation of this article) run in parallel in task managers to read the actual data which +is divided into [Splits](#split-and-splitstate). Readers request splits from the [SplitEnumerator](#splitenumerator-and-splitenumeratorstate) and the resulting splits are +assigned to them in return. + +Luckily for us Flink provides the [SourceReaderBase](https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/connector/base/source/reader/SourceReaderBase.html) implementation that takes care of the +synchronization between the readers and the main thread. Flink also provides a useful extension to +this class for most cases: [SingleThreadMultiplexSourceReaderBase](https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/connector/base/source/reader/SingleThreadMultiplexSourceReaderBase.html). This class avoids having to +specify the threading model as it is already configured to read splits with one thread using one +[SplitReader](https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/connector/base/source/reader/splitreader/SplitReader.html). This is not harmful to the performance of the source as the SourceReaders run in +parallel among the task managers. Review Comment: done ########## docs/content/posts/2023-04-13-howto-create-batch-source.md: ########## @@ -0,0 +1,221 @@ +--- +title: "Howto create a batch source with the new Source framework" +date: "2023-04-13T08:00:00.000Z" +authors: +- echauchot: + name: "Etienne Chauchot" + twitter: "echauchot" +aliases: +- /2023/04/13/howto-create-batch-source.html +--- + +## Introduction + +Flink community has +designed [a new Source framework](https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/datastream/sources/) +based +on [FLIP-27](https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface) +lately. Some connectors have migrated to this new framework. This article is an howto create a batch +source using this new framework. It was built while implementing +the [Flink batch source](https://github.com/apache/flink-connector-cassandra/commit/72e3bef1fb9ee6042955b5e9871a9f70a8837cca) +for [Cassandra](https://cassandra.apache.org/_/index.html). I felt it could be useful to people +interested in contributing or migrating connectors. + +## Implementing the source components + +The aim here is not to duplicate the official documentation. For details you should read the +documentation, the javadocs or the Cassandra connector code. The links are above. The goal here is +to give field feedback on how to implement the different components. + +The source architecture is depicted in the diagrams below: + + + + + +### Source +[example Cassandra Source](https://github.com/apache/flink-connector-cassandra/blob/d92dc8d891098a9ca6a7de6062b4630079beaaef/flink-connector-cassandra/src/main/java/org/apache/flink/connector/cassandra/source/CassandraSource.java) + +The source interface only does the "glue" between all the other components. Its role is to +instantiate all of them and to define the source [Boundedness](https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/api/connector/source/Boundedness.html). We also do the source configuration +here along with user configuration validation. + +### SourceReader +[example Cassandra SourceReader](https://github.com/apache/flink-connector-cassandra/blob/d92dc8d891098a9ca6a7de6062b4630079beaaef/flink-connector-cassandra/src/main/java/org/apache/flink/connector/cassandra/source/reader/CassandraSourceReader.java) + +As shown in the graphic above, the instances of the [SourceReader](https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/api/connector/source/SourceReader.html) (which we will call simply readers +in the continuation of this article) run in parallel in task managers to read the actual data which +is divided into [Splits](#split-and-splitstate). Readers request splits from the [SplitEnumerator](#splitenumerator-and-splitenumeratorstate) and the resulting splits are +assigned to them in return. + +Luckily for us Flink provides the [SourceReaderBase](https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/connector/base/source/reader/SourceReaderBase.html) implementation that takes care of the +synchronization between the readers and the main thread. Flink also provides a useful extension to +this class for most cases: [SingleThreadMultiplexSourceReaderBase](https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/connector/base/source/reader/SingleThreadMultiplexSourceReaderBase.html). This class avoids having to +specify the threading model as it is already configured to read splits with one thread using one Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
