zentol commented on code in PR #20757: URL: https://github.com/apache/flink/pull/20757#discussion_r996305134
########## docs/content/docs/connectors/datastream/datagen.md: ########## @@ -0,0 +1,115 @@ +--- +title: DataGen +weight: 3 +type: docs +--- +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +# DataGen Connector + +The DataGen connector provides a `Source` implementation that allows for generating input data for +Flink pipelines. +It is useful when developing locally or demoing without access to external systems such as Kafka. +The DataGen connector is built-in, no additional dependencies are required. + +Usage +----- + +The `DataGeneratorSource` produces N data points in parallel. The source splits the sequence +into as many parallel sub-sequences as there are parallel source subtasks. It drives the data +generation process by supplying "index" values of type Long to the user-provided +{{< javadoc name="GeneratorFunction" file="org/apache/flink/connector/datagen/source/GeneratorFunction.html" >}}. + +The `GeneratorFunction` is then used for mapping the (sub-)sequences of Long values +into the generated events of an arbitrary data type. For instance, the following code will produce the sequence of +`["Number: 0", "Number: 2", ... , "Number: 999"]` records. + +```java +GeneratorFunction<Long, String> generatorFunction = index -> "Number: " + index; + +DataGeneratorSource<String> source = + new DataGeneratorSource<>(generatorFunction, 1000, Types.STRING); + +DataStreamSource<String> stream = + env.fromSource(source, + WatermarkStrategy.noWatermarks(), + "Generator Source"); +``` + +The order of elements depends on the parallelism. Each sub-sequence will be produced in order. +Consequently, if the parallelism is limited to one, this will produce one sequence in order from +`"Number: 0"` to `"Number: 999"`. + +`DataGeneratorSource` has built-in support for rate limiting. The following code will produce an +effectively unbounded (`Long.MAX_VALUE` from a practical perspective will never be reached) stream of +Long values at the overall source rate (across all source subtasks) not exceeding 100 events per second. + +```java +GeneratorFunction<Long, Long> generatorFunction = index -> index; + +DataGeneratorSource<String> source = + new DataGeneratorSource<>( + generatorFunctionStateless, + Long.MAX_VALUE, + RateLimiterStrategy.perSecond(100), + Types.STRING); +``` + +The source also allows for producing specific elements between the checkpoint boundaries using the Review Comment: The `ExternallyInducedSourceReader` route would add so much complexity that I'm not sure if we should pursue it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
