zentol commented on code in PR #20757:
URL: https://github.com/apache/flink/pull/20757#discussion_r995657759


##########
docs/content/docs/connectors/datastream/datagen.md:
##########
@@ -0,0 +1,115 @@
+---
+title: DataGen
+weight: 3
+type: docs
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# DataGen Connector
+
+The DataGen connector provides a `Source` implementation that allows for 
generating input data for 
+Flink pipelines.
+It is useful when developing locally or demoing without access to external 
systems such as Kafka.
+The DataGen connector is built-in, no additional dependencies are required.
+
+Usage
+-----
+
+The `DataGeneratorSource` produces N data points in parallel. The source 
splits the sequence 
+into as many parallel sub-sequences as there are parallel source subtasks. It 
drives the data 
+generation process by supplying "index" values of type Long to the 
user-provided 
+{{< javadoc name="GeneratorFunction" 
file="org/apache/flink/connector/datagen/source/GeneratorFunction.html" >}}.
+
+The `GeneratorFunction` is then used for mapping the (sub-)sequences of Long 
values
+into the generated events of an arbitrary data type. For instance, the 
following code will produce the sequence of
+`["Number: 0", "Number: 2", ... , "Number: 999"]` records.
+
+```java
+GeneratorFunction<Long, String> generatorFunction = index -> "Number: " + 
index;
+
+DataGeneratorSource<String> source =
+        new DataGeneratorSource<>(generatorFunction, 1000, Types.STRING);
+
+DataStreamSource<String> stream =
+        env.fromSource(source,
+        WatermarkStrategy.noWatermarks(),
+        "Generator Source");
+```
+
+The order of elements depends on the parallelism. Each sub-sequence will be 
produced in order.
+Consequently, if the parallelism is limited to one, this will produce one 
sequence in order from
+`"Number: 0"` to `"Number: 999"`.
+
+`DataGeneratorSource` has built-in support for rate limiting. The following 
code will produce an
+effectively unbounded (`Long.MAX_VALUE` from a practical perspective will 
never be reached) stream of
+Long values at the overall source rate (across all source subtasks) not 
exceeding 100 events per second.
+
+```java
+GeneratorFunction<Long, Long> generatorFunction = index -> index;
+
+DataGeneratorSource<String> source =
+        new DataGeneratorSource<>(
+             generatorFunctionStateless,
+             Long.MAX_VALUE,
+             RateLimiterStrategy.perSecond(100),
+             Types.STRING);
+```
+
+The source also allows for producing specific elements between the checkpoint 
boundaries using the 

Review Comment:
   > Do we actually have such examples?
   
   I haven't seen one; but they may exist :sweat_smile: 
   
   > Is this something that could be compared with the example that Steven 
shared 
[apache/iceberg/flink/source/BoundedTestSource.java#L70](https://github.com/apache/iceberg/blob/master/flink/v1.15/flink/src/test/java/org/apache/iceberg/flink/source/BoundedTestSource.java#L70)
 where data is emitted while holding the checkpointLock ?
   
   Exactly. There they control the _exact_ contents of each checkpoint.
   
   > It seems, though, that it should not prevent the new Source from being 
used as a replacement for the BoundedTestSource mentioned above with such 
reasonable settings, what do you think?
   
   Yes-ish; provided that the checkpoint interval is large enough there is a 
reasonably high change that it won't be an issue.
   
   Can the new sources even block checkpointing? I guess the only way to do 
that is by emitting multiple values in `pollNext`. But I'm not sure if there 
even is a guarantee that pollNext is called at least once between checkpoints.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to