[GitHub] [flink] zentol commented on a diff in pull request #20757: [FLINK-27919] Add FLIP-27-based source for data generation (FLIP-238)

GitBox Thu, 13 Oct 2022 07:08:05 -0700


zentol commented on code in PR #20757:
URL: https://github.com/apache/flink/pull/20757#discussion_r994673018



##########
docs/content/docs/connectors/datastream/datagen.md:
##########
@@ -0,0 +1,115 @@
+---
+title: DataGen
+weight: 3
+type: docs
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# DataGen Connector
+
+The DataGen connector provides a `Source` implementation that allows for generating input data for
+Flink pipelines.
+It is useful when developing locally or demoing without access to external systems such as Kafka.
+The DataGen connector is built-in, no additional dependencies are required.
+
+Usage
+-----
+
+The `DataGeneratorSource` produces N data points in parallel. The source splits the sequence
+into as many parallel sub-sequences as there are parallel source subtasks. It drives the data
+generation process by supplying "index" values of type Long to the user-provided

Review Comment:
   ```suggestion
   generation process by supplying "index" values of type `Long` to the user-provided
   ```
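
For context on the paragraph under review, the splitting of the index sequence can be sketched in plain Java without any Flink dependency. This is only an illustration: the class `SequenceSplitSketch`, the method `split`, and the choice of contiguous, near-equal ranges are assumptions made here, not the connector's actual split assignment.

```java
import java.util.ArrayList;
import java.util.List;

final class SequenceSplitSketch {

    /** Splits the indices 0..count-1 into contiguous half-open ranges {start, end}. */
    static List<long[]> split(long count, int parallelism) {
        if (parallelism <= 0) {
            throw new IllegalArgumentException("parallelism must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        long base = count / parallelism;      // minimum number of indices per subtask
        long remainder = count % parallelism; // the first `remainder` subtasks get one extra
        long start = 0;
        for (int subtask = 0; subtask < parallelism; subtask++) {
            long size = base + (subtask < remainder ? 1 : 0);
            ranges.add(new long[] {start, start + size});
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Hypothetical setup: 1000 indices spread over 3 subtasks.
        for (long[] range : split(1000, 3)) {
            System.out.println("[" + range[0] + ", " + range[1] + ")");
        }
    }
}
```

Each subtask would then feed the `Long` indices of its own range to the user-provided function, which is why ordering holds only within a sub-sequence.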



##########
docs/content/docs/connectors/datastream/datagen.md:
##########
@@ -0,0 +1,115 @@
+---
+title: DataGen
+weight: 3
+type: docs
+---
+
+# DataGen Connector
+
+The DataGen connector provides a `Source` implementation that allows for generating input data for
+Flink pipelines.
+It is useful when developing locally or demoing without access to external systems such as Kafka.
+The DataGen connector is built-in, no additional dependencies are required.
+
+Usage
+-----
+
+The `DataGeneratorSource` produces N data points in parallel. The source splits the sequence
+into as many parallel sub-sequences as there are parallel source subtasks. It drives the data
+generation process by supplying "index" values of type Long to the user-provided
+{{< javadoc name="GeneratorFunction" file="org/apache/flink/connector/datagen/source/GeneratorFunction.html" >}}.
+
+The `GeneratorFunction` is then used for mapping the (sub-)sequences of Long values

Review Comment:
   ```suggestion
   The `GeneratorFunction` is then used for mapping the (sub-)sequences of `Long` values
   ```



##########
docs/content/docs/connectors/datastream/datagen.md:
##########
@@ -0,0 +1,115 @@
+---
+title: DataGen
+weight: 3
+type: docs
+---
+
+# DataGen Connector
+
+The DataGen connector provides a `Source` implementation that allows for generating input data for
+Flink pipelines.
+It is useful when developing locally or demoing without access to external systems such as Kafka.
+The DataGen connector is built-in, no additional dependencies are required.
+
+Usage
+-----
+
+The `DataGeneratorSource` produces N data points in parallel. The source splits the sequence
+into as many parallel sub-sequences as there are parallel source subtasks. It drives the data
+generation process by supplying "index" values of type Long to the user-provided
+{{< javadoc name="GeneratorFunction" file="org/apache/flink/connector/datagen/source/GeneratorFunction.html" >}}.
+
+The `GeneratorFunction` is then used for mapping the (sub-)sequences of Long values
+into the generated events of an arbitrary data type. For instance, the following code will produce the sequence of
+`["Number: 0", "Number: 2", ... , "Number: 999"]` records.

Review Comment:
   ```suggestion
   `["Number: 0", "Number: 1", ... , "Number: 999"]` records.
   ```
   The second element should be `"Number: 1"`; jumping from `0` straight to `2` is not intuitive.



##########
docs/content/docs/connectors/datastream/datagen.md:
##########
@@ -0,0 +1,115 @@
+---
+title: DataGen
+weight: 3
+type: docs
+---
+
+# DataGen Connector
+
+The DataGen connector provides a `Source` implementation that allows for generating input data for
+Flink pipelines.
+It is useful when developing locally or demoing without access to external systems such as Kafka.
+The DataGen connector is built-in, no additional dependencies are required.
+
+Usage
+-----
+
+The `DataGeneratorSource` produces N data points in parallel. The source splits the sequence
+into as many parallel sub-sequences as there are parallel source subtasks. It drives the data
+generation process by supplying "index" values of type Long to the user-provided
+{{< javadoc name="GeneratorFunction" file="org/apache/flink/connector/datagen/source/GeneratorFunction.html" >}}.
+
+The `GeneratorFunction` is then used for mapping the (sub-)sequences of Long values
+into the generated events of an arbitrary data type. For instance, the following code will produce the sequence of
+`["Number: 0", "Number: 2", ... , "Number: 999"]` records.
+
+```java
+GeneratorFunction<Long, String> generatorFunction = index -> "Number: " + index;
+
+DataGeneratorSource<String> source =
+        new DataGeneratorSource<>(generatorFunction, 1000, Types.STRING);

Review Comment:
   Move `1000` into a separate variable for clarity; that should explain why the sequence ends at `999`.
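
A minimal plain-Java sketch of that point, deliberately free of Flink dependencies so it runs on its own. The names `NamedCountSketch`, `generate`, and `numberOfRecords` are hypothetical and chosen only for illustration; the function mirrors the `index -> "Number: " + index` lambda from the snippet under review.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

final class NamedCountSketch {

    /** Applies the generator to the indices 0 (inclusive) to numberOfRecords (exclusive). */
    static List<String> generate(Function<Long, String> generator, long numberOfRecords) {
        return LongStream.range(0, numberOfRecords)
                .boxed()
                .map(generator)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Naming the count makes it visible that indices run from 0 to numberOfRecords - 1.
        long numberOfRecords = 1000;
        List<String> records = generate(index -> "Number: " + index, numberOfRecords);
        System.out.println(records.get(0));
        System.out.println(records.get(records.size() - 1));
    }
}
```

With the count named, a reader can see at a glance that 1000 records starting at index 0 end at index 999.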



##########
docs/content/docs/connectors/datastream/datagen.md:
##########
@@ -0,0 +1,115 @@
+---
+title: DataGen
+weight: 3
+type: docs
+---
+
+# DataGen Connector
+
+The DataGen connector provides a `Source` implementation that allows for generating input data for
+Flink pipelines.
+It is useful when developing locally or demoing without access to external systems such as Kafka.
+The DataGen connector is built-in, no additional dependencies are required.
+
+Usage
+-----
+
+The `DataGeneratorSource` produces N data points in parallel. The source splits the sequence
+into as many parallel sub-sequences as there are parallel source subtasks. It drives the data
+generation process by supplying "index" values of type Long to the user-provided
+{{< javadoc name="GeneratorFunction" file="org/apache/flink/connector/datagen/source/GeneratorFunction.html" >}}.
+
+The `GeneratorFunction` is then used for mapping the (sub-)sequences of Long values
+into the generated events of an arbitrary data type. For instance, the following code will produce the sequence of
+`["Number: 0", "Number: 2", ... , "Number: 999"]` records.
+
+```java
+GeneratorFunction<Long, String> generatorFunction = index -> "Number: " + index;
+
+DataGeneratorSource<String> source =
+        new DataGeneratorSource<>(generatorFunction, 1000, Types.STRING);
+
+DataStreamSource<String> stream =
+        env.fromSource(source,
+        WatermarkStrategy.noWatermarks(),
+        "Generator Source");
+```
+
+The order of elements depends on the parallelism. Each sub-sequence will be produced in order.
+Consequently, if the parallelism is limited to one, this will produce one sequence in order from
+`"Number: 0"` to `"Number: 999"`.
+
+`DataGeneratorSource` has built-in support for rate limiting. The following code will produce an
+effectively unbounded (`Long.MAX_VALUE` from a practical perspective will never be reached) stream of
+Long values at the overall source rate (across all source subtasks) not exceeding 100 events per second.
+
+```java
+GeneratorFunction<Long, Long> generatorFunction = index -> index;
+
+DataGeneratorSource<String> source =
+        new DataGeneratorSource<>(
+             generatorFunctionStateless,
+             Long.MAX_VALUE,
+             RateLimiterStrategy.perSecond(100),
+             Types.STRING);
+```
+
+The source also allows for producing specific elements between the checkpoint boundaries using the
+corresponding
+{{< javadoc name="RateLimiterStrategy" file="org/apache/flink/api/connector/source/util/ratelimit/RateLimiterStrategy.html">}}.
+This is particularly useful for testing scenarios where certain output
+is expected to be produced upon checkpoint completions. The below snippet illustrates an example of
+producing the sequence of elements `"a","b", .. ,"j"` repeatedly between checkpoints:
+
+```java
+StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
+env.enableCheckpointing(3000);
+env.setRuntimeMode(RuntimeExecutionMode.STREAMING);

Review Comment:
   Is this important?



##########
docs/content/docs/connectors/datastream/datagen.md:
##########
@@ -0,0 +1,115 @@
+---
+title: DataGen
+weight: 3
+type: docs
+---
+
+# DataGen Connector
+
+The DataGen connector provides a `Source` implementation that allows for generating input data for
+Flink pipelines.
+It is useful when developing locally or demoing without access to external systems such as Kafka.
+The DataGen connector is built-in, no additional dependencies are required.
+
+Usage
+-----
+
+The `DataGeneratorSource` produces N data points in parallel. The source splits the sequence
+into as many parallel sub-sequences as there are parallel source subtasks. It drives the data
+generation process by supplying "index" values of type Long to the user-provided
+{{< javadoc name="GeneratorFunction" file="org/apache/flink/connector/datagen/source/GeneratorFunction.html" >}}.
+
+The `GeneratorFunction` is then used for mapping the (sub-)sequences of Long values
+into the generated events of an arbitrary data type. For instance, the following code will produce the sequence of
+`["Number: 0", "Number: 2", ... , "Number: 999"]` records.
+
+```java
+GeneratorFunction<Long, String> generatorFunction = index -> "Number: " + index;
+
+DataGeneratorSource<String> source =
+        new DataGeneratorSource<>(generatorFunction, 1000, Types.STRING);
+
+DataStreamSource<String> stream =
+        env.fromSource(source,
+        WatermarkStrategy.noWatermarks(),
+        "Generator Source");
+```
+
+The order of elements depends on the parallelism. Each sub-sequence will be produced in order.
+Consequently, if the parallelism is limited to one, this will produce one sequence in order from
+`"Number: 0"` to `"Number: 999"`.
+
+`DataGeneratorSource` has built-in support for rate limiting. The following code will produce an
+effectively unbounded (`Long.MAX_VALUE` from a practical perspective will never be reached) stream of
+Long values at the overall source rate (across all source subtasks) not exceeding 100 events per second.
+
+```java
+GeneratorFunction<Long, Long> generatorFunction = index -> index;
+
+DataGeneratorSource<String> source =
+        new DataGeneratorSource<>(
+             generatorFunctionStateless,
+             Long.MAX_VALUE,
+             RateLimiterStrategy.perSecond(100),
+             Types.STRING);
+```
+
+The source also allows for producing specific elements between the checkpoint boundaries using the

Review Comment:
   I think I'd first explain that you can also apply a per-checkpoint rate limit, and then talk about deterministic checkpointing content.
   
   I'm wondering though if having the exact same input per checkpoint is interesting in the first place.
   
   Actually, maybe remove the deterministic checkpointing bit. Technically it isn't guaranteed to be deterministic; only if we attempt to emit more (or exactly as many) values than the rate allows per checkpoint, but if the checkpoint is triggered early then it's no longer deterministic.
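
The argument can be made concrete with a toy model in plain Java. It is not how Flink implements per-checkpoint limiting; `PerCheckpointSketch`, `emittedPerInterval`, and the idea of a fixed "produced before trigger" count per interval are assumptions made purely to illustrate the reasoning.

```java
import java.util.ArrayList;
import java.util.List;

final class PerCheckpointSketch {

    /**
     * Toy model: in each checkpoint interval the source emits at most
     * limitPerCheckpoint elements, and never more than it managed to
     * produce before the checkpoint was triggered.
     */
    static List<Integer> emittedPerInterval(int limitPerCheckpoint, int[] producedBeforeTrigger) {
        List<Integer> emitted = new ArrayList<>();
        for (int produced : producedBeforeTrigger) {
            emitted.add(Math.min(limitPerCheckpoint, produced));
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Hypothetical run: limit of 10 per checkpoint; the third checkpoint
        // fires early, after only 4 elements were produced.
        System.out.println(emittedPerInterval(10, new int[] {25, 10, 4}));
    }
}
```

In this model the content between checkpoints is identical only while every interval is long enough to reach the limit; an early trigger yields a shorter batch.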



##########
docs/content/docs/connectors/datastream/datagen.md:
##########
@@ -0,0 +1,115 @@
+---
+title: DataGen
+weight: 3
+type: docs
+---
+
+# DataGen Connector
+
+The DataGen connector provides a `Source` implementation that allows for generating input data for
+Flink pipelines.
+It is useful when developing locally or demoing without access to external systems such as Kafka.
+The DataGen connector is built-in, no additional dependencies are required.
+
+Usage
+-----
+
+The `DataGeneratorSource` produces N data points in parallel. The source splits the sequence
+into as many parallel sub-sequences as there are parallel source subtasks. It drives the data
+generation process by supplying "index" values of type Long to the user-provided
+{{< javadoc name="GeneratorFunction" file="org/apache/flink/connector/datagen/source/GeneratorFunction.html" >}}.
+
+The `GeneratorFunction` is then used for mapping the (sub-)sequences of Long values
+into the generated events of an arbitrary data type. For instance, the following code will produce the sequence of
+`["Number: 0", "Number: 2", ... , "Number: 999"]` records.
+
+```java
+GeneratorFunction<Long, String> generatorFunction = index -> "Number: " + index;
+
+DataGeneratorSource<String> source =
+        new DataGeneratorSource<>(generatorFunction, 1000, Types.STRING);
+
+DataStreamSource<String> stream =
+        env.fromSource(source,
+        WatermarkStrategy.noWatermarks(),
+        "Generator Source");
+```
+
+The order of elements depends on the parallelism. Each sub-sequence will be produced in order.
+Consequently, if the parallelism is limited to one, this will produce one sequence in order from
+`"Number: 0"` to `"Number: 999"`.
+
+`DataGeneratorSource` has built-in support for rate limiting. The following code will produce an
+effectively unbounded (`Long.MAX_VALUE` from a practical perspective will never be reached) stream of

Review Comment:
   The boundedness should be explained separately, and not in passing while explaining rate-limiting.



##########
docs/content/docs/connectors/datastream/datagen.md:
##########
@@ -0,0 +1,115 @@
+---
+title: DataGen
+weight: 3
+type: docs
+---
+
+# DataGen Connector
+
+The DataGen connector provides a `Source` implementation that allows for generating input data for
+Flink pipelines.
+It is useful when developing locally or demoing without access to external systems such as Kafka.
+The DataGen connector is built-in, no additional dependencies are required.
+
+Usage
+-----
+
+The `DataGeneratorSource` produces N data points in parallel. The source splits the sequence
+into as many parallel sub-sequences as there are parallel source subtasks. It drives the data
+generation process by supplying "index" values of type Long to the user-provided
+{{< javadoc name="GeneratorFunction" file="org/apache/flink/connector/datagen/source/GeneratorFunction.html" >}}.
+
+The `GeneratorFunction` is then used for mapping the (sub-)sequences of Long values
+into the generated events of an arbitrary data type. For instance, the following code will produce the sequence of
+`["Number: 0", "Number: 2", ... , "Number: 999"]` records.
+
+```java
+GeneratorFunction<Long, String> generatorFunction = index -> "Number: " + index;
+
+DataGeneratorSource<String> source =
+        new DataGeneratorSource<>(generatorFunction, 1000, Types.STRING);
+
+DataStreamSource<String> stream =
+        env.fromSource(source,
+        WatermarkStrategy.noWatermarks(),
+        "Generator Source");
+```
+
+The order of elements depends on the parallelism. Each sub-sequence will be produced in order.
+Consequently, if the parallelism is limited to one, this will produce one sequence in order from
+`"Number: 0"` to `"Number: 999"`.
+
+`DataGeneratorSource` has built-in support for rate limiting. The following code will produce an
+effectively unbounded (`Long.MAX_VALUE` from a practical perspective will never be reached) stream of
+Long values at the overall source rate (across all source subtasks) not exceeding 100 events per second.
+
+```java
+GeneratorFunction<Long, Long> generatorFunction = index -> index;
+
+DataGeneratorSource<String> source =
+        new DataGeneratorSource<>(
+             generatorFunctionStateless,
+             Long.MAX_VALUE,
+             RateLimiterStrategy.perSecond(100),
+             Types.STRING);
+```
+
+The source also allows for producing specific elements between the checkpoint boundaries using the
+corresponding
+{{< javadoc name="RateLimiterStrategy" file="org/apache/flink/api/connector/source/util/ratelimit/RateLimiterStrategy.html">}}.
+This is particularly useful for testing scenarios where certain output
+is expected to be produced upon checkpoint completions. The below snippet illustrates an example of
+producing the sequence of elements `"a","b", .. ,"j"` repeatedly between checkpoints:
+
+```java
+StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
+env.enableCheckpointing(3000);
+env.setRuntimeMode(RuntimeExecutionMode.STREAMING);
+env.setParallelism(1);

Review Comment:
   We must clarify in the docs that setting the parallelism to 1 is important here.



##########
docs/content/docs/connectors/datastream/datagen.md:
##########
@@ -0,0 +1,115 @@
+---
+title: DataGen
+weight: 3
+type: docs
+---
+
+# DataGen Connector
+
+The DataGen connector provides a `Source` implementation that allows for generating input data for
+Flink pipelines.
+It is useful when developing locally or demoing without access to external systems such as Kafka.
+The DataGen connector is built-in, no additional dependencies are required.
+
+Usage
+-----
+
+The `DataGeneratorSource` produces N data points in parallel. The source splits the sequence
+into as many parallel sub-sequences as there are parallel source subtasks. It drives the data
+generation process by supplying "index" values of type Long to the user-provided
+{{< javadoc name="GeneratorFunction" file="org/apache/flink/connector/datagen/source/GeneratorFunction.html" >}}.
+
+The `GeneratorFunction` is then used for mapping the (sub-)sequences of Long values
+into the generated events of an arbitrary data type. For instance, the following code will produce the sequence of
+`["Number: 0", "Number: 2", ... , "Number: 999"]` records.
+
+```java
+GeneratorFunction<Long, String> generatorFunction = index -> "Number: " + index;
+
+DataGeneratorSource<String> source =
+        new DataGeneratorSource<>(generatorFunction, 1000, Types.STRING);
+
+DataStreamSource<String> stream =
+        env.fromSource(source,
+        WatermarkStrategy.noWatermarks(),
+        "Generator Source");
+```
+
+The order of elements depends on the parallelism. Each sub-sequence will be produced in order.
+Consequently, if the parallelism is limited to one, this will produce one sequence in order from
+`"Number: 0"` to `"Number: 999"`.
+
+`DataGeneratorSource` has built-in support for rate limiting. The following code will produce an

Review Comment:
   Let's add a dedicated "Rate limiting" section.
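
Such a section could spell out what "overall source rate (across all source subtasks)" means in practice. A rough plain-Java sketch follows, assuming the overall budget is divided evenly among the subtasks; that even split, and the names `RateSplitSketch` and `perSubtaskRate`, are assumptions for illustration rather than a statement of how `RateLimiterStrategy` is implemented.

```java
final class RateSplitSketch {

    /** Share of the overall records-per-second budget that one subtask may emit. */
    static double perSubtaskRate(double overallRecordsPerSecond, int parallelism) {
        if (parallelism <= 0) {
            throw new IllegalArgumentException("parallelism must be positive");
        }
        return overallRecordsPerSecond / parallelism;
    }

    public static void main(String[] args) {
        // Hypothetical setup: an overall cap of 100 events per second, 4 source subtasks.
        System.out.println(perSubtaskRate(100, 4));
    }
}
```

Under that assumption the sum over all subtasks never exceeds the configured overall rate, regardless of the parallelism.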



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
