Github user mattyb149 commented on a diff in the pull request:
https://github.com/apache/nifi/pull/2813#discussion_r199576036
--- Diff:
nifi-nar-bundles/nifi-data-generation-bundle/nifi-data-generation-processors/src/main/java/org/apache/nifi/processors/generation/GenerateRecord.java
---
@@ -0,0 +1,209 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nifi.processors.generation;
+
+import io.confluent.avro.random.generator.Generator;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.nifi.annotation.behavior.InputRequirement;
+import org.apache.nifi.annotation.lifecycle.OnScheduled;
+import org.apache.nifi.avro.AvroTypeUtil;
+import org.apache.nifi.components.PropertyDescriptor;
+import org.apache.nifi.components.Validator;
+import org.apache.nifi.expression.ExpressionLanguageScope;
+import org.apache.nifi.flowfile.FlowFile;
+import org.apache.nifi.processor.AbstractProcessor;
+import org.apache.nifi.processor.ProcessContext;
+import org.apache.nifi.processor.ProcessSession;
+import org.apache.nifi.processor.Relationship;
+import org.apache.nifi.processor.exception.ProcessException;
+import org.apache.nifi.processor.util.StandardValidators;
+import org.apache.nifi.schema.access.SchemaNotFoundException;
+import org.apache.nifi.serialization.RecordSetWriter;
+import org.apache.nifi.serialization.RecordSetWriterFactory;
+import org.apache.nifi.serialization.record.MapRecord;
+import org.apache.nifi.serialization.record.Record;
+import org.apache.nifi.serialization.record.RecordSchema;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Random;
+import java.util.Set;
+
+@InputRequirement(InputRequirement.Requirement.INPUT_ALLOWED)
+public class GenerateRecord extends AbstractProcessor {
+    static final PropertyDescriptor WRITER = new PropertyDescriptor.Builder()
+        .name("generate-record-writer")
+        .displayName("Record Writer")
+        .identifiesControllerService(RecordSetWriterFactory.class)
+        .description("The record writer to use for serializing generated records.")
+        .required(true)
+        .build();
+
+    static final PropertyDescriptor SCHEMA = new PropertyDescriptor.Builder()
--- End diff --
I appreciate that this library's DSL is a modified version of an Avro schema
definition, since users may need the actual Avro schema defined in the writer as
well. Does the writer support "Inherit Record Schema", so it can pick up the
schema generated by the library without the user having to specify it again?
I also appreciate the DSL's flexibility in generating data of different types,
lengths, patterns, etc. On the downside, it appears to be geared more toward
generating data in the desired structure than toward generating the desired data
and putting it into a structure, meaning there is no direct support for
generating emails, SSNs, phone numbers, etc. Those would have to be done via
regexes and/or provided values.
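For what it's worth, if I'm reading the library's README correctly, the
regex-based workaround would look something like the following in its DSL (via
`arg.properties`; the field names and regexes here are just illustrative, and
the exact quantifier support depends on the library's regex engine):

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {
      "name": "email",
      "type": {
        "type": "string",
        "arg.properties": {
          "regex": "[a-z]{4,8}@(gmail|yahoo)\\.com"
        }
      }
    },
    {
      "name": "ssn",
      "type": {
        "type": "string",
        "arg.properties": {
          "regex": "[0-9]{3}-[0-9]{2}-[0-9]{4}"
        }
      }
    }
  ]
}
```

So it's workable, just more effort than a library with first-class semantic
types would require.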
Also the library generates Avro which we are currently converting in every
case, which seems like an unnecessary step. At the least we may want to call
getMimeType() on the writer, if it is Avro and we are inheriting the schema
(versus defining it explicitly in the writer) we might be able to skip the
"conversion" and write directly to the flow file. Not sure how much of that is
available via the API, I'm just saying it's a bummer to have to convert the
generated records. What kind of throughput are you seeing when it runs at full
speed?
Did you vet other Java libraries for data generation?
[avro-mocker](https://github.com/speedment/avro-mocker) uses the actual output
schema, and its CLI asks questions about the generation strategy for each field;
I wonder if we could leverage that via a separate DSL or user-defined
properties. [JFairy](https://github.com/Devskiller/jfairy) is more concerned
with semantic datatypes, but it does not provide a DSL, so we would likely have
to build something similar to this library's Avro-schema-based DSL, or something
even simpler (if possible and prudent). All the libraries I looked at had
similar pros and cons, so if we stick with this one, I'm fine with that. It
would be nice to have more examples in the additional details, though, for email
addresses, IPs, phone numbers, etc.
---