Github user mattyb149 commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2813#discussion_r199576036
  
    --- Diff: nifi-nar-bundles/nifi-data-generation-bundle/nifi-data-generation-processors/src/main/java/org/apache/nifi/processors/generation/GenerateRecord.java ---
    @@ -0,0 +1,209 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.nifi.processors.generation;
    +
    +import io.confluent.avro.random.generator.Generator;
    +import org.apache.avro.Schema;
    +import org.apache.avro.generic.GenericData;
    +import org.apache.nifi.annotation.behavior.InputRequirement;
    +import org.apache.nifi.annotation.lifecycle.OnScheduled;
    +import org.apache.nifi.avro.AvroTypeUtil;
    +import org.apache.nifi.components.PropertyDescriptor;
    +import org.apache.nifi.components.Validator;
    +import org.apache.nifi.expression.ExpressionLanguageScope;
    +import org.apache.nifi.flowfile.FlowFile;
    +import org.apache.nifi.processor.AbstractProcessor;
    +import org.apache.nifi.processor.ProcessContext;
    +import org.apache.nifi.processor.ProcessSession;
    +import org.apache.nifi.processor.Relationship;
    +import org.apache.nifi.processor.exception.ProcessException;
    +import org.apache.nifi.processor.util.StandardValidators;
    +import org.apache.nifi.schema.access.SchemaNotFoundException;
    +import org.apache.nifi.serialization.RecordSetWriter;
    +import org.apache.nifi.serialization.RecordSetWriterFactory;
    +import org.apache.nifi.serialization.record.MapRecord;
    +import org.apache.nifi.serialization.record.Record;
    +import org.apache.nifi.serialization.record.RecordSchema;
    +
    +import java.util.ArrayList;
    +import java.util.Collections;
    +import java.util.HashSet;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.Random;
    +import java.util.Set;
    +
    +@InputRequirement(InputRequirement.Requirement.INPUT_ALLOWED)
    +public class GenerateRecord extends AbstractProcessor {
    +    static final PropertyDescriptor WRITER = new PropertyDescriptor.Builder()
    +        .name("generate-record-writer")
    +        .displayName("Record Writer")
    +        .identifiesControllerService(RecordSetWriterFactory.class)
    +        .description("The record writer to use for serializing generated records.")
    +        .required(true)
    +        .build();
    +
    +    static final PropertyDescriptor SCHEMA = new PropertyDescriptor.Builder()
    --- End diff --
    
    I appreciate that this library's DSL is a modified version of an Avro schema definition, since the user may need the actual Avro schema defined in the writer as well. Does the writer support "Inherit Record Schema", so it can just get the schema generated by the library without the user having to specify it?
    
    I also appreciate the flexibility of the DSL in being able to generate data of different types, lengths, patterns, etc. On the downside, it appears to be more about generating data in the desired structure than about generating the desired data and putting it into a structure, meaning there is no direct support for generating emails, SSNs, phone numbers, and the like; those would have to be done via regexes and/or lists of provided values.
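    For example, the workaround for an email-like field might look like this in the DSL (illustrative only; this is based on the `arg.properties` annotations that avro-random-generator documents, with made-up field names, regexes, and option values):

```json
{
  "type": "record",
  "name": "user",
  "fields": [
    {
      "name": "email",
      "type": {
        "type": "string",
        "arg.properties": {
          "regex": "[a-z]{4,8}@(example|test)\\.com"
        }
      }
    },
    {
      "name": "phone",
      "type": {
        "type": "string",
        "arg.properties": {
          "options": ["555-0100", "555-0101", "555-0102"]
        }
      }
    }
  ]
}
```

    In other words, the semantics of "email" or "phone number" live entirely in the regex or option list the user writes, not in the library.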
    
    Also, the library generates Avro which we are currently converting in every case, which seems like an unnecessary step. At the least we may want to call getMimeType() on the writer; if it is Avro and we are inheriting the schema (versus defining it explicitly in the writer), we might be able to skip the "conversion" and write the generated Avro directly to the flow file. I'm not sure how much of that is available via the API, I'm just saying it's a bummer to have to convert the generated records. What kind of throughput are you seeing when it runs at full speed?
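    The pass-through idea above boils down to a small check, something like the following (a minimal sketch; the mime type string and the "inherits schema" flag are assumptions for illustration, not the PR's actual API, and the real check would consult writer.getMimeType() and the writer's schema-access strategy):

```java
// Sketch of the proposed short-circuit for skipping record conversion.
public class AvroPassThroughCheck {

    // Assumed mime type reported by an Avro record writer.
    static final String AVRO_MIME_TYPE = "application/avro-binary";

    // The generated Avro bytes can go straight to the flow file only when the
    // writer would emit Avro with the same (inherited) schema; otherwise each
    // record must be deserialized and re-serialized by the configured writer.
    static boolean canSkipConversion(String writerMimeType, boolean writerInheritsSchema) {
        return AVRO_MIME_TYPE.equals(writerMimeType) && writerInheritsSchema;
    }
}
```

    The interesting part is plumbing this decision through the processor so the fast path bypasses the record API entirely; the check itself is cheap.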
    
    Did you vet other Java libraries for data generation? [avro-mocker](https://github.com/speedment/avro-mocker) uses the actual output schema, and its CLI then asks questions about the generation strategy for each field; I wonder if we could leverage that via a separate DSL or user-defined properties. [JFairy](https://github.com/Devskiller/jfairy) is more concerned with semantic datatypes, but it does not provide a DSL, so we would likely have to do something similar to this library in terms of an Avro-schema-based DSL, or something even simpler (if possible and prudent). All the libraries I looked at had similar pros/cons, so if we stick with this one I'm fine with that. It would be nice to have more examples in the additional details, though, for email addresses, IPs, phone numbers, etc.


