Joe,

Using the CSV headers to determine the schema is currently the only
"dynamic" schema strategy, so it will be tricky to use with the other
Readers/Writers and associated processors (which require an explicit
schema). This should be alleviated by NIFI-3921 [1].  For this first
release, I believe the approach would be to identify the various
schemas for your incoming/outgoing data, create a Schema Registry with
all of them, then configure the various Record Readers/Writers to use
those.
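For example, an entry in the AvroSchemaRegistry is just Avro schema
text; a minimal entry (field names here are made up for illustration)
might look like:

```json
{
  "type": "record",
  "name": "purchase",
  "fields": [
    { "name": "id",     "type": "string" },
    { "name": "item",   "type": "string" },
    { "name": "price",  "type": "string" }
  ]
}
```

Each Reader/Writer would then point at the registry and reference the
schema by name.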

There were some issues during development related to using the
incoming vs. outgoing schema for various record operations; if
QueryRecord seems to be using the output schema for querying, then it
is likely a bug. However, in this case it might just be that you need
an explicit schema for your Writer that matches the input schema
(which is inferred from the CSV header). The CSV header inference
currently assumes all fields are Strings, so a nominal schema would
have the same number of fields as columns, each with type String. If
you don't know the number of columns and/or the column names are
dynamic per CSV file, I believe we have a gap here (for now).
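To make that inference concrete, here is a rough Python sketch (not
the NiFi implementation, just an illustration) of building an
all-Strings Avro schema from a CSV header, which is what such a
processor would compute before setting it as a flow-file attribute:

```python
import csv
import io
import json

def infer_all_strings_schema(csv_text, record_name="csv_record"):
    """Build Avro schema text from a CSV header, treating every
    column as a string, mirroring the current CSV header inference."""
    # Read only the first row (the header) of the CSV content.
    header = next(csv.reader(io.StringIO(csv_text)))
    # Emit one string-typed field per column, in header order.
    return json.dumps({
        "type": "record",
        "name": record_name,
        "fields": [{"name": col, "type": "string"} for col in header],
    })

print(infer_all_strings_schema("id,name,price\n1,apple,0.50\n"))
```

Note that every field comes out as "string" regardless of the actual
column contents, which is why numeric comparisons in QueryRecord may
need casts under this strategy.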

I thought of maybe having an InferRecordSchema processor that would
fill in the avro.schema attribute for use in various downstream record
readers/writers, but inferring schemas in general is not a trivial
task. An easier interim solution might be an AddSchemaAsAttribute
processor, which would take a Reader to parse the records and
determine the schema (whether dynamic or static), set the avro.schema
attribute on the original incoming flow file, and then transfer the
original flow file. This would require two reads, one by
AddSchemaAsAttribute and one by the downstream record processor, but
it should be fairly easy to implement. Then again, since new features
would go into 1.3.0, hopefully NIFI-3921 will be implemented by then,
rendering all this moot :)

Regards,
Matt

[1] https://issues.apache.org/jira/browse/NIFI-3921

On Fri, May 19, 2017 at 12:25 PM, Joe Gresock <[email protected]> wrote:
> I've tried a couple different configurations of CSVReader /
> JsonRecordSetWriter with the QueryRecord processor, and I don't think I
> quite have the usage down yet.
>
> Can someone give a basic example configuration in the following 2
> scenarios?  I followed most of Matt Burgess's response to the post titled
> "How to use ConvertRecord Processor", but I don't think it tells the whole
> story.
>
> 1) QueryRecord, converting CSV to JSON, using only the CSV headers to
> determine the schema.  (I tried selecting Use String Fields from Header in
> CSVReader, but the processor really seems to want to use the
> JsonRecordSetWriter to determine the schema)
>
> 2) QueryRecord, converting CSV to JSON, using a cached avro schema.  I
> imagine I need to use InferAvroSchema here, but I'm not sure how to cache
> it in the AvroSchemaRegistry.  Also not quite sure how to configure the
> properties of each controller service in this case.
>
> Any help would be appreciated.
>
> Joe
>
> --
> I know what it is to be in need, and I know what it is to have plenty.  I
> have learned the secret of being content in any and every situation,
> whether well fed or hungry, whether living in plenty or in want.  I can do
> all this through him who gives me strength.    *-Philippians 4:12-13*
