Yes, both of your examples help explain the use of the CSV header parsing. I think I have a much better understanding of the new framework now, thanks to Bryan and Matt. Mission accomplished!
On Fri, May 19, 2017 at 7:04 PM, Bryan Bende <[email protected]> wrote:

> When a reader produces a record it attaches the schema it used to the record, but we currently don't have a way for a writer to use that schema when writing a record, although I think we do want to support that... something like a "Use Schema in Record" option as a choice in the 'Schema Access Strategy' of writers.
>
> For now, when a processor uses a reader and a writer, and you also want to read and write with the same schema, then you would still have to define the same schema for the writer to use, even if you had a CSV reader that inferred the schema from the headers.
>
> There are some processors that only use a reader, like PutDatabaseRecord, where using the CSV header would still be helpful.
>
> There are also a lot of cases where you would write with a different schema than you read with, so using the CSV header for reading is still helpful in those cases too.
>
> Hopefully I am making sense and not confusing things more.
>
>
> On Fri, May 19, 2017 at 1:27 PM, Joe Gresock <[email protected]> wrote:
> > Matt,
> >
> > Great response, this does help explain a lot. Reading through your post made me realize I didn't understand the AvroSchemaRegistry. I had been thinking it was something that NiFi processors populated, but I re-read its usage description and it does indeed say to use dynamic properties for the schema name / value. In that case, I can definitely see how this is not dynamic in the sense of inferring any schemas on the flow. It makes me wonder if there would be a way to populate the schema registry from flow files. When I first glanced at the processors, I had assumed that when the schema was inferred from the CSV headers, it would create an entry in the AvroSchemaRegistry, provided you filled in the correct properties. Clearly this is not how it works.
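Bryan's point about a reader attaching its schema to each record, and a hypothetical "Use Schema in Record" option letting the writer fall back to it, can be sketched in plain Python. This is purely illustrative, not NiFi code; the Record class, field names, and sample schema are all assumptions:

```python
import json

class Record:
    """A parsed record that carries the schema the reader used (as NiFi
    readers do when they produce records)."""
    def __init__(self, values, schema):
        self.values = values
        self.schema = schema  # attached by the reader

def write_record(record, writer_schema=None):
    """Writer sketch: today a writer needs its own configured schema; a
    'Use Schema in Record' option would reuse record.schema instead."""
    schema = writer_schema if writer_schema is not None else record.schema
    field_names = [f["name"] for f in schema["fields"]]
    return json.dumps({name: record.values.get(name) for name in field_names})

# Hypothetical all-string schema, as a CSV-header-based reader might infer it.
inferred = {"type": "record", "name": "row",
            "fields": [{"name": "id", "type": "string"},
                       {"name": "city", "type": "string"}]}
rec = Record({"id": "1", "city": "Boston"}, inferred)
print(write_record(rec))  # no writer schema: reuse the one in the record
```

With no `writer_schema`, the record's own schema drives the output; passing an explicit `writer_schema` models today's behavior, where reader and writer schemas are configured separately.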
> >
> > Just so I understand, does your first paragraph mean that even if you use the CSV headers to determine the schema, you still can't use the *Record processors unless you manually register a matching schema in the schema registry, or otherwise somehow set the schema in an attribute? In this case, it almost seems like inferring the schema from the CSV headers serves no purpose, and I don't see how NIFI-3921 would alleviate that (it only appears to address Avro flow files with an embedded schema).
> >
> > Based on this understanding, I was able to successfully get the following flow working: InferAvroSchema -> QueryRecord.
> >
> > QueryRecord uses CSVReader with "Use Schema Text Property" and Schema Text set to ${inferred.avro.schema} (which is populated by the InferAvroSchema processor). It also uses JsonRecordSetWriter with a similar configuration. I could attach a template, but I don't know the best way to do that on the list.
> >
> > Joe
> >
> > On Fri, May 19, 2017 at 4:59 PM, Matt Burgess <[email protected]> wrote:
> >> Joe,
> >>
> >> Using the CSV headers to determine the schema is currently the only "dynamic" schema strategy, so it will be tricky to use with the other Readers/Writers and associated processors (which require an explicit schema). This should be alleviated with NIFI-3921 [1]. For this first release, I believe the approach would be to identify the various schemas for your incoming/outgoing data, create a Schema Registry with all of them, then configure the various Record Readers/Writers to use those.
> >>
> >> There were some issues during development related to using the incoming vs. outgoing schema for various record operations; if QueryRecord seems to be using the output schema for querying then it is likely a bug.
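As a rough plain-Python analogue of what Joe's working flow does (not NiFi code; the sample CSV and column names are made up), the CSV-to-JSON record conversion in QueryRecord amounts to reading rows with the header as field names and writing the record set back out as JSON:

```python
import csv
import io
import json

# Hypothetical flow-file content; the real flow reads the FlowFile.
csv_text = "name,team\nalice,red\nbob,blue\n"

# CSVReader side: the header row supplies the field names, and every
# field is treated as a string (as with "Use String Fields From Header").
rows = list(csv.DictReader(io.StringIO(csv_text)))

# JsonRecordSetWriter side: emit the record set as a JSON array.
json_out = json.dumps(rows)
print(json_out)
```

The reader and writer here share the schema implicitly; in NiFi 1.2 the writer needs that same schema configured explicitly, which is why Joe points both services at ${inferred.avro.schema}.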
> >> However, in this case it might just be that you need an explicit schema for your Writer that matches the input schema (which is inferred from the CSV header). The CSV header inference currently assumes all fields are Strings, so a nominal schema would have the same number of fields as columns, each with type String. If you don't know the number of columns and/or the column names are dynamic per CSV file, I believe we have a gap here (for now).
> >>
> >> I thought of maybe having an InferRecordSchema processor that would fill in the avro.text attribute for use in various downstream record readers/writers, but inferring schemas in general is not a trivial task. An easier interim solution might be to have an AddSchemaAsAttribute processor, which takes a Reader to parse the records and determine the schema (whether dynamic or static), sets the avro.text attribute on the original incoming flow file, and then transfers the original flow file. This would require two reads, one by AddSchemaAsAttribute and one by the downstream record processor, but it should be fairly easy to implement. Then again, since new features would go into 1.3.0, hopefully NIFI-3921 will be implemented by then, rendering all this moot :)
> >>
> >> Regards,
> >> Matt
> >>
> >> [1] https://issues.apache.org/jira/browse/NIFI-3921
> >>
> >> On Fri, May 19, 2017 at 12:25 PM, Joe Gresock <[email protected]> wrote:
> >> > I've tried a couple of different configurations of CSVReader / JsonRecordSetWriter with the QueryRecord processor, and I don't think I quite have the usage down yet.
> >> >
> >> > Can someone give a basic example configuration in the following 2 scenarios? I followed most of Matt Burgess's response to the post titled "How to use ConvertRecord Processor", but I don't think it tells the whole story.
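The nominal all-String schema Matt describes (one field per CSV column, every field typed as string) can be sketched like this. The helper is hypothetical, not an actual NiFi processor or API; record and column names are illustrative:

```python
import json

def nominal_schema_from_header(header_line, record_name="nifi_record"):
    """Build the nominal Avro schema for a CSV header line: one field
    per column, every field typed as string."""
    fields = [{"name": col.strip(), "type": "string"}
              for col in header_line.split(",")]
    return json.dumps({"type": "record", "name": record_name,
                       "fields": fields})

# Such a string is what a hypothetical AddSchemaAsAttribute processor
# could set as a flow-file attribute for downstream readers/writers.
schema_text = nominal_schema_from_header("id,first_name,last_name")
print(schema_text)
```

Pasting a schema like this as the writer's Schema Text is the interim workaround Matt suggests when the reader infers its schema from the header.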
> >> >
> >> > 1) QueryRecord, converting CSV to JSON, using only the CSV headers to determine the schema. (I tried selecting "Use String Fields From Header" in CSVReader, but the processor really seems to want to use the JsonRecordSetWriter to determine the schema.)
> >> >
> >> > 2) QueryRecord, converting CSV to JSON, using a cached Avro schema. I imagine I need to use InferAvroSchema here, but I'm not sure how to cache it in the AvroSchemaRegistry. I'm also not quite sure how to configure the properties of each controller service in this case.
> >> >
> >> > Any help would be appreciated.
> >> >
> >> > Joe
> >> >
> >> > --
> >> > I know what it is to be in need, and I know what it is to have plenty. I have learned the secret of being content in any and every situation, whether well fed or hungry, whether living in plenty or in want. I can do all this through him who gives me strength. *-Philippians 4:12-13*
