I've updated the JIRA description to cover not only embedded Avro schemas but also schemas derived from readers such as CSVReader. https://issues.apache.org/jira/browse/NIFI-3921
Thanks,
Koji

On Sat, May 20, 2017 at 4:14 AM, Joe Gresock <[email protected]> wrote:
> Yes, both of your examples help explain the use of the CSV header parsing.
>
> I think I have a much better understanding of the new framework now, thanks
> to Bryan and Matt. Mission accomplished!
>
> On Fri, May 19, 2017 at 7:04 PM, Bryan Bende <[email protected]> wrote:
>
>> When a reader produces a record it attaches the schema it used to the
>> record, but we currently don't have a way for a writer to use that schema
>> when writing a record, although I think we do want to support that...
>> something like a "Use Schema in Record" option as a choice in the
>> 'Schema Access Strategy' of writers.
>>
>> For now, when a processor uses a reader and a writer, and you also want
>> to read and write with the same schema, you would still have to define
>> the same schema for the writer to use, even if you had a CSV reader that
>> inferred the schema from the headers.
>>
>> There are some processors that only use a reader, like PutDatabaseRecord,
>> where using the CSV header would still be helpful.
>>
>> There are also a lot of cases where you would write with a different
>> schema than you read with, so using the CSV header for reading is still
>> helpful in those cases too.
>>
>> Hopefully I am making sense and not confusing things more.
>>
>> On Fri, May 19, 2017 at 1:27 PM, Joe Gresock <[email protected]> wrote:
>>> Matt,
>>>
>>> Great response, this does help explain a lot. Reading through your post
>>> made me realize I didn't understand the AvroSchemaRegistry. I had been
>>> thinking it was something that NiFi processors populated, but I re-read
>>> its usage description and it does indeed say to use dynamic properties
>>> for the schema name / value. In that case, I can definitely see how
>>> this is not dynamic in the sense of inferring any schemas on the flow.
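To illustrate Bryan's point above (the writer still needs an explicit schema matching whatever the CSV reader inferred from the headers), here is a minimal sketch of the all-string Avro schema text you could paste into a writer's "Schema Text" property. The column names id, name, and city are made up for the example, and the record name is arbitrary:

```python
import json

# Hypothetical example: suppose the incoming CSV has the header "id,name,city".
# Header-based inference treats every column as a string, so a matching
# writer schema declares the same columns, each with type "string".
writer_schema = {
    "type": "record",
    "name": "csv_record",  # the record name is arbitrary for this example
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "name", "type": "string"},
        {"name": "city", "type": "string"},
    ],
}

# This JSON text is what would go into the writer's "Schema Text" property.
schema_text = json.dumps(writer_schema, indent=2)
print(schema_text)
```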
>>> It makes me wonder if there would be a way to populate the schema
>>> registry from flow files. When I first glanced at the processors, I had
>>> assumed that when the schema was inferred from the CSV headers, it would
>>> create an entry in the AvroSchemaRegistry, provided you filled in the
>>> correct properties. Clearly this is not how it works.
>>>
>>> Just so I understand, does your first paragraph mean that even if you
>>> use the CSV headers to determine the schema, you still can't use the
>>> *Record processors unless you manually register a matching schema in the
>>> schema registry, or otherwise somehow set the schema in an attribute?
>>> In this case, it almost seems like inferring the schema from the CSV
>>> headers serves no purpose, and I don't see how NIFI-3921 would alleviate
>>> that (it only appears to address Avro flow files with an embedded
>>> schema).
>>>
>>> Based on this understanding, I was able to successfully get the
>>> following flow working: InferAvroSchema -> QueryRecord.
>>>
>>> QueryRecord uses CSVReader with "Use Schema Text Property" and Schema
>>> Text set to ${inferred.avro.schema} (which is populated by the
>>> InferAvroSchema processor). It also uses JsonRecordSetWriter with a
>>> similar configuration. I could attach a template, but I don't know the
>>> best way to do that on the list.
>>>
>>> Joe
>>>
>>> On Fri, May 19, 2017 at 4:59 PM, Matt Burgess <[email protected]> wrote:
>>>
>>>> Joe,
>>>>
>>>> Using the CSV headers to determine the schema is currently the only
>>>> "dynamic" schema strategy, so it will be tricky to use with the other
>>>> readers/writers and associated processors (which require an explicit
>>>> schema). This should be alleviated with NIFI-3921 [1].
>>>> For this first release, I believe the approach would be to identify the
>>>> various schemas for your incoming/outgoing data, create a schema
>>>> registry with all of them, and then configure the various record
>>>> readers/writers to use those.
>>>>
>>>> There were some issues during development related to using the incoming
>>>> vs. outgoing schema for various record operations; if QueryRecord seems
>>>> to be using the output schema for querying, then it is likely a bug.
>>>> However, in this case it might just be that you need an explicit schema
>>>> for your writer that matches the input schema (which is inferred from
>>>> the CSV header). The CSV header inference currently assumes all fields
>>>> are Strings, so a nominal schema would have the same number of fields
>>>> as columns, each with type String. If you don't know the number of
>>>> columns and/or the column names are dynamic per CSV file, I believe we
>>>> have a gap here (for now).
>>>>
>>>> I thought of maybe having an InferRecordSchema processor that would
>>>> fill in the avro.text attribute for use in various downstream record
>>>> readers/writers, but inferring schemas in general is not a trivial
>>>> task. An easier interim solution might be to have an
>>>> AddSchemaAsAttribute processor, which takes a reader to parse the
>>>> records and determine the schema (whether dynamic or static), sets the
>>>> avro.text attribute on the original incoming flow file, and then
>>>> transfers the original flow file. This would require two reads, one by
>>>> AddSchemaAsAttribute and one by the downstream record processor, but it
>>>> should be fairly easy to implement.
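The nominal schema Matt describes (one String field per CSV column) is straightforward to derive. Note that AddSchemaAsAttribute and the avro.text attribute are ideas floated in this email, not shipped NiFi features; the following Python sketch only approximates the derivation step they describe:

```python
import csv
import io
import json

def nominal_schema_from_header(csv_text, record_name="inferred"):
    """Build the nominal Avro schema implied by CSV header inference:
    one field per column, every field typed "string"."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return {
        "type": "record",
        "name": record_name,
        "fields": [{"name": col, "type": "string"} for col in header],
    }

sample = "id,name,city\n1,Alice,Paris\n2,Bob,Lyon\n"
schema = nominal_schema_from_header(sample)

# A processor along the lines of the hypothetical AddSchemaAsAttribute could
# attach this JSON text to the flow file for downstream readers/writers.
schema_text = json.dumps(schema)
print(schema_text)
```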
>>>> Then again, since new features would go into 1.3.0, hopefully
>>>> NIFI-3921 will be implemented by then, rendering all this moot :)
>>>>
>>>> Regards,
>>>> Matt
>>>>
>>>> [1] https://issues.apache.org/jira/browse/NIFI-3921
>>>>
>>>> On Fri, May 19, 2017 at 12:25 PM, Joe Gresock <[email protected]> wrote:
>>>>> I've tried a couple of different configurations of CSVReader /
>>>>> JsonRecordSetWriter with the QueryRecord processor, and I don't think
>>>>> I quite have the usage down yet.
>>>>>
>>>>> Can someone give a basic example configuration for the following two
>>>>> scenarios? I followed most of Matt Burgess's response to the post
>>>>> titled "How to use ConvertRecord Processor", but I don't think it
>>>>> tells the whole story.
>>>>>
>>>>> 1) QueryRecord, converting CSV to JSON, using only the CSV headers to
>>>>> determine the schema. (I tried selecting "Use String Fields From
>>>>> Header" in CSVReader, but the processor really seems to want to use
>>>>> the JsonRecordSetWriter to determine the schema.)
>>>>>
>>>>> 2) QueryRecord, converting CSV to JSON, using a cached Avro schema. I
>>>>> imagine I need to use InferAvroSchema here, but I'm not sure how to
>>>>> cache it in the AvroSchemaRegistry. Also not quite sure how to
>>>>> configure the properties of each controller service in this case.
>>>>>
>>>>> Any help would be appreciated.
>>>>>
>>>>> Joe
>>>>>
>>>>> --
>>>>> I know what it is to be in need, and I know what it is to have plenty.
>>>>> I have learned the secret of being content in any and every situation,
>>>>> whether well fed or hungry, whether living in plenty or in want. I can
>>>>> do all this through him who gives me strength. *-Philippians 4:12-13*
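Outside NiFi, Joe's scenario 1 (CSV to JSON, with the schema taken only from the headers and every value kept as a string) can be approximated in a few lines. This is a sketch of the behavior, not NiFi code:

```python
import csv
import io
import json

csv_text = "id,name,city\n1,Alice,Paris\n2,Bob,Lyon\n"

# DictReader takes the first row as the field names; every value stays a
# string, mirroring the "Use String Fields From Header" access strategy.
records = list(csv.DictReader(io.StringIO(csv_text)))
json_out = json.dumps(records)
print(json_out)
```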
