Hello,

If your data is only CSV, you might want to look at the ValidateCsv processor. Using the QueryRecord processor would also give you options to validate your data with your own constraints.
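
To make the constraint itself concrete, here is a minimal sketch in plain Python, outside NiFi. It assumes the "35 characters" rule from the original question means "at most 35 characters per field", and it also rejects rows whose column count differs from the header. The function name `split_valid_invalid`, the constant `MAX_FIELD_LENGTH`, and the over-long third sample row are made up for illustration; this is not how any NiFi processor is implemented.

```python
import csv
import io

MAX_FIELD_LENGTH = 35  # assumption: "35 characters" is read as an upper bound


def split_valid_invalid(csv_text, max_len=MAX_FIELD_LENGTH):
    """Return (header, valid_rows, invalid_rows) for CSV text with a header line."""
    # skipinitialspace drops the blank that follows each comma in the sample file
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    header = next(reader)
    valid, invalid = [], []
    for row in reader:
        if not row:
            continue  # ignore empty lines
        # Valid only if the row has one value per header column and
        # every value is at most max_len characters long.
        ok = len(row) == len(header) and all(len(v) <= max_len for v in row)
        (valid if ok else invalid).append(row)
    return header, valid, invalid


# First two data rows come from CityCode.csv in the question;
# the third is an invented row that breaks the length rule.
sample = """ID, CITY_NAME, ZIP_CD, STATE_CD
1, Delhi, 110001, DL
2, Mumbai, 400001, MH
3, ThisCityNameIsDefinitelyLongerThanThirtyFiveCharacters, 600001, TN
"""

header, valid, invalid = split_valid_invalid(sample)
print(len(valid), "valid,", len(invalid), "invalid")
```

The two lists correspond to what would be routed to the success and failure paths in the flow; writing the invalid rows to a file such as InvalidRecords.csv would be a separate step.
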
Pierre

2018-06-18 14:52 GMT+02:00 Bryan Bende <[email protected]>:

> Hello,
>
> In general you probably want to take a look at the "record" processors
> which will offer a more efficient way of performing this task without
> needing to split to 1 message per flow file.
>
> The flow with the record processors would probably be GetFile ->
> ConvertRecord (using CsvReader and AvroWriter) -> PublishKafkaRecord
>
> Regarding your specific questions...
>
> 1) All split processors write a standard set of "fragment" attributes
> which you can read about in the documentation of the processor. The
> fragment.identifier will be a unique id for the overall flow file and
> then fragment.index will be the index of the split within the given
> fragment.identifier.
>
> 2) I think you will need to write a custom script or processor for
> this validation part. I suppose there could be a generic
> ValidateFieldLength processor, but it doesn't seem like a common case,
> and it only applies to fields that are strings which is a small
> sub-set of the possible types.
>
> -Bryan
>
> On Mon, Jun 18, 2018 at 12:41 AM, Dave <[email protected]> wrote:
> > Hi,
> >
> > I am learning NiFi.
> >
> > I have created an input csv (CityCode.csv) file as below:
> > ID, CITY_NAME, ZIP_CD, STATE_CD
> > 1, Delhi, 110001, DL
> > 2, Mumbai, 400001, MH
> > 3, Chennai, 600001, TN
> > 4, Bangalore, 560001, KA
> >
> > This is my 1st dataflow. I am building it block by block and I am
> > planning to create a dataflow like this.
> > GetFile -> InferAvroSchema -> SplitText -> ConvertCSVToAvro -> ExtractText
> >     -> if error Put in Kafka
> >     -> if success put in DB
> > I might add a few more functionalities in between to strengthen my
> > knowledge.
> >
> > InitialFlow.jpg
> > <http://apache-nifi-developer-list.39713.n7.nabble.com/file/t1006/InitialFlow.jpg>
> >
> > I have created the dataflow till ConvertCSVToAvro. I have a few queries
> > in the flow till now
> >
> > I use the GetFile processor to take a csv file from a directory
> > D:\ApacheNiFi\source-data. If GetFile is successful, then the flow
> > moves to “CreateInferAvroSchema”
> > In the InferAvroSchema processor, the flow is configured as below:
> >
> > • Schema Output Destination - flowfile-attribute
> > • Input Content Type - CSV
> > • CSV Header Definition -
> > • Get CSV Header Definition From Data - true
> > • CSV Header Line Skip Count – 1
> > • CSV delimiter – .
> > • CSV Escape String - /
> > • CSV Quote String – ‘
> > • Pretty Avro Output - true
> > • Avro Record Name - CityCode
> > • Number of Records To Analyze - 10
> > • Charset – UTF8
> >
> > Scheduling
> > Scheduling Strategy - Timer Driven, Concurrent Tasks – 1,
> > Run Schedule – 0 sec
> > Settings
> > • I have checked original Relationship to Automatically Terminate
> >   Relationships because I am not able to understand what exactly is
> >   this relationship
> > • Failure & Unsupported content – Put in file in directory
> >   “D:\ApacheNiFi\error-data”
> > • Success – SplitText
> >
> > The reason why I used SplitText processor before InferAvroSchema
> > processor is that the schema processor is not able to capture records
> > which are only failure but send the whole file and add an attribute
> > “error” to failed records. In one specific post, it was recommended to
> > first split the records and then convert to avro
> > https://stackoverflow.com/questions/41840726/nifi-convertcsvtoavro-how-to-capture-the-failed-records
> >
> > In the SplitText processor, the flow is configured as below:
> > Line Split Count - 1
> > Header Line Count - 1 (This I have kept as 1 because I have a header
> > in my file)
> > Remove Trailing Newlines - true
> >
> > Splits - It flows to next processor “ConvertCSVToAvro”
> > Original - I have created a PutFile processor and am storing the file
> > in a directory by name "D:\ApacheNiFi\processed-data".
> > Failure - I am routing it to the same processor
> >
> > 1st question:
> > Is it possible that we can attach some kind of an attribute to
> > distinguish every record that is split. For eg. Is it possible to
> > attach some unique ID to each record as an attribute to make it unique?
> > If yes, how can I do that? Are there any instructions or material
> > available that will help me to add an attribute? I tried to add an
> > “UpdateAttribute” processor to check if I can achieve this, but could
> > not find anything related.
> >
> > 2nd question:
> > I also need to check if the input string in each field of the record is
> > of 35 characters. Only then it should execute the “Split” relation.
> > Else the record should be routed to failure.
> >
> > Any guidance will be very helpful. I hope I am not sounding very
> > stupid.
> >
> > If there is any material for me to practise these kinds of activities
> > like validating based on some conditions or mentioning a filename for
> > capturing error records like "InvalidRecords.csv" in the folder
> > mentioned in the PutFile processor. Everything seems so confusing and I
> > am not able to find enough material to learn this.
> >
> > Thanks for your patience and time
> >
> > Thanks
> > Dave
> >
> > --
> > Sent from: http://apache-nifi-developer-list.39713.n7.nabble.com/
